Posted on 2023-10-11, 20:46. Authored by Aaron R. Flanagan and Frank G. Glavin.
Raman spectra are an example of high-dimensional data for which
the number of available samples is often limited. This is a primary concern when
Deep Learning frameworks are developed for tasks such as chemical
species identification, quantification, and diagnostics. Open-source
data are difficult to obtain and often sparse; furthermore, collecting
and curating new spectra requires expertise and resources. Deep
generative modeling utilizes Deep Learning architectures to approximate
high-dimensional distributions and aims to generate realistic synthetic
data. Evaluation of both the synthetic data and the performance of the
deep models is usually conducted on a per-task basis, providing no
indication of any wider increase in robustness or generalization.
In this study, we compare the benefits and limitations of a standard
statistical approach to data synthesis (weighted blending) with a popular
deep generative model, the Variational Autoencoder. Two binary data sets
are each divided into three folds to simulate small, limited sample sizes.
Synthetic data distributions are created per fold using the two methods
and then used to augment the training of two Deep Learning algorithms: a
Convolutional Neural Network and a Fully-Connected Neural Network. The goal
of this study is to observe the trends in learning as synthetic data are
continually added to the training data in increasingly large batches.
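As a rough illustration of the weighted-blending baseline, the sketch below (Python with NumPy) forms each synthetic spectrum as a convex combination of two randomly chosen spectra from the same class. The function name `blend_spectra` and the uniform sampling of the blending weight are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def blend_spectra(X, y, n_synthetic, seed=None):
    """Generate synthetic spectra by weighted blending.

    Each synthetic sample is a convex combination w*x_i + (1-w)*x_j of
    two spectra drawn from the same class, so every synthetic point lies
    on a line segment between two real samples in feature space.
    NOTE: an illustrative sketch; the authors' exact weighting scheme
    may differ.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    X_syn = np.empty((n_synthetic, X.shape[1]))
    y_syn = np.empty(n_synthetic, dtype=y.dtype)
    for k in range(n_synthetic):
        label = rng.choice(classes)              # pick a class at random
        idx = np.flatnonzero(y == label)         # samples of that class
        i, j = rng.choice(idx, size=2, replace=False)
        w = rng.uniform()                        # blending weight in [0, 1)
        X_syn[k] = w * X[i] + (1.0 - w) * X[j]   # convex combination
        y_syn[k] = label
    return X_syn, y_syn
```

Blending only within a class keeps the synthetic labels trivially valid, and the synthetic points remain inside the convex hull of the source data, which is one reason such a statistical baseline can track the source distribution closely.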
To determine the impact of each synthetic method, Principal Component
Analysis and the discrete Fréchet distance are used to visualize, and to
measure the distance between, the source and synthetic distributions,
while balanced accuracy, a Machine Learning metric suited to imbalanced
data, is used to evaluate classification performance.
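To make the evaluation tools named above concrete, here is a minimal sketch: PCA projection for visualization, the discrete Fréchet distance (the classic Eiter-Mannila dynamic program) between the projected source and synthetic sets, and balanced accuracy for the classifiers. Treating the two PCA-projected sample sets as ordered point sequences is an assumption made for illustration, and all data below are random stand-ins, not results from the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import balanced_accuracy_score

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two point sequences
    (Eiter-Mannila dynamic program); rows of P and Q are points."""
    n, m = len(P), len(Q)
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):                      # first column
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):                      # first row
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1],
                               ca[i, j - 1]), d[i, j])
    return ca[-1, -1]

# Random stand-ins for source/synthetic spectra and classifier outputs.
rng = np.random.default_rng(0)
X_src = rng.normal(size=(40, 500))             # 40 spectra, 500 wavenumbers
X_syn = X_src + rng.normal(scale=0.1, size=X_src.shape)

pca = PCA(n_components=2).fit(X_src)           # fit on the source data only
dist = discrete_frechet(pca.transform(X_src), pca.transform(X_syn))
print(f"discrete Frechet distance: {dist:.3f}")

y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")
```

Balanced accuracy averages the per-class recalls, so unlike plain accuracy it is not inflated by a majority class on imbalanced data.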