Evaluation of Multivariate Classification Models for Analyzing NMR Metabolomics Data
journal contribution
posted on 2019-08-22, 16:35 authored by Thao Vu, Parker Siemek, Fatema Bhinderwala, Yuhang Xu, Robert Powers
Analytical techniques such as NMR and mass spectrometry can generate
large metabolomics data sets containing thousands of spectral features
derived from numerous biological observations. Multivariate data analysis
is routinely used to uncover the underlying biological information
contained within these large metabolomics data sets. This is typically
accomplished by classifying the observations into groups (e.g., control
versus treated) and by identifying associated discriminating features.
There are a variety of classification models to select from, which
include some well-established techniques (e.g., principal component
analysis [PCA], orthogonal projection to latent structure [OPLS],
or partial least-squares projection to latent structures [PLS]) and
newly emerging machine learning algorithms (e.g., support vector machines
or random forests). However, it is unclear which classification model,
if any, is an optimal choice for the analysis of metabolomics data.
Herein, we present a comprehensive evaluation of five common classification
models that are routinely employed in the metabolomics field and are also
currently available in our MVAPACK metabolomics software package.
Simulated and experimental NMR data sets with various levels of group
separation were used to evaluate each model. Model performance was
assessed by classification accuracy rate, by the area under a receiver
operating characteristic (AUROC) curve, and by the identification
of true discriminating features. Our findings suggest that the five
classification models perform equally well with robust data sets.
Only when the models are stressed with subtle data set differences
does OPLS emerge as the best-performing model. Even with limited group
separation, OPLS maintained a high prediction accuracy rate and a large area
under the ROC curve while yielding loadings closest to the true loadings.
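
The first two performance measures named above, classification accuracy rate and AUROC, can be illustrated with a minimal Python sketch. This is not the MVAPACK implementation or the authors' evaluation code. The labels, scores, and the 0.5 decision threshold are hypothetical values chosen only for illustration, and AUROC is computed here with the standard pairwise (Mann-Whitney) formulation.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive observation
    receives a higher score than a randomly chosen negative one,
    with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 0 = control, 1 = treated (not taken from the study).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.10, 0.35, 0.40, 0.80, 0.30, 0.65, 0.70, 0.90]

# Assumed decision threshold of 0.5 to turn scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("AUROC:", auroc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes how well the scores rank the two groups across all thresholds, which is one reason the two measures are often reported together.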