ci5b00570_si_001.zip (479.82 kB)
Improved Chemical Structure–Activity Modeling Through Data Augmentation
dataset
posted on 28.12.2015, 00:00 by Isidro Cortes-Ciriano, Andreas Bender

Extending the original training data with simulated unobserved data points has proven a powerful way to increase both the generalization ability of predictive models and their robustness against changes in the structure of the data (e.g., systematic drifts in the response variable) in diverse areas such as the analysis of spectroscopic data and the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σnoise) applied to (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them.
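A minimal sketch of this augmentation scheme in Python follows, assuming the descriptors X and responses y are numpy arrays; the function name augment and its perturb and seed parameters are illustrative choices, not part of the original study.

import numpy as np

def augment(X, y, n_rep, sigma_noise, perturb="both", seed=0):
    """Replicate (X, y) n_rep times and perturb each copy with Gaussian
    noise (mean 0, s.d. sigma_noise) on the descriptors, the response,
    both, or neither."""
    rng = np.random.default_rng(seed)
    X_aug, y_aug = [X.astype(float)], [y.astype(float)]
    for _ in range(n_rep):
        X_rep = X.astype(float)  # fresh copy for each replication
        y_rep = y.astype(float)
        if perturb in ("descriptors", "both"):
            X_rep += rng.normal(0.0, sigma_noise, size=X.shape)
        if perturb in ("response", "both"):
            y_rep += rng.normal(0.0, sigma_noise, size=y.shape)
        X_aug.append(X_rep)
        y_aug.append(y_rep)
    return np.vstack(X_aug), np.concatenate(y_aug)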
The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM with a radial kernel) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently increased predictive power on the test set by 10–15%. Injecting noise into (i) the compound descriptors or (ii) both the compound descriptors and the pIC50 values led to the largest drop in RMSEtest values (from 0.67–0.72 to 0.60–0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is reached when the training data are replicated once. Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely the increase in (i) model complexity, due to the need to optimize σnoise, and (ii) the number of training examples.
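As an illustration of the evaluation protocol, the following sketch (reusing the augment function above) compares RMSE on a held-out test set for models trained with N = 0 and N = 1. Here scikit-learn's RandomForestRegressor stands in for the RF model of the study, the data are synthetic placeholders, and σnoise would in practice be tuned, e.g., by cross-validation.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))                           # placeholder descriptors
y = X[:, :5].sum(axis=1) + rng.normal(0, 0.3, size=200)  # placeholder pIC50 values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for n_rep in (0, 1):
    X_a, y_a = augment(X_tr, y_tr, n_rep=n_rep, sigma_noise=0.1)
    model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_a, y_a)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"N = {n_rep}: RMSE_test = {rmse:.3f}")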