Quantitative Structure–Retention Relationship Models To Support Nontarget High-Resolution Mass Spectrometric Screening of Emerging Contaminants in Environmental Samples
Dataset posted on 07.06.2016, 00:00 by Reza Aalizadeh, Nikolaos S. Thomaidis, Anna A. Bletsou, Pablo Gago-Ferrero
Over the past decade, the application of liquid chromatography-high resolution mass spectrometry (LC-HRMS) has grown extensively due to its ability to analyze a wide range of suspected and unknown compounds in environmental samples. However, various criteria, such as mass accuracy and isotopic pattern of the precursor ion, MS/MS spectra evaluation, and retention time plausibility, should be met to reach a certain identification confidence. In this context, a comprehensive workflow based on computational tools was developed to understand the retention time behavior of a large number of emerging contaminants. Two extensive data sets were built for two chromatographic systems, one for positive and one for negative electrospray ionization mode, containing retention time information for 528 and 298 compounds, respectively, to expand the applicability domain of the developed models. The data sets were then split into training and test sets, employing k-nearest neighborhood clustering, to build the models and validate their internal and external predictive ability. The best subset of molecular descriptors was selected using genetic algorithms. Multiple linear regression, artificial neural networks, and support vector machines were used to correlate the selected descriptors with the experimental retention times. Several validation techniques were used to measure the accuracy and precision of the models, including the Golbraikh–Tropsha acceptable model criteria, a Euclidean-based applicability domain, the modified correlation coefficient (r_m²), and concordance correlation coefficient values. The best linear and nonlinear models for each data set were derived and used to predict the retention times of suspect compounds from a wide-scope survey, which served as the evaluation data set.
For efficient outlier detection and interpretation of the origin of the prediction error, a novel procedure and tool was developed and applied, making it possible to determine whether a suspect compound was within the applicability domain.
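
Two of the validation ideas named above, the concordance correlation coefficient and a Euclidean-distance applicability domain, can be illustrated with a minimal Python sketch. This is not the authors' tool or procedure: the function names, the standardization of descriptors, and the 95% quantile cutoff are assumptions made only for illustration. The concordance formula is Lin's standard definition.

```python
import numpy as np


def concordance_correlation(observed, predicted):
    """Lin's concordance correlation coefficient between two 1-D arrays."""
    x = np.asarray(observed, dtype=float)
    y = np.asarray(predicted, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    # Population covariance and variances (divide by n), as in Lin's definition.
    covariance = np.mean((x - mean_x) * (y - mean_y))
    return 2.0 * covariance / (x.var() + y.var() + (mean_x - mean_y) ** 2)


def inside_euclidean_domain(train_descriptors, query_descriptors, quantile=0.95):
    """Flag query compounds that lie within a distance cutoff of the training centroid.

    Assumption (not from the source): descriptors are standardized with the
    training mean and standard deviation, and the cutoff is the given quantile
    of the training compounds' own distances to the centroid.
    """
    train = np.asarray(train_descriptors, dtype=float)
    query = np.atleast_2d(np.asarray(query_descriptors, dtype=float))
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant descriptors
    # After standardization the training centroid sits at the origin.
    train_distance = np.linalg.norm((train - mean) / std, axis=1)
    cutoff = np.quantile(train_distance, quantile)
    query_distance = np.linalg.norm((query - mean) / std, axis=1)
    return query_distance <= cutoff
```

In such a sketch, a suspect compound flagged as outside the domain would have its predicted retention time treated with caution rather than used to accept or reject a candidate structure.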