Detection of Outliers in Projection-Based Modeling
Journal contribution posted on 13.01.2020, 15:36 by Oxana Ye. Rodionova, Alexey L. Pomerantsev
Previously, we introduced an approach for calculating the full object distance in the framework of Principal Component Analysis that can be applied to data exploration and classification. Now, a similar approach has been developed for regression problems, in which a total distance can be calculated for every sample in projection modeling. Based on the total distance, a threshold for outlier detection has been developed by means of a data-driven estimation of the degrees of freedom and scaling parameters for the partial distances in the projection models. This joint threshold is used as the basis for a sequential outlier detection procedure. The iterative nature of the procedure helps to overcome the masking effect among outliers, and a backward step eliminates swamping effects. Two real examples are used for illustration. The first dataset represents capsules filled with specially prepared mixtures of an active pharmaceutical ingredient and a number of excipients. This dataset is used to illustrate the behavior of possible outliers in the regression model and their corresponding locations in the X- and XY-distance plots. The second dataset consists of spectra of 135 whole wheat samples used for the prediction of protein, gluten, and moisture content. This dataset is used to demonstrate the step-by-step application of the sequential procedure for outlier detection.
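The idea of a total distance with a data-driven threshold can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not state the estimators used. The sketch assumes that each partial distance follows a scaled chi-squared distribution, that the scale and degrees of freedom are estimated by the plain method of moments, and that the chi-squared quantile is taken from the Wilson-Hilferty approximation. The names `moment_estimates`, `chi2_quantile`, `total_distances`, `flag_outliers`, and the parameter `gamma` are hypothetical.

```python
from statistics import NormalDist, mean, pvariance


def moment_estimates(d):
    """Method-of-moments fit of d ~ (d0 / n) * chi2(n) (an assumed model).

    Under this model E[d] = d0 and Var[d] = 2 * d0**2 / n.
    """
    d0 = mean(d)
    n = 2.0 * d0 ** 2 / pvariance(d)
    return d0, n


def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared quantile (not exact)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * k)
    return k * (1.0 - a + z * a ** 0.5) ** 3


def total_distances(h, v, h0, nh, v0, nv):
    """Combine two partial distances into one total distance per sample."""
    return [nh * hi / h0 + nv * vi / v0 for hi, vi in zip(h, v)]


def flag_outliers(h, v, gamma=0.01):
    """Flag samples whose total distance exceeds a joint threshold.

    gamma is the assumed probability that at least one regular sample
    out of len(h) is wrongly flagged.
    """
    h0, nh = moment_estimates(h)
    v0, nv = moment_estimates(v)
    c = total_distances(h, v, h0, nh, v0, nv)
    p = (1.0 - gamma) ** (1.0 / len(h))
    crit = chi2_quantile(p, nh + nv)
    return [ci > crit for ci in c], crit
```

Plain moment estimates are themselves inflated by extreme samples, so a single pass of this kind can miss outliers. That is consistent with the abstract's motivation for an iterative procedure, which this sketch does not attempt to reproduce.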