# Detection of Outliers in Projection-Based Modeling

journal contribution

posted on 13.01.2020, 15:36 by Oxana Ye. Rodionova, Alexey L. Pomerantsev

Previously, we introduced an approach for calculating the
full object distance in the framework of Principal Component Analysis
that can be applied to data exploration and classification. Now, a
similar approach has been developed for regression problems in which
a total distance can be calculated for every sample in projection
modeling. Based on the total distance, a threshold for outlier detection
has been developed by means of a data-driven estimation of the degrees
of freedom and scaling parameters for the partial distances in the
projection models. A joint threshold is used as a basis for a sequential
outlier detection procedure. The iterative nature of the procedure
helps to overcome the masking effect among outliers, and a backward step
eliminates swamping effects. Two real examples are used for illustration.
The first dataset represents capsules filled with specially prepared
mixtures of an active pharmaceutical ingredient and a number of excipients.
This dataset is used to illustrate the behavior of possible outliers
in the regression model and their corresponding locations in the X-
and XY-distance plots. The second dataset consists of spectra of 135
whole wheat samples used for the prediction of protein, gluten, and
moisture content. This dataset is used for a demonstration of the
step-by-step application of the sequential procedure for outlier detection.
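
The sequential procedure summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: that each partial distance follows a scaled chi-squared distribution, that its scaling parameter and degrees of freedom are estimated by the method of moments, that the total distance is the sum of the rescaled partial distances, that the joint threshold is a chi-squared quantile at level (1 - gamma)^(1/I) for I samples, and that this quantile is approximated with the Wilson-Hilferty formula. The function names (`moment_estimates`, `sequential_outliers`, and so on) and the toy data are hypothetical.

```python
from statistics import NormalDist, mean, pvariance


def moment_estimates(d):
    """Method-of-moments scale d0 and degrees of freedom n, assuming d ~ (d0 / n) * chi2(n)."""
    d0 = mean(d)
    var = pvariance(d)
    if var == 0:
        raise ValueError("distances are constant; degrees of freedom are undefined")
    return d0, max(1, round(2 * d0 * d0 / var))


def chi2_quantile(p, k):
    """Wilson-Hilferty approximation to the chi-squared quantile (approximate, not exact)."""
    a = 2.0 / (9.0 * k)
    return k * (1.0 - a + NormalDist().inv_cdf(p) * a ** 0.5) ** 3


def fit(rows):
    """Estimate (d0, n) for each partial distance; rows hold one tuple per sample."""
    return [moment_estimates(col) for col in zip(*rows)]


def total_distance(row, params):
    """Assumed total distance: sum of n * d / d0 over the partial distances."""
    return sum(n * d / d0 for d, (d0, n) in zip(row, params))


def threshold(params, n_samples, gamma):
    """Joint threshold for n_samples objects at overall significance gamma."""
    k = sum(n for _, n in params)
    return chi2_quantile((1.0 - gamma) ** (1.0 / n_samples), k)


def sequential_outliers(rows, gamma=0.01, min_samples=5):
    active = list(range(len(rows)))
    removed = []
    # Forward steps: re-estimate the parameters, then drop the single worst
    # sample if it exceeds the threshold. Repeating this counters masking.
    while len(active) > min_samples:
        params = fit([rows[i] for i in active])
        crit = threshold(params, len(active), gamma)
        worst = max(active, key=lambda i: total_distance(rows[i], params))
        if total_distance(rows[worst], params) <= crit:
            break
        active.remove(worst)
        removed.append(worst)
    # Backward step: re-test removed samples against the final estimates and
    # keep as outliers only those still above the threshold (counters swamping).
    params = fit([rows[i] for i in active])
    crit = threshold(params, len(active), gamma)
    return sorted(i for i in removed if total_distance(rows[i], params) > crit)


# Toy data: 30 regular samples with two partial distances each, plus two gross outliers.
rows = [(0.5 + 0.15 * (i % 7), 0.4 + 0.2 * (i % 5)) for i in range(30)]
rows += [(20.0, 20.0), (25.0, 18.0)]
print(sequential_outliers(rows))
```

Removing one sample per iteration and re-estimating the parameters each time is what lets the second extreme sample stand out once the first is gone. The simplified backward step here only re-tests removed samples once against the final estimates, without refitting afterwards.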