Posted on 2020-05-01, 21:43
Authored by David C. L. Handler, Paul A. Haynes
We
randomly selected 100 journal articles published in five proteomics
journals in 2019 and manually examined each of them against a set
of 13 criteria concerning the statistical analyses used, all of which
were based on items mentioned in the journals’ instructions
to authors. These included questions such as whether a pilot study
was conducted and whether false discovery rate calculation was employed
at either the quantitation or identification stage. These data were
then transformed to binary inputs, analyzed via machine learning algorithms,
and classified accordingly, with the aim of determining if clusters
of data existed for specific journals or if certain statistical measures
correlated with each other. We applied a variety of classification
methods, including principal component analysis, agglomerative
clustering, and multinomial and Bernoulli naïve Bayes classification
and found that none of these could readily determine journal identity
given the extracted statistical features. Logistic regression was useful
in identifying strong correlations between statistical features,
such as false discovery rate criteria and multiple testing correction
methods, but was similarly ineffective at determining correlations
between statistical features and specific journals. This meta-analysis
highlights that there is a very wide variety of approaches being used
in statistical analysis of proteomics data, many of which do not conform
to published journal guidelines, and that, contrary to implicit assumptions
in the field, there are no clear correlations between statistical methods
and specific journals.
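
As an illustration of the kind of analysis described above, the following is a minimal sketch of a Bernoulli naïve Bayes classifier applied to binary criteria. It is not the authors' code and does not use their data: the feature matrix is synthetic and random, the even split of 20 articles per journal is an assumption, and the names `fit_bernoulli_nb`, `predict`, and `accuracy` are hypothetical helpers written for this example. Because the synthetic features are generated independently of the journal label, held-out accuracy would be expected to stay near chance, which mirrors the type of negative result the abstract reports.

```python
import math
import random
from collections import defaultdict

N_CRITERIA = 13  # the abstract describes 13 yes/no criteria per article


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class log-priors and smoothed per-feature P(x_j = 1 | class)."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        log_prior = math.log(n / len(X))
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        probs = [(sum(r[j] for r in rows) + alpha) / (n + 2 * alpha)
                 for j in range(len(rows[0]))]
        model[label] = (log_prior, probs)
    return model


def predict(model, row):
    """Return the class with the highest posterior log-probability."""
    def log_posterior(label):
        log_prior, probs = model[label]
        return log_prior + sum(
            math.log(p) if x else math.log(1.0 - p)
            for x, p in zip(row, probs)
        )
    return max(model, key=log_posterior)


def accuracy(model, X, y):
    """Fraction of rows whose predicted class matches the true label."""
    return sum(predict(model, row) == label for row, label in zip(X, y)) / len(y)


# Synthetic stand-in data (assumption): 100 articles, 5 journals with
# 20 articles each, and binary features drawn independently of the journal.
rng = random.Random(0)
journals = [i % 5 for i in range(100)]
features = [[rng.randint(0, 1) for _ in range(N_CRITERIA)] for _ in journals]

# Hold out every fourth article for testing.
train = [i for i in range(100) if i % 4 != 0]
test = [i for i in range(100) if i % 4 == 0]
model = fit_bernoulli_nb([features[i] for i in train],
                         [journals[i] for i in train])
held_out_accuracy = accuracy(model,
                             [features[i] for i in test],
                             [journals[i] for i in test])
print(f"held-out accuracy: {held_out_accuracy:.2f} (chance is 0.20)")
```

With real data, `features` would be replaced by the 100 × 13 matrix of manually scored criteria and `journals` by the journal of each article. A single fixed split is used here only for brevity; cross-validation would give a more stable estimate on a sample of this size.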