Posted on 2024-02-06, 19:37. Authored by Son Gyo Jung, Guwon Jung, Jacqueline M. Cole
Machine learning (ML) methods can train a model to predict material
properties by exploiting patterns in materials databases that arise
from structure–property relationships. However, the importance
of ML-based feature analysis and selection is often neglected when
creating such models. Such analysis and selection are especially important
when dealing with multifidelity data, because such data give rise to a complex
feature space. This work shows how a gradient-boosted statistical
feature-selection workflow can be used to train predictive models
that classify materials by their metallicity and predict their band
gap against experimental measurements, as well as computational data
that are derived from electronic-structure calculations. These models
are fine-tuned via Bayesian optimization, using solely the features
that are derived from chemical compositions of the materials data.
We test these models against experimental data, computational data, and a combination
of the two. We find that the multifidelity
modeling option can reduce the number of features required to train
a model. The performance of our workflow is benchmarked against state-of-the-art
algorithms, and the results show that our approach performs
comparably to or better than them. The classification model
realized an accuracy score of 0.943, a macro-averaged F1-score of
0.940, an area under the receiver operating characteristic
curve of 0.985, and an average precision of 0.977, while the regression
model achieved a mean absolute error of 0.246, a root-mean-square
error of 0.402, and an R² of 0.937. This
illustrates the efficacy of our modeling approach and highlights the
importance of thorough feature analysis and judicious selection over
a “black-box” approach to feature engineering in ML-based
modeling.
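
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how accuracy, macro-averaged F1, mean absolute error, root-mean-square error, and R² are conventionally defined. It is not the authors' evaluation code. The function names and the toy labels and band-gap values are hypothetical and chosen only for illustration. The area under the receiver operating characteristic curve and the average precision are omitted because they require predicted scores rather than hard labels.

```python
import math


def accuracy(labels_true, labels_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(labels_true, labels_pred)) / len(labels_true)


def macro_f1(labels_true, labels_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels_true) | set(labels_pred))
    pairs = list(zip(labels_true, labels_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data, NOT taken from the paper.
metal_true = ["metal", "metal", "nonmetal", "nonmetal"]
metal_pred = ["metal", "nonmetal", "nonmetal", "nonmetal"]
gap_true = [0.0, 1.0, 2.0, 3.0]
gap_pred = [0.5, 1.0, 1.5, 3.0]

print("accuracy:", accuracy(metal_true, metal_pred))
print("macro F1:", macro_f1(metal_true, metal_pred))
print("MAE:", mean_absolute_error(gap_true, gap_pred))
print("RMSE:", root_mean_square_error(gap_true, gap_pred))
print("R2:", r2_score(gap_true, gap_pred))
```

The macro average weights each class equally regardless of how many samples it contains, which is why it is often reported alongside accuracy when the classes may be imbalanced.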