Bacterial promoters play a crucial role in gene expression by serving
as docking sites for the transcription initiation machinery. However,
accurately identifying promoter regions in bacterial genomes remains
a challenge owing to the diversity and variability of promoter architecture. In this
study, we propose MLDSPP (Machine Learning and Duplex Stability based
Promoter prediction in Prokaryotes), a machine learning-based promoter
prediction tool, to comprehensively screen bacterial promoter regions
in 12 diverse genomes. We leveraged biologically relevant and informative
DNA structural properties, such as DNA duplex stability and base stacking,
and state-of-the-art machine learning (ML) strategies to gain insights
into promoter characteristics. We evaluated several machine learning
models, including Support Vector Machines, Random Forests, and XGBoost,
and assessed their performance using accuracy, precision, recall,
specificity, F1 score, and Matthews correlation coefficient (MCC). Our findings reveal that XGBoost
outperformed the other models and current state-of-the-art promoter prediction
tools, namely Sigma70pred and iPromoter2L, achieving F1 scores above 95%
in most systems. Notably, one-hot encoding of the
nucleotide sequences complements these structural features, enhancing
our XGBoost model’s predictive capabilities. To address the
challenge of model interpretability, we incorporated explainable AI
techniques based on Shapley values, which make the model’s predictions
easier to understand and interpret.
In conclusion, our study presents MLDSPP as a novel, generic tool
for predicting promoter regions in bacteria, using original downstream
sequences as nonpromoter controls. This tool has the potential to
significantly advance the field of bacterial genomics and contribute
to our understanding of gene regulation in diverse bacterial systems.
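
As an illustration of two ingredients named above, the following minimal Python sketch shows one common way to one-hot encode a nucleotide sequence and to compute the F1 score and MCC from confusion-matrix counts. It is not the MLDSPP implementation: the function names one_hot_encode and f1_and_mcc are hypothetical, and the handling of ambiguous bases (encoded as all zeros) is an assumption made only for this example.

```python
import math

NUCLEOTIDES = "ACGT"  # column order of the encoding (an assumption of this sketch)


def one_hot_encode(sequence):
    """Return a flat one-hot vector with four entries per base."""
    encoded = []
    for base in sequence.upper():
        # Bases outside A/C/G/T (e.g. N) become four zeros in this sketch.
        encoded.extend(1 if base == nt else 0 for nt in NUCLEOTIDES)
    return encoded


def f1_and_mcc(tp, fp, tn, fn):
    """Compute F1 score and Matthews correlation coefficient from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


if __name__ == "__main__":
    print(one_hot_encode("TATAAT"))
    print(f1_and_mcc(tp=8, fp=2, tn=7, fn=3))
```

Unlike accuracy, MCC uses all four confusion-matrix counts and stays informative when promoter and nonpromoter classes are imbalanced, which is one reason it is often reported alongside F1.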