Composite Machine Learning Strategy for Natural Products Taxonomical Classification and Structural Insights

Title:
Composite Machine Learning Strategy for Natural Products Taxonomical Classification and Structural Insights
Journal Title:
Digital Discovery
Publication Date:
23 September 2024
Citation:
Xu, Q., Tan, A. K. X., Guo, L., Lim, Y. H., Tay, D. W. P., Ang, S. J. (2024). Composite Machine Learning Strategy for Natural Products Taxonomical Classification and Structural Insights. Digital Discovery. https://doi.org/10.1039/d4dd00155a
Abstract:
Taxonomical classification of natural products (NPs) can assist in genomic and phylogenetic analysis of source organisms and facilitate streamlining of bioprospecting efforts. Here, a composite machine learning strategy marrying graph convolutional neural networks (GCNNs) and eXtreme Gradient Boosting (XGBoost) is proposed and validated for taxonomical classification of NPs in five kingdoms (Animalia, Bacteria, Chromista, Fungi, and Plantae). Our composite model, trained on 133,092 NPs from the LOTUS database, achieved five-fold cross-validated classification accuracy of 97.4%. When employed to classify out-of-sample NPs from the NP Atlas database, accuracies of 82.8% for bacteria and 86.6% for fungi were obtained. Dimensionality-reduced representations of the molecular embeddings from our composite model revealed distinct clusters of natural products that suggest a basis for enhanced classification performance. The top critical substructures from the NPs of each kingdom were also identified and compared to provide insights on structure-taxonomy relationships. Overall, this study showcases the potential of composite machine learning models for robust taxonomical classification of NPs, which can streamline discovery of NPs.
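Illustrative note on the evaluation protocol: the abstract reports a five-fold cross-validated accuracy over five kingdom labels. The Python below is a minimal sketch of how such a figure can be computed, not the authors' implementation. The abstract does not spell out how the GCNN and XGBoost stages are joined, so the sketch assumes only a generic "embedding vector in, kingdom label out" classifier. The nearest-centroid model is a stand-in for the real classifier, the two-dimensional synthetic points are a stand-in for molecular embeddings, and every function name is hypothetical.

```python
import random
from collections import defaultdict

KINGDOMS = ["Animalia", "Bacteria", "Chromista", "Fungi", "Plantae"]


def toy_embeddings(per_class=20, seed=0):
    """Synthetic 2-D 'embeddings' (assumption: NOT real data), one cluster per kingdom."""
    rng = random.Random(seed)
    features, labels = [], []
    for j, kingdom in enumerate(KINGDOMS):
        for _ in range(per_class):
            # Cluster centres are 10 units apart; noise is at most 1 unit per axis.
            features.append((10.0 * j + rng.uniform(-1, 1), rng.uniform(-1, 1)))
            labels.append(kingdom)
    return features, labels


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def fit_nearest_centroid(features, labels):
    """Stand-in classifier (assumption): predict the label of the closest class mean."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(features, labels):
        entry = sums[label]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    centroids = {lab: (sx / n, sy / n) for lab, (sx, sy, n) in sums.items()}

    def predict(point):
        return min(
            centroids,
            key=lambda lab: (point[0] - centroids[lab][0]) ** 2
            + (point[1] - centroids[lab][1]) ** 2,
        )

    return predict


def cross_validated_accuracy(features, labels, fit, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the k accuracies."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(model(features[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k
```

Calling `cross_validated_accuracy(*toy_embeddings(), fit_nearest_centroid)` returns the mean held-out accuracy across the five folds. A real evaluation would replace the toy data with embeddings of actual molecules and the stand-in with the trained classifier; stratifying by label keeps each fold's kingdom proportions close to those of the full set, which matters when kingdoms are unevenly represented.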
License type:
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Funding Info:
This research / project is supported by the Agency for Science, Technology and Research (A*STAR), Singapore - SIBER 2.0
Grant Reference no. : C233017006

This research / project is supported by the National Research Foundation, Singapore - SGUnited Jobs Initiative
Grant Reference no. : P20J3d1014
ISSN:
2635-098X
Files uploaded:
d4dd00155a.pdf (928.19 KB, PDF)