American Chemical Society
ma3c01378_si_001.pdf (5.2 MB)

Machine Translation between BigSMILES Line Notation and Chemical Structure Diagrams

journal contribution
posted on 2023-12-18, 20:00 authored by Michael E. Deagen, Bérenger Dalle-Cort, Nathan J. Rebello, Tzyy-Shyang Lin, Dylan J. Walsh, Bradley D. Olsen
The representation of chemical structure forms a core component of polymer science, yet the chemical structure diagrams used to convey such information lack the machine processability vital for automating analysis, managing abundant data, and harnessing the potential of informatics. On the other hand, the usage of BigSMILES language, a machine-readable representation of polymer chemical structure, requires specialized knowledge of its grammar and syntax. Here, the algorithmic translation between chemical structure diagrams and BigSMILES line notation is demonstrated, providing seamless interconversion to and from the lingua franca of polymer chemists across a broad array of polymer architectures (e.g., copolymers, graft and segmented polymers, star polymers, macrocycles, networks, ladder polymers). Serialization from structure diagram into BigSMILES line notation is accomplished by parsing the contents of a connection table and iteratively assembling string representations of the molecular graph and its substructures. Deserialization from BigSMILES line notation into a structure diagram involves parsing the line notation string into a stochastic graph representation, from which a valid graph traversal defines a representative sequence of substructural units comprising the connection table (i.e., structure diagram). These algorithms were validated through round-trip translation on a curated set of 300 polymer structure diagrams, demonstrating semantic preservation of the molecular graph in over 99% of cases and visually equivalent structure diagrams in 38% of cases. The 2D layout, an isometry of the atomic coordinates generated by the CoordGen library within RDKit, shows the applicability of readily available atomic layout generation algorithms while revealing specific areas in which to improve these layout algorithms for polymers; for example, 60% of test cases could be rectified by orienting backbone atoms in an extended configuration along a horizontal axis.
Implemented in JavaScript, this software offers facile integration with web-based resources and forms an essential interface between informatics and the broader polymer research community. By enabling humans and machines to process vast amounts of polymer chemical structural data, this work aims to democratize access to polymer informatics and foster increasingly interdisciplinary approaches to polymer research.
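
As a rough illustration of the first step of deserialization described above (parsing the line notation string before any graph is built), the following Python sketch splits a BigSMILES string into its top-level stochastic objects and lists the repeat units and end groups of each. It is not the authors' JavaScript implementation: the function names `split_top_level` and `parse_stochastic_objects` are hypothetical, the bonding-descriptor pattern is simplified, and SMILES atoms are left unparsed. It assumes only the basic BigSMILES layout, in which a stochastic object is written in curly braces with a terminal bonding descriptor at each end, repeat units separated by commas, and optional end groups following a semicolon.

```python
import re

# Simplified bonding descriptor (an assumption of this sketch):
# [], [$], [<], [>], optionally followed by an index such as [$1].
_DESCRIPTOR = r"\[(?:[$<>]\d*)?\]"


def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside (), [] or {}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def _parse_body(body):
    """Parse the text between the braces of one stochastic object."""
    left = re.match(_DESCRIPTOR, body)
    right = re.search(_DESCRIPTOR + r"$", body)
    if left is None or right is None or right.start() < left.end():
        raise ValueError(f"missing terminal bonding descriptor in {body!r}")
    inner = body[left.end():right.start()]
    # Repeat units come before the first top-level ';', end groups after it.
    sections = split_top_level(inner, ";")
    units = [u for u in split_top_level(sections[0], ",") if u]
    ends = []
    if len(sections) > 1:
        ends = [e for e in split_top_level(sections[1], ",") if e]
    return {
        "left": left.group(),
        "right": right.group(),
        "repeat_units": units,
        "end_groups": ends,
    }


def parse_stochastic_objects(bigsmiles):
    """Return one dict per top-level {...} stochastic object in the string."""
    objects, depth, start = [], 0, None
    for i, ch in enumerate(bigsmiles):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced '}'")
            if depth == 0:
                objects.append(_parse_body(bigsmiles[start + 1:i].strip()))
    if depth != 0:
        raise ValueError("unbalanced '{'")
    return objects


if __name__ == "__main__":
    # A random copolymer with two repeat units and two end groups.
    for obj in parse_stochastic_objects("{[][$]CC[$],[$]CC(CC)[$];[$][H],[$]C[]}"):
        print(obj)
```

Nested stochastic objects (for example, grafts) stay embedded in the repeat-unit strings of their parent, so a fuller parser would recurse into each unit. Building the stochastic graph and choosing a traversal, as the abstract describes, would follow from this output.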