American Chemical Society

Integrating Data Mining and Natural Language Processing to Construct a Melting Point Database for Organometallic Compounds

journal contribution
posted on 2024-10-01, 14:34 authored by Jinyoung Jeong, Taehyun Park, JunHo Song, Seungpyo Kang, Joonghee Won, Jungim Han, Kyoungmin Min
As semiconductor devices are miniaturized, the importance of atomic layer deposition (ALD) technology is growing. When designing ALD precursors, it is important to consider the melting point, because the precursors should have melting points lower than the process temperature. However, obtaining melting point data is challenging due to experimental sensitivity and high computational costs. As a result, a comprehensive and well-organized melting point database for organometallic compounds (OMCs) has not yet been reported. Therefore, in this study, we constructed a database of melting points for 1,845 OMCs, covering 58 metal and 6 metalloid elements. The database contains CAS numbers, molecular formulas, and structural information and was constructed through automatic extraction and systematic curation. The melting point information was obtained in two ways: 1) 1,434 materials were collected from 11 chemical vendor databases, and 2) 411 materials were identified through natural language processing (NLP) techniques, with an accuracy of 86.3%, from 2,096 scientific papers published over the past 29 years. In our database, the OMCs contain up to around 250 atoms and have melting points that range from −170 to 1610 °C. The main source is the Chemsrc database, accounting for 607 materials (32.9%), and Fe is the most common central metal or metalloid element (15.0%), followed by Si (11.6%) and B (6.7%). To demonstrate the utility of the constructed database, a multimodal neural network model integrating graph-based and feature-based information as descriptors was developed to predict the melting points of the OMCs, although it achieved only moderate performance. We believe the current approach reduces the time and cost associated with manual data collection and processing, contributing to effective screening of potentially promising ALD precursors and providing crucial information for the advancement of the semiconductor industry.
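
The abstract does not specify how the NLP extraction step works. As a rough illustration only, the following Python sketch shows how melting point values might be pulled from sentences in the literature using a simple rule-based pattern. This regular-expression approach is a stand-in chosen for brevity and is not the authors' pipeline; the pattern, the function names, and the example sentences are all hypothetical.

```python
import re

# Hypothetical pattern (an assumption, not the authors' method):
# "melting point", "mp", or "m.p." followed by a value or a range in °C.
MP_PATTERN = re.compile(
    r"\b(?:melting\s+point|m\.?\s?p\.?)\s*(?:of|is|was|:|=)?\s*"
    r"(?P<low>[-−]?\d+(?:\.\d+)?)"
    r"(?:\s*(?:-|–|to)\s*(?P<high>[-−]?\d+(?:\.\d+)?))?"
    r"\s*°\s?C",
    re.IGNORECASE,
)


def to_float(value: str) -> float:
    """Convert a numeric string to float, accepting the Unicode minus sign."""
    return float(value.replace("−", "-"))


def extract_melting_points(text: str) -> list[tuple[float, float]]:
    """Return (low, high) melting points in °C for each match in the text.

    A single reported value is returned as a degenerate range (low == high).
    """
    results = []
    for match in MP_PATTERN.finditer(text):
        low = to_float(match.group("low"))
        high = to_float(match.group("high")) if match.group("high") else low
        results.append((low, high))
    return results


if __name__ == "__main__":
    # Made-up example sentences, for illustration only.
    print(extract_melting_points("Orange crystals were obtained (mp 172-174 °C)."))
    print(extract_melting_points("Melting point: −35 °C."))
```

A rule-based pattern like this misses many phrasings and cannot link a value to the compound it describes, which is one reason extracted records generally need the kind of systematic curation the abstract mentions.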
