ci050329+_si_001.xls (251 kB)
Harvesting Chemical Information from the Internet Using a Distributed Approach: ChemXtreme
dataset
posted on 2006-03-27, 00:00 authored by M. Karthikeyan, S. Krishnan, Anil Kumar Pandey, Andreas Bender

The Internet is a comprehensive resource of chemical information which is at the same time largely
unstructured. It provides a wealth of scientific information such as experimental data and requires a suitable
automated data mining and analysis tool for its meaningful exploration. The Java-based software presented
here, ChemXtreme, is developed for harvesting chemical information from the Internet employing the Google
API in combination with a distributed client/server text analysis architecture based on Java RMI. It represents
the first and, until now, the only open-source toolkit for automated structured data retrieval from the
Internet. ChemXtreme employs the “search the search engine” strategy, where the URLs returned
from the search engine are analyzed further via textual pattern analysis. This process resembles the manual
analysis of the hit list, where relevant data are captured and, by means of human intervention, are mined
into a format suitable for further analysis. ChemXtreme, on the other hand, transforms chemical information
automatically into a structured format suitable for storage in databases and further analysis and also provides
links to the original information source. The query data retrieved from the search engine by the server is
encoded, encrypted, and compressed and then sent to all the participating active clients in the network for
parsing. Relevant information identified by the clients on the retrieved Web sites is sent back to the server,
verified, and added to the database for data mining and further analysis. The distributed analysis of
URLs in a client/server architecture scales very favorably, producing only minimal overhead.
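
The two steps described above, packing data for transfer to clients and textual pattern analysis of retrieved pages, can be illustrated with a minimal Java sketch. It is not ChemXtreme's code: the class name HarvestSketch, the Hit record layout, the temperature pattern, and the example URL are all hypothetical, chosen only for illustration. The sketch compresses and Base64-encodes a payload and then extracts melting and boiling points into structured records that keep a link to their source. The encryption step and the Java RMI transport mentioned in the abstract are deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch only: every name and the regular expression below are
// assumptions made for this example, not taken from ChemXtreme.
final class HarvestSketch {

    /** One structured record: the property found, its value, and its source. */
    static final class Hit {
        final String property;   // "melting point" or "boiling point"
        final double celsius;    // numeric value as written on the page
        final String sourceUrl;  // link back to the original information source

        Hit(String property, double celsius, String sourceUrl) {
            this.property = property;
            this.celsius = celsius;
            this.sourceUrl = sourceUrl;
        }

        @Override
        public String toString() {
            return property + " = " + celsius + " C (" + sourceUrl + ")";
        }
    }

    // Hypothetical pattern: "melting point of 122.4 °C", "boiling point: 249 C".
    private static final Pattern TEMPERATURE = Pattern.compile(
            "(melting|boiling)\\s+point\\s*(?:of|is|:|=)?\\s*(-?\\d+(?:\\.\\d+)?)\\s*\u00B0?\\s*C\\b",
            Pattern.CASE_INSENSITIVE);

    /** Client side: scan page text and turn every match into a structured record. */
    static List<Hit> extract(String sourceUrl, String pageText) {
        List<Hit> hits = new ArrayList<>();
        Matcher m = TEMPERATURE.matcher(pageText);
        while (m.find()) {
            String property = m.group(1).toLowerCase(Locale.ROOT) + " point";
            hits.add(new Hit(property, Double.parseDouble(m.group(2)), sourceUrl));
        }
        return hits;
    }

    /** Server side: gzip-compress, then Base64-encode (encryption omitted here). */
    static String pack(String payload) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    /** Client side: reverse of pack(). */
    static String unpack(String packed) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(packed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String page = "Benzoic acid has a melting point of 122.4 \u00B0C and a boiling point: 249 C.";
        String received = unpack(pack(page)); // simulate the server-to-client transfer
        for (Hit hit : extract("http://example.org/benzoic-acid", received)) {
            System.out.println(hit);
        }
    }
}
```

Keeping the source URL inside each record mirrors the abstract's point that the structured output retains links to the original information source. A real deployment would need far more robust patterns and a verification step on the server before anything is added to the database.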