ci9500647_si_001.pdf (640.63 kB)

Performance Enhancement of Vector-Based Search Systems: Application to Carbon-13 Nuclear Magnetic Resonance Chemical Shift Prediction

journal contribution
posted on 24.01.1996, 00:00 by Robert C. Schweitzer, Gary W. Small
A database partitioning algorithm is described for use in increasing the speed of vector-based search systems. When implemented with a library of n-dimensional vectors to be searched, the algorithm subdivides the n-dimensional data space into multiple levels of n-dimensional boxes. The vectors lying within each box are catalogued, and the search for the nearest matches to a target vector is reduced to a search of only the box in the data space containing the target vector and the immediate neighboring boxes. The use of multiple layers of boxes serves to reduce the computer memory requirements for implementing the partitioning scheme while still allowing the number of vector comparisons in the search to be minimized. In the work presented, this algorithm is applied to the prediction of carbon-13 nuclear magnetic resonance chemical shifts. A database retrieval system for chemical shifts is implemented based on encoding the chemical environments of carbon atoms into a seven-dimensional vector representation. The chemical shifts of target carbon atoms are estimated by performing a search of the database for carbon environments that match the targets. The performance enhancement afforded by the partitioning algorithm as well as the search accuracy is found to depend on three design variables. Employing a library of 133 533 environment vectors, these variables are optimized for the chemical shift prediction application by performing searches on sets of 39 074 and 3900 test carbons. Based on the relative number of vector comparisons required to implement the searches with and without partitioning, speed increases by factors of 34.1 and 19.9 are realized for searches in which the 5 and 200 nearest matches are found, respectively.
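The box-based search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single level of equally sized boxes rather than the multiple levels described, and the names `box_of`, `build_index`, `search`, and the parameter `box_size` are hypothetical, chosen only for illustration. The sketch catalogues library vectors by the box that contains them, then compares the target only against vectors in its own box and the immediately neighboring boxes, and reports how many vector comparisons were made.

```python
import heapq
import itertools
import math
from collections import defaultdict


def box_of(vector, box_size):
    """Map a vector to the integer coordinates of the box that contains it."""
    return tuple(math.floor(x / box_size) for x in vector)


def build_index(library, box_size):
    """Catalogue the library by box: box coordinates -> list of library indices."""
    index = defaultdict(list)
    for i, vec in enumerate(library):
        index[box_of(vec, box_size)].append(i)
    return index


def search(library, index, box_size, target, k):
    """Find up to k nearest matches using only the target's box and its neighbors.

    Returns (matches, n_comparisons), where matches is a list of
    (Euclidean distance, library index) pairs sorted nearest first.
    """
    home = box_of(target, box_size)
    candidates = []
    # Offsets of -1, 0, +1 along every axis cover the home box and
    # all immediately adjacent boxes (3**n boxes for n dimensions).
    for offset in itertools.product((-1, 0, 1), repeat=len(target)):
        box = tuple(h + o for h, o in zip(home, offset))
        candidates.extend(index.get(box, ()))
    scored = [(math.dist(target, library[i]), i) for i in candidates]
    return heapq.nsmallest(k, scored), len(scored)


# Toy two-dimensional library (made-up values, for illustration only).
library = [(0.1, 0.1), (0.9, 0.2), (1.4, 1.1), (5.0, 5.0), (5.2, 4.9)]
index = build_index(library, box_size=1.0)
matches, n_comparisons = search(library, index, 1.0, target=(0.5, 0.5), k=2)
print(matches, n_comparisons, "of", len(library), "vectors compared")
```

In this simplified form, any library vector closer to the target than one box width is guaranteed to lie in the searched boxes, so the result is exact whenever the k-th match falls within that distance; otherwise a nearer vector may exist outside the neighborhood. The ratio of the library size to `n_comparisons` corresponds to the kind of relative comparison count on which the reported speed increases are based.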