Efficient simultaneous matching of multiple SMARTS using the ChemAxon toolkits
The use of SMARTS patterns for pattern matching has become ubiquitous in cheminformatics, and efficient implementations exist for identifying one or more instances of a user-defined substructure in a molecular graph. However, a very common usage of this functionality is in applications that test a number of patterns against each molecule. Examples include filtering of desirable/undesirable properties, atom typing, descriptor and physical property calculation, pharmacophore perception, feature-based fingerprint generation and IUPAC name generation. In these use-cases, current practice is typically to match each of the (SMARTS) patterns independently and sequentially.
This work describes significantly more efficient algorithms for matching multiple patterns simultaneously. Just as chemical database search systems use fingerprint pre-screens to optimize searching a single pattern against multiple molecules, pre-processing analysis and pattern compilation can be used to optimize matching multiple patterns against a single target molecule. The process takes a set of patterns and generates a Java class that performs this matching using the ChemAxon toolkit. This approach often makes the match run-time independent of the number of patterns, enabling software applications that require matching of thousands or tens of thousands of patterns.
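The idea of sharing work across patterns can be illustrated with a deliberately small sketch. It is not the generated code described in this abstract and it does not use any ChemAxon API: the `Atom` and `Pattern` records, and the methods `compile`, `matchSequentially` and `matchCompiled`, are hypothetical names introduced only for illustration. Patterns here are reduced to single-atom constraints, and the pre-processing step simply groups them by required atomic number so that this one test is evaluated once per atom rather than once per pattern.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical and are not ChemAxon classes.
class MultiPatternSketch {

    // A toy atom: atomic number, aromatic flag, number of heavy-atom neighbours.
    record Atom(int atomicNumber, boolean aromatic, int degree) {}

    // A toy single-atom pattern; a null field means "unconstrained".
    record Pattern(int atomicNumber, Boolean aromatic, Integer degree) {
        // Tests that remain once the atomic number is already known to agree.
        boolean residualMatch(Atom a) {
            return (aromatic == null || aromatic == a.aromatic())
                && (degree == null || degree == a.degree());
        }
        boolean matches(Atom a) {
            return atomicNumber == a.atomicNumber() && residualMatch(a);
        }
    }

    // Baseline: every pattern is tested independently against the molecule.
    static BitSet matchSequentially(List<Pattern> patterns, List<Atom> molecule) {
        BitSet hits = new BitSet(patterns.size());
        for (int i = 0; i < patterns.size(); i++) {
            for (Atom a : molecule) {
                if (patterns.get(i).matches(a)) { hits.set(i); break; }
            }
        }
        return hits;
    }

    // Pre-processing: group pattern indices by the atomic number they require.
    static Map<Integer, List<Integer>> compile(List<Pattern> patterns) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < patterns.size(); i++) {
            index.computeIfAbsent(patterns.get(i).atomicNumber(),
                                  k -> new ArrayList<>()).add(i);
        }
        return index;
    }

    // One pass over the atoms; only patterns sharing the atom's element are tried.
    static BitSet matchCompiled(List<Pattern> patterns,
                                Map<Integer, List<Integer>> index,
                                List<Atom> molecule) {
        BitSet hits = new BitSet(patterns.size());
        for (Atom a : molecule) {
            for (int i : index.getOrDefault(a.atomicNumber(), List.of())) {
                if (!hits.get(i) && patterns.get(i).residualMatch(a)) hits.set(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = List.of(
            new Pattern(6, true, null),   // aromatic carbon
            new Pattern(6, false, 4),     // aliphatic carbon with four neighbours
            new Pattern(7, null, 3),      // nitrogen with three neighbours
            new Pattern(8, false, 1));    // terminal, non-aromatic oxygen
        List<Atom> molecule = List.of(    // heavy atoms of a phenol-like toy input
            new Atom(6, true, 3), new Atom(6, true, 2), new Atom(6, true, 2),
            new Atom(6, true, 2), new Atom(6, true, 2), new Atom(6, true, 2),
            new Atom(8, false, 1));
        BitSet slow = matchSequentially(patterns, molecule);
        BitSet fast = matchCompiled(patterns, compile(patterns), molecule);
        System.out.println(slow + " " + fast + " agree=" + slow.equals(fast));
    }
}
```

In this toy setting the indexed matcher never touches the nitrogen pattern, because no atom in the input has atomic number 7. Real SMARTS patterns describe multi-atom substructures, so a practical compiler would have to share far richer sub-tests than a single element check; the sketch only conveys why cost need not grow linearly with the number of patterns.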
Performance figures will be presented showing that using these methods with the ChemAxon toolkits significantly improves the performance of real-world applications, such as generation of 166-bit MACCS keys, over traditional sequential methods. Possible applications to patent indexing and searching will also be discussed.