Extreme Search Speed-Ups in JChem PostgreSQL Cartridge 2.7

Posted on May 19th, 2017 at 12:46 pm by Krisztina Vajda, Róbert Wagner and Tamás Csizmazia

  • New index type: sortedchemindex
  • Massive speed-up in duplicate and similarity searches using sortedchemindex
  • Extra speed-ups in substructure search when only the top hits are retrieved
  • Up to 60-fold speed-up for typical joined queries

Results of duplicate (DUP) and similarity (SIM) search benchmarks in JChem Oracle Cartridge (JOC) version 17.3.6, JChem PostgreSQL Cartridge (JPC) 2.4, and JPC 2.7 (see technical details in footnote 1). All tables are indexed, and JPC 2.7 uses the new sortedchemindex type; with this index type, the most similar hits are returned first.
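As a rough sketch of what creating such an index might look like, the statement below uses the standard PostgreSQL CREATE INDEX ... USING form with the new index type. The table and column names (pbch_8m, mol) are borrowed from the benchmark queries in footnote 4, the index name is hypothetical, and the exact syntax and any index options are assumptions that should be checked against the JPC documentation.

```sql
-- Sketch only: assumes the JPC extension is installed and that
-- pbch_8m.mol is a chemical structure column.
-- The index name pbch_8m_mol_sorted_idx is hypothetical.
CREATE INDEX pbch_8m_mol_sorted_idx
    ON pbch_8m
    USING sortedchemindex (mol);
```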

Please note that for these search types the search speed is more or less independent of the query.

Substructure search benchmarks were run on “rare” and “frequent” query sets (see footnote 2), with the hits ordered by relevance (see footnote 3).

When a query has many hits, it may be worth retrieving only the first ones (the top 500 in the benchmark).
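A minimal sketch of such a top-hits query, reusing the substructure operator, table, and query structure from the JPC benchmark query in footnote 4. LIMIT is standard PostgreSQL; the ORDER BY mol clause is an assumption about how the relevance ordering described in footnote 3 is requested, so the exact form should be confirmed in the JPC documentation.

```sql
-- Sketch only: return at most the first 500 substructure hits.
-- The |<| operator, the pbch_8m table, and the query structure come
-- from footnote 4.
-- ORDER BY mol is an assumed way of asking for relevance order
-- (footnote 3: results can be ordered by the chemical structures).
SELECT mol
FROM pbch_8m
WHERE 'Clc1ccccc1' |<| mol
ORDER BY mol
LIMIT 500;
```

Capping the result set this way lets the database stop once enough hits are found, which is where the extra speed-up for “frequent” queries would be expected to come from.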

Joined queries (see footnote 4) can also run faster, depending on the plan chosen by the PostgreSQL query planner.

Visit the demo site and try this out now.


Footnotes

1. [Target set: 8M structures, PubChem, Query set: small fragments and druglike molecules, Similarity search: retrieved only the 100 most similar structures]
2. [rare: few hits; frequent: many possible hits after screening phase, many returned hits]
3. [Starting from JPC 2.7, the result set can be ordered directly by the chemical structures – most relevant hits come first.]
4. [Benchmark queries: JOC – select count(*) from pbch_8m where jc_compare(mol, 'Clc1ccccc1', 't:s') = 1 and molweight < 120;
JPC – select count(*) from pbch_8m where 'Clc1ccccc1' |<| mol and molweight < 120;]