JChem Base Performance Information

 

Scalability

How scalable is JChem Cartridge / JChem Base?

See Lutz Weber's presentation on very large databases (200 million structures) and 2D and 3D similarity searching.

What is the maximum number of structures for JChem Base / JChem Cartridge?

In a single table the maximum number of rows is 2^31-1 (2,147,483,647).
We may increase this in the future if needed.
There is no practical limit for the number of tables.


Hardware requirements

How can I estimate hardware requirements for a JChem database?

  1. Memory:
    JChem may use a significant amount of memory to build a structure cache in memory.
    See "How to estimate the memory need for the Structure Cache?" below for how to estimate the size of the structure cache.

    The total Java memory consumption (heap size) consists of:

    • The size of the structure cache
    • Processing memory for JChem Base: we recommend an additional 200 MB
    • Processing memory for your Java application (built on JChem Base)

    To determine the total RAM requirement, add:

    • Java memory consumption (detailed above)
    • Other programs running on that computer
    • Memory requirement for the operating system

    64-bit systems: On 32-bit systems, a single Java process cannot allocate more than about 2 GB. If your Java memory needs exceed this limit, a 64-bit system is recommended (hardware, operating system, and Java).

    See also the documentation on how to enable the Java Virtual Machine to allocate the necessary amount of memory.

  2. Disk space:

    Besides the number of structures, disk usage also depends on the format of the input file and the type of RDBMS engine used by JChem Base.

    Benchmark results with 1 million structures stored in Oracle (the NCI dataset was multiplied for the test):

    Format   Size of JChem table   Original file size
    SDF      1.2 GB                3.7 GB
    SMILES   270 MB                50 MB

    Note: JChem compresses SDF in the cd_structure field by default (for the NCI dataset, by a factor of about 4.4). This can be disabled (e.g. if the structures need to be displayed directly by non-ChemAxon tools), but the storage size then increases.

  3. Processing power:

    Comparing the performance of different hardware architectures is a complex topic and not the subject of this FAQ.

    Some quick facts:

    • JChem automatically uses all processors during a search
    • The search speed scales well with the number of processors
    • The search time is directly proportional to the size of the table (assuming there is no constraint on the number of hits - see below)

    The benchmarks in the Benchmarks section below can be used as a starting point.

    For the same database the search time greatly depends on the type of query structure. For a very general query (e.g. benzene) there will be a lot of hits, meaning longer execution time, while more specific queries run very fast on the same large database.

    The search time consists of:
    • Screening time: a quick pre-filtering using fingerprints
    • Graph search time: slower, but only performed for the screened compounds

    The following can be stated:

    • The screening time is directly proportional to the size of the database table.
    • The graph search time is directly proportional to the number of screened compounds.
    • The number of hits is roughly proportional to the number of screened compounds.
    • The graph search time is roughly proportional to the number of hits.

    Tip: it is rarely useful to return a huge number of hits (especially for human consumption).
    If the number of hits is limited, only the rapid screening time will increase with table size, which means the total search time will remain almost constant regardless of the table size.

How to estimate the memory need for the Structure Cache?

Typically 1 million drug-like structures consume around 100 MB memory in the structure cache of JChem.
Note: although JChem can drop the least recently used table from the structure cache if low on memory, it is recommended that all structure tables should fit in the cache (as cache loading can take a considerable amount of time). When estimating the memory need simply sum the number of rows in the tables.
The following table shows typical memory needs in a benchmark test:

Test specifications:
Number of molecules: 3,003,012
Fingerprint size: 16*32=512 bits
Average SMILES length per molecule: 37.8
Memory consumption: 277.01 MB
Caching time*: 159.2 seconds

* The test configuration was exactly the same as in the cartridge benchmark.

Memory need increases with the number of molecules, the size of the fingerprints, and the average SMILES string length. The following approximation can be used:

Memory_need [bytes] = Number_of_molecules * (0.5*Average_smiles_length[characters] + Fingerprint_size[bits]/8 + 13.5)

The structure table fingerprint statistics generation function can be used to report the average SMILES length and fingerprint size of a JChem table or JChem index. For more information, see the JChem Manager command line usage (s command), the Cartridge index statistics function, and the Statistics tab of the Instant JChem Schema editor.
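
As a worked check of the approximation above, the following Python sketch evaluates it for the benchmark figures in this section and then adds the heap components listed under Hardware requirements. It is only an illustration: the function names are made up for this example, "MB" is assumed to mean 2^20 bytes, and the 256 MB application figure is a placeholder for a hypothetical application. With these assumptions the cache estimate comes to roughly 276 MB, close to the measured 277.01 MB.

```python
import math

MB = 1024 * 1024  # assumption: "MB" in the benchmark means 2**20 bytes


def structure_cache_bytes(num_molecules, avg_smiles_length, fingerprint_bits):
    """Approximate structure cache size in bytes, using the formula above."""
    per_molecule = 0.5 * avg_smiles_length + fingerprint_bits / 8 + 13.5
    return num_molecules * per_molecule


def java_heap_mb(cache_bytes, app_mb, jchem_processing_mb=200):
    """Heap estimate: cache + JChem Base processing memory + application memory."""
    return cache_bytes / MB + jchem_processing_mb + app_mb


# Benchmark figures from this section: 3,003,012 molecules,
# average SMILES length 37.8, fingerprint size 512 bits.
cache = structure_cache_bytes(3_003_012, 37.8, 512)
print(f"estimated cache: {cache / MB:.1f} MB (measured: 277.01 MB)")

# app_mb=256 is a placeholder for a hypothetical application built on JChem Base.
heap = java_heap_mb(cache, app_mb=256)
print(f"heap estimate: about {math.ceil(heap)} MB (e.g. java -Xmx{math.ceil(heap)}m)")

# A 32-bit JVM tops out at around 2 GB per process.
print("64-bit system recommended:", heap > 2048)
```

When several structure tables are cached, the same function can be applied per table and the results summed, since the recommendation is that all tables fit in the cache at once.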

Benchmarks

How fast is substructure searching in JChem Base?

The following tests demonstrate the speed of substructure search in JChem.
The test configuration was exactly the same as in the cartridge benchmark, and the same query structures were used.

  • A chemical table containing 19.5 million molecules was used.
  • Fingerprints and SMILES were cached by JChem.

Substructure search results, caching used: true

Screening hits*   Screening time* (ms)   Number of hits   Total time (ms)
0                 937                    0                944
0                 1735                   0                1750
2                 902                    2                910
129               925                    129              963
136               994                    79               1084
93                965                    93               988
2947              976                    2853             1198
224               957                    93               999
846               955                    698              1034
493               941                    489              986
6204              975                    6001             1309
146534            939                    146256           5667
3008441           1389                   2975285          61752
4314114           1466                   3867876          97621

* Screening is a quick search using fingerprints. The results of screening are further checked by the slower, but stricter atom-by-atom search.

How fast is importing in JChem Base?

The following table shows the duration of import in some cases. The configuration was exactly the same as in the cartridge benchmark. The -server JVM option was applied.

Number of structures   Elapsed time (duplicates allowed)   Elapsed time (duplicates not allowed)
10,000                 21 seconds                          25 seconds
100,000                124 seconds                         154 seconds
200,000                264 seconds                         313 seconds

Notes:

  • If duplicates are allowed, time increases linearly with the number of molecules imported.
  • If duplicates are not allowed, JChem performs a search for every molecule to check if it is already in the structure table or not.
  • The table into which the structures were imported always contained 10 initial structures, and statistics had been collected on it.
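
To compare these import figures with your own hardware, it can help to convert them into throughput. The Python sketch below does this arithmetic on the three benchmark rows; the function names are made up for this example and nothing here calls JChem itself.

```python
# (structures, seconds with duplicates allowed, seconds with duplicates not allowed)
IMPORT_BENCHMARK = [
    (10_000, 21, 25),
    (100_000, 124, 154),
    (200_000, 264, 313),
]


def throughput(structures, seconds):
    """Imported structures per second."""
    return structures / seconds


def duplicate_check_overhead(allowed_s, not_allowed_s):
    """Relative extra time spent when duplicates are not allowed."""
    return (not_allowed_s - allowed_s) / allowed_s


for n, allowed_s, not_allowed_s in IMPORT_BENCHMARK:
    print(
        f"{n:>7} structures: "
        f"{throughput(n, allowed_s):.0f}/s with duplicates allowed, "
        f"{throughput(n, not_allowed_s):.0f}/s without, "
        f"overhead {duplicate_check_overhead(allowed_s, not_allowed_s):.1%}"
    )
```

On these rows the duplicate check adds roughly 19 to 24 percent to the import time.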

How fast is searching in JChem Cartridge?

The following table shows the duration of search in JChem Cartridge using the following configuration:

Hardware (purchased in February of 2005):

Intel Quad CPU Q6600 2.40GHz desktop PC, 8GB memory, 2x750GB SATA hard drive in RAID 0 (with write cache enabled)

Software Environment:

  • Operating System:

    OS Name             Linux
    Distribution Name   CentOS release 4.6
    Kernel Version      2.6.9-67.0.7.ELsmp x86_64

  • Oracle Configuration:

    Oracle Version              10.2.0.3
    SGA Max Size (MB)           0 (unset)
    SGA Target (MB)             1536
    DB Buffer Cache Size (MB)   1024
    Shared Pool Size (MB)       160
    Java Pool Size (MB)         160
    Large Pool Size (MB)        16

  • JChem Server Configuration:

    Java Version   1.5.0_15

Target molecule set: PubChem

Number of structures: 19,528,372

Session date: 2009-06-12

Operation Type: t:s earlyResults:2000

Query Structure                                   Number Of Hits   Total Time (ms)   SSS Time (ms)   Screened Count   Screening Time (ms)
Clc1cncc2c(cnnc12)N3CC3                           0                748               734             0                719
C1CN1c2cnnc3c(cncc23)C4=CSC=C4                    0                1499              1487            0                1475
CCSc1c(C=C(C=O)C#N)c2ccccn2c1C(O)=O               2                799               781             2                766
Nc1cc(cc2cc(c(N=N)c(O)c12)S(O)(=O)=O)S(O)(=O)=O   79               782               765             136              729
Oc1c(N=N)c(cc2cc(ccc12)S(O)(=O)=O)S(O)(=O)=O      93               786               764             224              725
NN=C1C(=O)NC(=S)N(C1=O)c2ccccc2O                  93               859               835             93               810
O=C1ONC(N1c2ccccc2)-c3ccccc3                      129              849               823             129              796
C(Sc1ncnc2ncnc12)-c3ccccc3                        489              828               786             493              746
COc1ccc2nc3cc(Cl)ccc3cc2c1                        698              839               796             846              728
Cc1cc(C)nc(NS(=O)(=O)c2ccccc2)n1                  2853             1150              1023            2947             829
NC1=CC=NC2=C1C=CC(Cl)=C2                          6001             1326              1189            6204             847
c1ncc2ncnc2n1                                     146256           7185              6665            146534           752
Clc1ccccc1                                        2975285          121449            82646           3008441          1837
O=Cc1ccccc1                                       3867876          172517            131161          4314114          1924

Operation Type: sep=! t:s!ctFilter:(PSA() <= 200) && (rotatableBondCount() <= 10) && (mass() <= 500) && (aromaticRingCount() <= 4)

Query Structure                                   Number Of Hits   Total Time (ms)   SSS Time (ms)   Screened Count   Screening Time (ms)
O=C1ONC(N1c2ccccc2)-c3ccccc3                      122              1020              922             129              783

Operation Type: sep=! t:s!ctFilter:(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10)

Query Structure                                   Number Of Hits   Total Time (ms)   SSS Time (ms)   Screened Count   Screening Time (ms)
O=C1ONC(N1c2ccccc2)-c3ccccc3                      17               1260              941             129              794

Operation Type: > 0.9

Query Structure                                   Number Of Hits   Total Time (ms)   SSS Time (ms)   Screened Count   Screening Time (ms)
Nc1cc(cc2cc(c(N=N)c(O)c12)S(O)(=O)=O)S(O)(=O)=O   0                3375              3333            0                3333
CCSc1c(C=C(C=O)C#N)c2ccccn2c1C(O)=O               0                3380              3339            0                3339
O=C1ONC(N1c2ccccc2)-c3ccccc3                      0                3863              3822            0                3822

The column names have the following meaning:
  • Operation Type: The name of the operator tested.
  • Query Structure: The query tested.
  • Number Of Hits: The number of structures returned by the query.
  • Total Time: The total time spent executing the SQL statement.
  • SSS Time: Fingerprint screening time + atom-by-atom search time.
  • Screened Count: The number of structures left after fingerprint screening as possible candidates meeting the search criteria.
  • Screening Time: The time the fingerprint screening took.
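
Since SSS Time is defined as screening time plus atom-by-atom search time, the atom-by-atom (graph search) share can be recovered by subtraction. The Python sketch below does this for the three queries with the most screened compounds in the table; the function names are made up for this example, and the per-compound figure is simple arithmetic on the published numbers rather than a separate measurement.

```python
# (query, SSS time in ms, screened count, screening time in ms),
# copied from the JChem Cartridge benchmark table in this section.
ROWS = [
    ("c1ncc2ncnc2n1", 6665, 146534, 752),
    ("Clc1ccccc1", 82646, 3008441, 1837),
    ("O=Cc1ccccc1", 131161, 4314114, 1924),
]


def graph_search_ms(sss_time_ms, screening_time_ms):
    """Atom-by-atom search time = SSS time minus fingerprint screening time."""
    return sss_time_ms - screening_time_ms


def per_candidate_us(sss_time_ms, screened_count, screening_time_ms):
    """Average graph search cost per screened compound, in microseconds."""
    return 1000.0 * graph_search_ms(sss_time_ms, screening_time_ms) / screened_count


for query, sss, screened, screening in ROWS:
    print(
        f"{query:<15} graph search {graph_search_ms(sss, screening):>7} ms, "
        f"{per_candidate_us(sss, screened, screening):.1f} us per screened compound"
    )
```

For these rows the cost per screened compound stays within a few tens of microseconds while the screened count varies by more than an order of magnitude, which is consistent with graph search time being roughly proportional to the number of screened compounds.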

Tuning

What can/should I do to make JChem Cartridge searches faster?

You can find a few performance tuning hints here: https://www.chemaxon.com/jchem/doc/admin/CartridgeFAQ.html#jcart_faster
