JChem Base Performance Information
- Scalability
- How scalable is JChem Cartridge / JChem Base?
- What is the maximum number of structures for JChem Base / JChem Cartridge?
- Hardware Requirements
- How to estimate hardware requirements for a JChem database?
- How to estimate the memory need for the Structure Cache?
- Benchmarks
- How fast is substructure searching in JChem Base?
- How fast is importing in JChem Base?
- How fast is searching in JChem Cartridge?
- Tuning
Scalability
How scalable is JChem Cartridge / JChem Base?
What is the maximum number of structures for JChem Base / JChem Cartridge?
In a single table the maximum number of rows is 2^31 - 1 (2,147,483,647).
We may increase this in the future if needed.
There is no practical limit for the number of tables.
Hardware requirements
How can I estimate hardware requirements for a JChem database?
- Memory:
JChem may use a significant amount of memory to build a structure cache in memory.
See "How to estimate the memory need for the Structure Cache?" below for how to estimate the size of the structure cache. The total Java memory consumption (heap size) consists of:
- The size of the structure cache
- Processing memory for JChem Base: we recommend an additional 200 MB
- Processing memory for your Java application (built on JChem Base)
To determine the total RAM requirement, add:
- Java memory consumption (detailed above)
- Other programs running on that computer
- Memory requirement for the operating system
64-bit systems: A single Java process cannot allocate more than about 2 GB on 32-bit systems. If your Java memory needs exceed this limit, a 64-bit system is recommended (including hardware, operating system and Java).
- Disk space:
Disk usage depends not only on the number of structures but also on the format of the input file and the type of RDBMS engine used by JChem Base.
Benchmark result with 1 million structures (the NCI dataset was multiplied for the test), using Oracle as the RDBMS:

| Format | Size of JChem table | Original file size |
|---|---|---|
| SDF | 1.2 GB | 3.7 GB |
| SMILES | 270 MB | 50 MB |

Note: JChem compresses SDF in the cd_structure field by default (for the NCI dataset, to roughly 4.4 times smaller). Compression can be disabled (e.g. if the structures need to be displayed directly by non-ChemAxon tools), but the storage size increases in that case.
- Processing power:
Comparing the performance of different hardware architectures is a complex topic and not the subject of this FAQ.
Some quick facts:
- JChem automatically uses all processors during a search
- The search speed scales well with the number of processors
- The search time is directly proportional to the size of the table (assuming there is no constraint on the number of hits - see below)
The following benchmarks can be used as a starting point:
- JChem Base substructure search speed
- JChem Cartridge substructure search speed
- JChem Base import speed
- JChem Cartridge indexing speed
- JChem Cartridge insert speed
For the same database the search time greatly depends on the type of query structure. For a very general query (e.g. benzene) there will be a lot of hits, meaning longer execution time, while more specific queries run very fast on the same large database.
The search time consists of:
- Screening time: this is a quick pre-filtering using fingerprints
- Graph search: slower, but only performed for the screened compounds
The following can be stated:
- The screening time is directly proportional to the size of the database table.
- The graph search time is directly proportional to the number of screened compounds.
- The number of hits is roughly proportional to the number of screened compounds.
- The graph search time is roughly proportional to the number of hits.
Tip: it is rarely useful to return a huge number of hits (especially for human consumption).
If the number of hits is limited, only the rapid screening time will increase with table size, which means the total search time will remain almost constant regardless of the table size.
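
The proportionality statements above can be turned into a rough cost model. The following is only a sketch: the function name, the `max_candidates` parameter and both per-item cost constants are assumptions made for illustration (they are not JChem settings), and a hit limit is approximated here as a cap on the number of screened compounds that reach the graph search. Real constants would have to be fitted to your own measurements.

```python
def estimate_search_time_ms(table_rows, screened_count,
                            screen_ms_per_row, graph_ms_per_candidate,
                            max_candidates=None):
    """Rough two-phase cost model for one substructure search.

    All parameters are illustrative assumptions, not JChem settings.
    """
    # Screening touches every row, so it scales with the table size.
    screening_ms = table_rows * screen_ms_per_row
    # Graph search only runs on screened compounds; a hit limit is
    # approximated as a cap on how many candidates get examined.
    candidates = screened_count
    if max_candidates is not None:
        candidates = min(candidates, max_candidates)
    return screening_ms + candidates * graph_ms_per_candidate


# Made-up constants, for illustration only.
SCREEN_MS_PER_ROW = 5e-5
GRAPH_MS_PER_CANDIDATE = 0.02

for rows in (1_000_000, 10_000_000, 20_000_000):
    screened = rows // 5  # pretend a very general query passes 20% of the rows
    unlimited = estimate_search_time_ms(
        rows, screened, SCREEN_MS_PER_ROW, GRAPH_MS_PER_CANDIDATE)
    limited = estimate_search_time_ms(
        rows, screened, SCREEN_MS_PER_ROW, GRAPH_MS_PER_CANDIDATE,
        max_candidates=2000)
    print(f"{rows:>10} rows: unlimited {unlimited:9.0f} ms, limited {limited:6.0f} ms")
```

In this model the unlimited search grows with the number of screened compounds, while the limited search grows only by the screening term as the table gets larger.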
How to estimate the memory need for the Structure Cache?
Typically 1 million drug-like structures consume around 100 MB memory in
the structure cache of JChem.
Note: although JChem can drop the least recently used table from the structure
cache if low on memory, it is recommended that all structure tables should fit
in the cache (as cache loading can take a considerable amount of time).
When estimating the memory need simply sum the number of rows in the tables.
The following table shows typical memory needs in a benchmark test:
| Test specifications: | |
|---|---|
| Number of molecules: | 3,003,012 |
| Fingerprint size: | 16*32=512 bits |
| Average SMILES length per molecule: | 37.8 |
| Memory consumption: | 277.01 MB |
| Caching time*: | 159.2 seconds |
* The test configuration was exactly the same as in the cartridge benchmark.
Memory need increases with the number of molecules, the size of fingerprints, and the average SMILES string size. The following approximation can be used:
Memory_need [bytes] = Number_of_molecules * (0.5 * Average_smiles_length [characters] + Fingerprint_size [bits] / 8 + 13.5)
The structure table fingerprint statistics generation function reports the average SMILES length and fingerprint size of a JChem table or JChem index. For more information, see the JChem Manager command line usage (s command), the Cartridge index statistics function, and the Statistics tab of the Instant JChem Schema editor.
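
As a worked example, the approximation can be evaluated for the benchmark table above. This is a minimal sketch: the function names are hypothetical, the 200 MB term is the JChem Base processing memory recommended in the hardware requirements section, and the application memory figure is a placeholder you would replace with your own estimate.

```python
def structure_cache_bytes(num_molecules, avg_smiles_length, fingerprint_bits):
    """Approximate structure cache size using the formula from this FAQ."""
    per_molecule = 0.5 * avg_smiles_length + fingerprint_bits / 8 + 13.5
    return num_molecules * per_molecule


def total_heap_bytes(cache_bytes, application_bytes):
    """Cache + recommended 200 MB for JChem Base + your own application."""
    jchem_processing_bytes = 200 * 2**20
    return cache_bytes + jchem_processing_bytes + application_bytes


# Values taken from the benchmark table above.
cache = structure_cache_bytes(3_003_012, 37.8, 512)
print(f"structure cache: {cache / 2**20:.1f} MiB")

# The 256 MiB application share is a placeholder, not a recommendation.
heap = total_heap_bytes(cache, application_bytes=256 * 2**20)
print(f"suggested heap:  {heap / 2**20:.0f} MiB")
```

For the benchmark values the formula gives roughly 276 MiB, close to the measured 277.01 MB (assuming that figure counts 2^20 bytes per MB).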
Benchmarks
How fast is substructure searching in JChem Base?
The following tests demonstrate the speed of substructure search in
JChem.
The test configuration was exactly the same as in the
cartridge benchmark, and the same query structures
were used.
- A chemical table containing 19.5 million molecules was used.
- Fingerprints and SMILES were cached by JChem.
Substructure search results, caching used: true

| Screening hits* | Screening time* (ms) | Number of hits | Total time (ms) |
|---|---|---|---|
| 0 | 937 | 0 | 944 |
| 0 | 1735 | 0 | 1750 |
| 2 | 902 | 2 | 910 |
| 129 | 925 | 129 | 963 |
| 136 | 994 | 79 | 1084 |
| 93 | 965 | 93 | 988 |
| 2947 | 976 | 2853 | 1198 |
| 224 | 957 | 93 | 999 |
| 846 | 955 | 698 | 1034 |
| 493 | 941 | 489 | 986 |
| 6204 | 975 | 6001 | 1309 |
| 146534 | 939 | 146256 | 5667 |
| 3008441 | 1389 | 2975285 | 61752 |
| 4314114 | 1466 | 3867876 | 97621 |
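
The table can also be used to derive a rough per-compound cost for the graph search phase. The sketch below assumes that everything outside the screening time is graph search (it will also include some result handling overhead), and the helper name is hypothetical. Only the three rows with the most screening hits are used, because for small hit counts the difference is dominated by fixed overhead.

```python
# (screening hits, screening time ms, number of hits, total time ms),
# copied from the three largest rows of the table above.
ROWS = [
    (146534, 939, 146256, 5667),
    (3008441, 1389, 2975285, 61752),
    (4314114, 1466, 3867876, 97621),
]


def graph_cost_us_per_candidate(screened, screening_ms, total_ms):
    """Time spent after screening, divided by the screened compounds (microseconds)."""
    return 1000.0 * (total_ms - screening_ms) / screened


for screened, screening_ms, hits, total_ms in ROWS:
    cost = graph_cost_us_per_candidate(screened, screening_ms, total_ms)
    print(f"{screened:>8} screened -> {cost:.1f} us per screened compound")
```

For these rows the result is on the order of 20 to 30 microseconds per screened compound on the benchmark hardware, which illustrates why general queries with millions of candidates dominate the total time.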
How fast is importing in JChem Base?
The following table shows the duration of import in some cases.
The configuration was exactly the same as in the
cartridge benchmark.
The -server JVM option was applied.
| Number of structures | Elapsed time (duplicates allowed) | Elapsed time (duplicates not allowed) |
|---|---|---|
| 10,000 | 21 seconds | 25 seconds |
| 100,000 | 124 seconds | 154 seconds |
| 200,000 | 264 seconds | 313 seconds |
Notes:
- If duplicates are allowed, time increases linearly with the number of molecules imported.
- If duplicates are not allowed, JChem performs a search for every molecule to check if it is already in the structure table or not.
- The table into which the structures were imported always contained 10 initial structures, and statistics had been collected on it.
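
To compare the two import modes, the measurements above can be converted into throughput figures. This is simple arithmetic on the table values; the helper names are hypothetical and the numbers apply only to the benchmark configuration.

```python
# (number of structures, seconds with duplicates allowed,
#  seconds with duplicates not allowed), from the table above.
IMPORTS = [
    (10_000, 21, 25),
    (100_000, 124, 154),
    (200_000, 264, 313),
]


def structures_per_second(count, seconds):
    """Average import throughput for one run."""
    return count / seconds


def duplicate_check_overhead(allowed_s, not_allowed_s):
    """Relative extra time caused by the duplicate check (0.2 means 20%)."""
    return not_allowed_s / allowed_s - 1.0


for count, allowed_s, not_allowed_s in IMPORTS:
    rate = structures_per_second(count, allowed_s)
    overhead = duplicate_check_overhead(allowed_s, not_allowed_s)
    print(f"{count:>7} structures: {rate:6.0f}/s with duplicates allowed, "
          f"duplicate check adds {overhead:.0%}")
```

In these runs the duplicate check adds roughly 19 to 24 percent to the import time.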
How fast is searching in JChem Cartridge?
The following table shows the duration of search in JChem Cartridge
using
the following configuration:
Hardware (purchased in February of 2005):
Intel Quad CPU Q6600 2.40GHz desktop PC, 8GB memory, 2x750GB SATA hard drive in RAID 0 (with write cache enabled)
Software Environment:
- Operating System:

| OS Name | Distribution Name | Kernel Version |
|---|---|---|
| Linux | CentOS release 4.6 | 2.6.9-67.0.7.ELsmp x86_64 |

- Oracle Configuration:

| Oracle Version | SGA Max Size (MB) | SGA Target (MB) | DB Buffer Cache Size (MB) | Shared Pool Size (MB) | Java Pool Size (MB) | Large Pool Size (MB) |
|---|---|---|---|---|---|---|
| 10.2.0.3 | 0 (unset) | 1536 | 1024 | 160 | 160 | 16 |

- JChem Server Configuration:

| Java Version |
|---|
| 1.5.0_15 |

Target molecule set: PubChem
Number of structures: 19,528,372
Session Date: 2009-06-12

| Operation Type | Query Structure | Number Of Hits | Total Time (ms) | SSS Time (ms) | Screened Count | Screening Time (ms) |
|---|---|---|---|---|---|---|
| t:s earlyResults:2000 | Clc1cncc2c(cnnc12)N3CC3 | 0 | 748 | 734 | 0 | 719 |
| t:s earlyResults:2000 | C1CN1c2cnnc3c(cncc23)C4=CSC=C4 | 0 | 1499 | 1487 | 0 | 1475 |
| t:s earlyResults:2000 | CCSc1c(C=C(C=O)C#N)c2ccccn2c1C(O)=O | 2 | 799 | 781 | 2 | 766 |
| t:s earlyResults:2000 | Nc1cc(cc2cc(c(N=N)c(O)c12)S(O)(=O)=O)S(O)(=O)=O | 79 | 782 | 765 | 136 | 729 |
| t:s earlyResults:2000 | Oc1c(N=N)c(cc2cc(ccc12)S(O)(=O)=O)S(O)(=O)=O | 93 | 786 | 764 | 224 | 725 |
| t:s earlyResults:2000 | NN=C1C(=O)NC(=S)N(C1=O)c2ccccc2O | 93 | 859 | 835 | 93 | 810 |
| t:s earlyResults:2000 | O=C1ONC(N1c2ccccc2)-c3ccccc3 | 129 | 849 | 823 | 129 | 796 |
| t:s earlyResults:2000 | C(Sc1ncnc2ncnc12)-c3ccccc3 | 489 | 828 | 786 | 493 | 746 |
| t:s earlyResults:2000 | COc1ccc2nc3cc(Cl)ccc3cc2c1 | 698 | 839 | 796 | 846 | 728 |
| t:s earlyResults:2000 | Cc1cc(C)nc(NS(=O)(=O)c2ccccc2)n1 | 2853 | 1150 | 1023 | 2947 | 829 |
| t:s earlyResults:2000 | NC1=CC=NC2=C1C=CC(Cl)=C2 | 6001 | 1326 | 1189 | 6204 | 847 |
| t:s earlyResults:2000 | c1ncc2ncnc2n1 | 146256 | 7185 | 6665 | 146534 | 752 |
| t:s earlyResults:2000 | Clc1ccccc1 | 2975285 | 121449 | 82646 | 3008441 | 1837 |
| t:s earlyResults:2000 | O=Cc1ccccc1 | 3867876 | 172517 | 131161 | 4314114 | 1924 |
| sep=! t:s!ctFilter:(PSA() <= 200) && (rotatableBondCount() <= 10) && (mass() <= 500) && (aromaticRingCount() <= 4) | O=C1ONC(N1c2ccccc2)-c3ccccc3 | 122 | 1020 | 922 | 129 | 783 |
| sep=! t:s!ctFilter:(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10) | O=C1ONC(N1c2ccccc2)-c3ccccc3 | 17 | 1260 | 941 | 129 | 794 |
| > 0.9 | Nc1cc(cc2cc(c(N=N)c(O)c12)S(O)(=O)=O)S(O)(=O)=O | 0 | 3375 | 3333 | 0 | 3333 |
| > 0.9 | CCSc1c(C=C(C=O)C#N)c2ccccn2c1C(O)=O | 0 | 3380 | 3339 | 0 | 3339 |
| > 0.9 | O=C1ONC(N1c2ccccc2)-c3ccccc3 | 0 | 3863 | 3822 | 0 | 3822 |

- Operation Type: The name of the operator tested.
- Query Structure: The query tested.
- Number Of Hits: The number of structures returned by the query.
- Total Time: The total time spent executing the SQL statement.
- SSS Time: Fingerprint screening time + atom-by-atom search time.
- Screened Count: The number of structures remaining after the fingerprint screening as possible candidates meeting the search criteria.
- Screening Time: The time the fingerprint screening took.
Tuning
What can/should I do to make JChem Cartridge searches faster?
You can find a few performance tuning hints here: https://www.chemaxon.com/jchem/doc/admin/CartridgeFAQ.html#jcart_faster
