ChemAxon's range of database products includes JChem Base, JChem Cartridge for Oracle and Instant JChem. JChem Base provides the main chemical database intelligence and search engine, and is the basis of the other two products. The cartridge offers an Oracle SQL interface for JChem Base and other ChemAxon products, and Instant JChem is an all-in-one desktop chemical database application. This chapter describes the main concepts of JChem Base, which are therefore also relevant for understanding JChem Cartridge and Instant JChem.



There are other possibilities for invoking substructure searching, which might better suit your demands. For example, JChem Cartridge offers an alternative three-tier architecture.

Instant JChem architecture is described in the Instant JChem documentation.
There are different structure table types available in JChem, depending on the desired structure content. The table type determines the checks at table import and influences certain searching operations on the table.
Compatibility notes: Tables created before JChem version 3.2 will be treated as "Any structures" to maintain previous behavior. The default type for new tables is "Molecules".
Table type can be specified at table or index creation. (See, for example: JChem Manager or index creation in JChem Cartridge.)
Structure tables contain chemical structures and associated data, including both the columns used internally by the JChem system and custom, user-defined data. User-defined data may be any information related to the chemical structure: name, external ID, physico-chemical properties, etc. Any number and type of user-defined columns can be added to JChem tables (within the limits of the underlying RDBMS), and they can be standard (static) or calculated columns. The following columns are used by JChem internally and are added at table creation. User-defined columns can be added at table creation or at any later time.
cd_id
(JDBC type: INTEGER)
Provides a unique identifier of the compound. If no value is specified for cd_id during the insertion of new structures, then the value is incremented automatically. A database index is automatically created for this column at table creation.
cd_structure
(JDBC type: LONGVARBINARY)
Stores the structure in the original input format. It is used for displaying the structure and, in some cases, for searching (only when cd_smiles is not available). MDL Molfiles and SDfiles are stored in compressed Molfile (csmol) form. This compression can be disabled so that the stored structures are directly readable by non-ChemAxon tools. See Setting options in the Administration Guide.
cd_smiles (JDBC type: VARCHAR(1000)), cd_smarts (JDBC type: LONGVARBINARY) or cd_markush (JDBC type: LONGVARBINARY)
These columns store the standardized structure in a compact format, allowing efficient caching and hence fast structure searching. (If this representation of the structure is larger than the maximum length of the column or cannot be represented for any other reason, then NULL is stored and the cd_structure field is used during the search.)
cd_smiles is used for Molecule, Any and Reaction table types, and contains ChemAxon Extended SMILES formatted structures.
cd_smarts is used for the Query table type, and contains ChemAxon Extended SMARTS formatted structures.
cd_markush is used for the combinatorial Markush table type, and contains compressed Marvin documents of the internal Markush representation.
cd_formula
(JDBC type: VARCHAR(100))
The molecular formula of the molecule, e.g. C7H6O2. The atomic symbols are in Hill order: C is listed first, followed by H, followed by the remaining elements in alphabetical order. If the molecular formula is often used for searching, it is advisable to create a database index on this column.
cd_sortable_formula
(JDBC type: VARCHAR(255))
A transformed cd_formula (see above), which allows correct alphanumerical sorting of formulas. (For example, C4H10 should precede C12H26 since 4 is smaller than 12, but simple alphanumerical ordering of the strings would produce the opposite order.) In the sortable formula column, all numbers in the formula are left-padded with zeros to a constant length of 5.
cd_molweight
(JDBC type: DOUBLE or FLOAT)
The molecular weight. If the molecular weight is often used for searching, it is advised to create a database index on this column.
cd_timestamp
(JDBC type: TIMESTAMP)
The date and time of the insertion or the last update of the chemical structure in the row.
cd_hash
(JDBC type: INTEGER)
A hash code of the chemical structure. It is used for duplicate search and in case of full structure search when no query features are specified on the query. It allows a rapid pre-filtering before atom-by-atom search. A database index is automatically created for this column at table creation.
cd_fp1, cd_fp2, cd_fp3, ... cd_fpn
(JDBC type: INTEGER)
The fingerprints of the chemical structures, stored in several INTEGER columns. They contain chemical hashed fingerprints and, optionally, structural keys (if the table is configured that way). Fingerprints are used during substructure and similarity searching in the fast screening phase. For reaction tables, the reaction fingerprint of the reaction structure is stored instead, to allow different reaction similarity search types.
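The zero-padding described for cd_sortable_formula can be illustrated with a minimal Python sketch. This is not JChem code: the function name `sortable_formula` and the constant `PAD_WIDTH` are hypothetical, and the sketch pads only the numbers that are written out in the formula. How JChem treats elements with an implicit count of 1 is not stated here, so that case is left untouched.

```python
import re

PAD_WIDTH = 5  # constant length given in the cd_sortable_formula description


def sortable_formula(formula: str) -> str:
    """Left-pad every number in a molecular formula with zeros to PAD_WIDTH digits.

    Illustrative only: elements without an explicit count are not modified.
    """
    return re.sub(r"\d+", lambda m: m.group(0).zfill(PAD_WIDTH), formula)


formulas = ["C12H26", "C4H10"]

# Plain string ordering compares "1" with "4" and puts C12H26 first.
print(sorted(formulas))

# Ordering by the padded form compares 00004 with 00012 and puts C4H10 first.
print(sorted(formulas, key=sortable_formula))
```

In a database, the same effect is obtained by sorting on the padded column rather than on cd_formula itself.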
The JChem property table contains information about JChem's tables, registration information and further details about the database. Put simply, this table identifies a JChem "environment" or "configuration". The default name of the table is "JChemProperties".
The JChem property table contains key-value pairs, like a property file or configuration file. The JChem Manager and Instant JChem applications and JChem Cartridge create and alter JChem Property Tables automatically. The JChem property table should only be edited by JChem applications or through the JChem API or JChem Cartridge operators/functions.
The property table contains these columns:
| Column Name | Description |
| --- | --- |
| prop_name | The key used to access the value. |
| prop_value | The value of the property. |
| prop_value_ext | Used if the property value is too large for prop_value. |
Only one property value column (either prop_value or prop_value_ext) should be in use at any time. The other column should be null.
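The rule that only one of the two value columns is populated can be sketched on a mock table. The example below uses an in-memory SQLite database purely for illustration; the column types, the property names `example.short` and `example.long`, and the helper `get_property` are assumptions, not part of JChem. A real property table should still be read and modified only through JChem applications, the JChem API or the Cartridge.

```python
import sqlite3

# Mock property table in an in-memory SQLite database (assumed column types).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE JChemProperties ("
    " prop_name TEXT PRIMARY KEY,"
    " prop_value TEXT,"
    " prop_value_ext TEXT)"
)

# Hypothetical keys and values: each row fills exactly one value column.
conn.executemany(
    "INSERT INTO JChemProperties VALUES (?, ?, ?)",
    [
        ("example.short", "42", None),
        ("example.long", None, "x" * 5000),
    ],
)


def get_property(connection, name):
    """Return the value from whichever value column is non-NULL, or None if absent."""
    row = connection.execute(
        "SELECT COALESCE(prop_value, prop_value_ext) "
        "FROM JChemProperties WHERE prop_name = ?",
        (name,),
    ).fetchone()
    return None if row is None else row[0]


print(get_property(conn, "example.short"))
print(len(get_property(conn, "example.long")))
```

Because one column is always NULL, `COALESCE` is enough to read a property without knowing in advance which column holds it.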
Relevant methods for creating a property table, checking for its existence, and adding, setting, or deleting properties can be found in the DatabaseProperties API.
There can be one or more property tables for a database, located under the same or different schemas, if the database supports it. This can be used to create a multiuser database environment.
Duplicate search can be used to decide equality of molecules. It is used during duplicate filter import. All structural features (atom types, isotopes, stereochemistry, query features, etc.) must be the same for two chemical structures to match, but coordinates and dimensionality, for example, are usually ignored.
Nomenclature: In JChem versions prior to 5.2, this search type was called "perfect search". In other cheminformatics toolkits or cartridges this functionality may be called exact structure search.
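The role of a hash code such as cd_hash in duplicate search, a rapid pre-filter before the expensive comparison, can be sketched generically. Everything in this example is a stand-in: `structure_hash` uses CRC32 on a string because the real hash algorithm is not described in this chapter, `expensive_equal` is a placeholder for atom-by-atom comparison, and `DuplicateIndex` is a hypothetical helper class.

```python
import zlib
from collections import defaultdict


def structure_hash(canonical: str) -> int:
    """Stand-in hash code; the real cd_hash algorithm is not described here."""
    return zlib.crc32(canonical.encode("utf-8"))


def expensive_equal(a: str, b: str) -> bool:
    """Placeholder for the atom-by-atom comparison of two structures."""
    return a == b


class DuplicateIndex:
    """Groups stored structures by hash so only one bucket is compared in full."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, cd_id: int, canonical: str) -> None:
        self._buckets[structure_hash(canonical)].append((cd_id, canonical))

    def find_duplicates(self, canonical: str) -> list:
        # Pre-filter: look only at structures sharing the query's hash code.
        bucket = self._buckets.get(structure_hash(canonical), [])
        # Verification: different structures may share a hash code.
        return [cd_id for cd_id, stored in bucket if expensive_equal(stored, canonical)]


index = DuplicateIndex()
index.add(1, "CCO")
index.add(2, "c1ccccc1")
index.add(3, "CCO")
print(index.find_duplicates("CCO"))
```

The verification step stays necessary because equal hash codes do not guarantee equal structures; the hash only narrows the set of candidates.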
Substructure search, the search type chemists are most often interested in, decides whether a molecular structure contains a specific subgraph. Sometimes not only the chemical subgraph is provided, but also certain query features that further restrict the structures to be found. If special molecular features are present on the query (e.g. stereochemistry, charge, etc.), only those targets match which also contain the feature. However, if a feature is missing from the query, it is not required to be missing from the target (by default). For more information, see the JChem Query Guide.
A full structure search finds molecules that are equal (in size) to the query structure. (No additional fragments (e.g. salt) or heavy atoms are allowed.) Molecular features (by default) are evaluated the same way as described above for substructure search.
Nomenclature: This search type was called exact search in JChem versions prior to 5.2, but was renamed to reduce confusion. (Note that this search type is NOT the same as the exact search of several other cheminformatics tools or cartridges, where it is used for finding duplicates. This latter functionality is called duplicate search in our terminology.)
Full fragment search is a combination of substructure and full structure search: the query must fully match a fragment of the target. Other fragments may be present in the target, but they are ignored. This search type is useful for performing a "full structure search" that ignores salts or solvents stored with the main structure in the target.
Similarity search is used to retrieve structurally similar chemical structures. By default, it uses the Tanimoto metric of chemical hashed fingerprints, but other screening configurations are also available through the JChem Screen integration. In this latter case, additional descriptor tables that link to the JChem table need to be added to the database.
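For reference, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below is a generic illustration, not JChem's implementation: fingerprints are represented as Python integers, the function name `tanimoto` is hypothetical, and returning 1.0 for two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")   # bits set in at least one fingerprint
    # Assumed convention: two empty fingerprints are treated as identical.
    return 1.0 if total == 0 else common / total


# Two bits are shared (0b00110) out of four set overall (0b10111).
print(tanimoto(0b10110, 0b00111))
```

A similarity search then keeps the structures whose coefficient against the query fingerprint reaches a chosen threshold.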
The JChem Query Guide describes each search type in more detail.
The tautomer search option instructs the search engine to look for all tautomer forms of the query, as generated by the Marvin plugin Tautomers.
The table below summarizes the vague bond levels.
| Vague bond level | Description |
| --- | --- |
| Level 0 (off) | Does not perform vague bond matching. |
| Level 1 (default) | Handling of 5-membered rings with ambiguous aromaticity |
| Level 2 | All query ring bonds become "or aromatic" |
| Level 3 | All query bonds (ring and chain) become "or aromatic" |
| Level 4 | Ignore all bond types |
This search option specifies how stereochemistry should be evaluated:
These search options specify how different atomic properties should be evaluated. Each of them has three settings. In the following the charge option is described, but all others of these options work the same way:
Searches can include extra conditions formulated in the Chemical Terms language. Chemical Terms is a chemistry language which allows users to formulate complex chemical questions, expressions and rules. Chemical Terms can contain references to functional groups, other structural elements and physico-chemical properties. The filter expressions are evaluated on the fly, but the Chemical Terms calculated columns are used if the column definition is part of the filter expression.
Non-structural conditions can be added to the database search by specifying an SQL statement through the filterQuery property. In case of the Cartridge, another solution is to combine JChem operations with other conditions in the WHERE clause of the SQL SELECT statement.
The JChem Query Guide summarizes all available search options.
To boost the speed of searching, JChem caches fingerprints and structures in the application's memory space. (In the case of a web application, the application is usually an application server. In the case of the Cartridge, it is the JChem server. In rich client applications, including Instant JChem, the structure cache is created on the client machine.)
The structure cache is stored in a static pool, therefore a structure table is only cached once within the same Java Virtual Machine (JVM).
The build-up of the cache can take a considerable amount of time and normally occurs once, when the first search is started.
Depending on the number of molecules in the database, the size of the fingerprints and the average molecule size, structure caching can have significant memory needs. Typically one million drug-like structures consume around 100 MB memory in the structure cache. The JChem FAQ contains more information about this subject.
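As a rough back-of-the-envelope aid, the figure above (around 100 MB per one million drug-like structures) can be scaled to other table sizes. The sketch below assumes memory use grows linearly with the number of structures, which is a simplification: actual needs depend on fingerprint size and average molecule size. The function name `estimate_cache_mb` is hypothetical.

```python
MB_PER_MILLION_STRUCTURES = 100  # typical figure quoted for drug-like structures


def estimate_cache_mb(structure_count: int) -> float:
    """Linear estimate of structure cache size in MB (assumes drug-like molecules)."""
    return structure_count * MB_PER_MILLION_STRUCTURES / 1_000_000


# A table of five million drug-like structures under this assumption.
print(estimate_cache_mb(5_000_000))
```

Treat the result as an order-of-magnitude guide when sizing the memory of the application that hosts the cache.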
When structure tables change between search operations, the structure cache is incrementally updated to ensure minimum overhead.
The section about Chemical Hashed Fingerprints also describes how fingerprints can be optimized for good search performance.
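The fast screening phase that the cached fingerprints support in substructure search can be sketched in general terms: a target can contain the query only if every bit set in the query fingerprint is also set in the target fingerprint. The example below is a generic illustration of fingerprint screening rather than JChem's internal code; the function `may_contain`, the `cache` dictionary and its fingerprint values are hypothetical.

```python
def may_contain(query_fp: int, target_fp: int) -> bool:
    """Screen: every bit set in the query fingerprint must be set in the target."""
    return (query_fp & target_fp) == query_fp


# Hypothetical in-memory cache mapping cd_id to a small bit fingerprint.
cache = {1: 0b101101, 2: 0b001001, 3: 0b111111}
query = 0b001101

# Only the structures that pass the screen go on to atom-by-atom matching.
candidates = [cd_id for cd_id, fp in cache.items() if may_contain(query, fp)]
print(candidates)
```

Passing the screen does not prove a match, since hashed fingerprints can produce false positives, so the surviving candidates still need atom-by-atom verification. The benefit is that most non-matching structures are rejected with a single bitwise operation.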
To ensure that structure search results are correct, the query and the database molecules must share a similar representation. This is achieved automatically through table standardization in JChem databases.
The database molecules are standardized during structure import into a JChem table (and also during structure update). First the original source of the chemical structure is stored in the cd_structure field, which can then be used for displaying and export purposes. The standardized form is then stored in the cd_smiles field in a compact format. This representation is used by the search process. All additional structure-dependent data (fingerprints, molecular weight and formula, Chemical Terms calculated columns) are also calculated from the standardized form. In case of JChem index in the Cartridge, this process is done during index creation (and during structure insert/update in an indexed structure column), and the standardized form is stored within the index.
Query structures are standardized automatically before the search.
There are two types of standardization in the database:
This demo animation shows standardization setup in Instant JChem, and the following figure illustrates the Standardizer configuration builder and an example transformation that can be achieved using Standardizer:

For more information, see the standardization section of the JChem Query Guide.
Calculated columns are automatically computed when a structure is inserted into the structure table or updated. The data to be calculated is defined by a Chemical Terms expression for each calculated column. This language contains many structure-related functions, including the whole range of ChemAxon property calculations.
Calculated columns can be created using Instant JChem, JChem Manager and JChem Cartridge, for both JChem tables and the JChem index.
In case of searching in memory or files, the same approach can be accessed through the use of the tautomer duplicate filtering search option.
This approach uses the dominant tautomer model, which includes an energy (pKa) filter. This filter removes the transformations that are unlikely in solution. Tautomerization also depends on other environment factors: phase (solid / solution), solvent, temperature, etc., but these are not considered in either of our methods.
The following poster has more information about the generation of all, dominant and canonical tautomers: Tautomer generation. pKa based dominance conditions for generating dominant tautomers.
All four methods are suitable for duplicate search, but for substructure search there are different issues:
Therefore, only solutions 2 and 4 are recommended for substructure searching.
Concerning search speed, solutions 1, 3 and 4 are the fastest to search, because all transformations are done at registration time. Solution 2 is much slower to search than all other options.
Registration (indexing) speed is fastest with solution 2 (no registration overhead). Second fastest is solution 1 (little registration overhead). Solutions 3 and 4 are the slowest to register, depending on the complexity of the Standardizer configuration.