JChem chemical database concepts
Version 5.9.4
ChemAxon's range of database products includes JChem Base, JChem Cartridge for Oracle and Instant JChem. JChem Base provides the main chemical database intelligence and search engine, and is the basis of the other two products. The cartridge offers an Oracle SQL interface for JChem Base and other ChemAxon products, and Instant JChem is an all-in-one desktop chemical database application. This chapter describes the main concepts of JChem Base, which are therefore also relevant for understanding JChem Cartridge and Instant JChem.
Contents
- JChem Base architecture
- Table types
- JChem table structure
- JChem property table
- Search types
- Search options
- The structure cache
- Fingerprints, optimization of fingerprint parameters
- Standardizer integration
- Chemical Terms calculated columns
- Handling of tautomers
JChem Base architecture
Web architecture: A typical interaction between a client and the database
- Using a web browser, the user enters a structure into the MarvinSketch applet.
- A custom script (or servlet) for substructure/similarity searching is activated, which
- Connects to a database through JDBC.
- Searches in a table containing structures.
- Creates a list containing the ID numbers of found structures.
- The script retrieves mixed structural and non-structural data by SQL SELECT statements, using the hit ID numbers and tables or views in the database.
- The script creates the page that displays the retrieved data in the client's browser using the MarvinView applet.
- The user manipulates the data, etc.
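The data-retrieval step of the web workflow above (fetching rows by hit ID) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module instead of JDBC, the table name compounds and the columns name and supplier are hypothetical, and the list of hit IDs is a hard-coded stand-in for the result of a real structure search.

```python
import sqlite3


def fetch_hit_rows(conn, hit_ids):
    """Fetch data rows for the given hit IDs, preserving the hit order."""
    if not hit_ids:
        return []
    # One placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in hit_ids)
    sql = (
        "SELECT cd_id, name, supplier FROM compounds "
        f"WHERE cd_id IN ({placeholders})"
    )
    rows = {row[0]: row for row in conn.execute(sql, list(hit_ids))}
    return [rows[i] for i in hit_ids if i in rows]


def demo():
    # Hypothetical table; a real JChem table has more columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compounds "
        "(cd_id INTEGER PRIMARY KEY, name TEXT, supplier TEXT)"
    )
    conn.executemany(
        "INSERT INTO compounds VALUES (?, ?, ?)",
        [(1, "benzoic acid", "A"), (2, "aspirin", "B"), (3, "ethanol", "A")],
    )
    hit_ids = [3, 1]  # stand-in for the ID list produced by the search
    return fetch_hit_rows(conn, hit_ids)
```

Keeping the structure search and the data retrieval as two separate steps mirrors the workflow described above: the search yields only IDs, and ordinary SQL does the rest.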

Rich client architecture: A typical interaction between a client and the database
Another solution is a two-tier architecture, where the client Java or .NET application uses the JChem Base and JDBC APIs to interact with the database. In this case, chemical structure input and output may use the Marvin Sketch and Marvin View bean components embedded into the client application.

Web Services architecture: A typical interaction between a client and the database
The Web Services Server is a two-tier architecture ideal for client-side applications using non-Java languages. The client communicates through the SOAP protocol with the JChem Web Services API, directing the server to manage JDBC calls to the database.
- Using a web browser, script, or rich client, the user remotely interacts with the JChem Web Services Server.
- Acting as a black box, the JChem Web Services Server utilizes the core JChem classes to handle connections, search queries, and data manipulation and retrieval.
- Responding according to the SOAP protocol, the JChem Web Services Server returns specific data in a structured XML format ready to be consumed by the client application.

There are other possibilities for invoking substructure searching, which might better suit your demands. For example, JChem Cartridge offers an alternative three-tier architecture.
JChem Cartridge architecture
In the case of the cartridge, the client application or application server communicates through SQL only, and all internal JChem Base operations are hidden. For efficiency reasons, the JChem Cartridge itself uses a JChem computation server that may reside on a dedicated server. More details can be found in the JChem Cartridge Developers Guide.

Instant JChem architecture is described in the Instant JChem documentation.
Table types
There are different structure table types available in JChem, depending on the desired structure content. The table type determines the checks at table import and influences certain searching operations on the table.
- Molecules (default): This table type stores
specific structures, like single molecules, mixtures, salts,
coordination compounds and polymers.
For example, the following structures may be stored in molecule
tables:




- Reactions: Table for storing single step reactions.
For similarity searching, it can use reactant, product or reaction
similarity metrics (see details
here).
For example, the following reaction structure may be stored in
a reaction table:

- Markush libraries: Table for storing Markush
structures. (This table type is not allowed for the MS Access DBMS.)
See more information about the capabilities of these tables in the
JChem Query Guide.


- Query structures: Table for storing query structures.
Typically used for superstructure search. Note: SMILES strings
imported into this table will be interpreted as SMARTS. Standardization
of the inserted structures is described in the standardization
documentation of the query guide. For more information about available
query features, see the JChem Query Guide. Query tables
guarantee that all query features of stored structures are
correctly handled during superstructure search.




- Any structures: All types of structures are allowed,
but no structure type-specific searching takes place (e.g.
similarity values for reactions will not distinguish reactants,
products and reaction centers).
For example, the following structures may be stored in "Any
structure" type tables:





Compatibility notes: Tables created before JChem version 3.2 will be treated as "Any structures" to maintain previous behavior. The default type for new tables is "Molecules".
Table type can be specified at table or index creation. (See, for example: JChem Manager or index creation in JChem Cartridge.)
JChem table structure
Structure tables contain chemical structures and associated data, including both data used internally by the JChem system and custom, user-defined data. User-defined data may be any information related to the chemical structure: name, external ID, physico-chemical properties, etc. Any number and type of user-defined columns can be added to JChem tables (within the limits of the underlying RDBMS), and they can be standard (static) or calculated columns. The following columns are used by JChem internally and are added at table creation. User-defined columns can be added at table creation or at any later time.
- cd_id (JDBC type: INTEGER): Provides a unique identifier of the compound. If no value is specified for cd_id during the insertion of new structures, then the value is incremented automatically. A database index is automatically created for this column at table creation.
- cd_structure (JDBC type: LONGVARBINARY): Stores the structure in the original input format. It is used for displaying the structure and, in some cases, for searching (only when cd_smiles is not available). MDL Molfiles and SDfiles are stored in compressed Molfile (csmol) form. This compression can be disabled so that the data is directly readable by non-ChemAxon tools. See Setting options in the Administration Guide.
- cd_smiles (JDBC type: VARCHAR(1000)) or cd_smarts (JDBC type: LONGVARBINARY) or cd_markush (JDBC type: LONGVARBINARY): These columns store the standardized structure in a compact format, allowing efficient caching and hence fast structure searching. (If this representation of the structure is larger than the maximum length of the column or cannot be represented for any other reason, then NULL is stored and the cd_structure field is used during the search.) cd_smiles is used for Molecule, Any and Reaction table types, and contains ChemAxon Extended SMILES formatted structures. cd_smarts is used for the Query table type, and contains ChemAxon Extended SMARTS formatted structures. cd_markush is used for the Markush table type, and contains compressed Marvin documents of the internal Markush representation.
- cd_formula (JDBC type: VARCHAR(100)): The molecular formula of the molecule, e.g. C7H6O2. The atomic symbols are in Hill order: C is listed first, followed by H, followed by the remaining elements in alphabetical order. If the molecular formula is often used for searching, it is advised to create a database index on this column.
- cd_sortable_formula (JDBC type: VARCHAR(255)): A transformed cd_formula (see above), which is available for correct alphanumerical sorting of formulas. (For example, C4H10 should precede C12H26 since 4 is smaller than 12, but the simple alphanumerical ordering of strings would produce the opposite order.) In the sortable formula column, all numbers in the formula are left-padded with zeros up to a constant length of 5.
- cd_molweight (JDBC type: DOUBLE or FLOAT): The molecular weight. If the molecular weight is often used for searching, it is advised to create a database index on this column.
- cd_timestamp (JDBC type: TIMESTAMP): The date and time of the insertion or the last update of the chemical structure in the row.
- cd_hash (JDBC type: INTEGER): A hash code of the chemical structure. It is used for duplicate search, and for full structure search when no query features are specified on the query. It allows rapid pre-filtering before the atom-by-atom search. A database index is automatically created for this column at table creation.
- cd_fp1, cd_fp2, cd_fp3, ... cd_fpn (JDBC type: INTEGER): The fingerprints of the chemical structures, stored in several INTEGER columns. They contain chemical hashed fingerprints and, optionally, structural keys (if the table is configured that way). Fingerprints are used during substructure and similarity searching in the fast screening phase. For reaction tables, the reaction fingerprint of the reaction structure is stored instead, to allow different reaction similarity search types.
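The zero-padding idea behind cd_sortable_formula can be illustrated with a few lines of Python. This is a minimal sketch of the padding rule as described above, not JChem's actual implementation; the function name sortable_formula is made up for the example, and details such as the handling of implicit counts of 1 are not covered by the source and are left out.

```python
import re


def sortable_formula(formula, width=5):
    """Left-pad every number in the formula with zeros to a fixed width."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), formula)


formulas = ["C12H26", "C4H10"]

# Plain string ordering puts C12H26 first, because "1" sorts before "4".
plain_order = sorted(formulas)

# Ordering by the padded form compares 00004 with 00012 instead.
padded_order = sorted(formulas, key=sortable_formula)
```

With the padded key, C4H10 (C00004H00010) sorts before C12H26 (C00012H00026), which matches the ordering a chemist would expect.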
Update Log (UL) tables
For each new structure table, an accompanying "myTableName_UL" table is also created. These tables are used for refreshing the structure cache in concurrent environments. If an insert, update, or delete operation is performed, it is logged in the _UL table. The next search can then update the structure cache incrementally based on these logs.
JChem property table
The JChem property table contains information about JChem's tables, registration information and further details about the database. Simply, this table identifies a JChem "environment" or "configuration". The default name of the table is "JChemProperties".
The JChem property table contains key-value pairs, like a property file or configuration file. The JChem Manager and Instant JChem applications and JChem Cartridge create and alter JChem Property Tables automatically. The JChem property table should only be edited by JChem applications or through the JChem API or JChem Cartridge operators/functions.
The property table contains these columns:
| Column Name | Description |
| prop_name | Keys that are used to access the value. |
| prop_value | The value of the property. |
| prop_value_ext | Used if the property value is too large for prop_value. |
Only one property value column (either prop_value or prop_value_ext) should be in use at any time. The other column should be null.
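The key-value layout and the rule that only one value column is filled per row can be sketched with an in-memory SQLite table. Everything beyond the three column names and the default table name is an assumption made for illustration: the column types, the CHECK constraint, the MAX_SHORT threshold and the helper functions are hypothetical. A real JChem property table should only be modified through JChem tools or the JChem API.

```python
import sqlite3

MAX_SHORT = 200  # hypothetical size limit for prop_value


def create_property_table(conn):
    # The CHECK constraint expresses "at most one value column is non-NULL".
    conn.execute(
        "CREATE TABLE JChemProperties ("
        " prop_name TEXT PRIMARY KEY,"
        " prop_value TEXT,"
        " prop_value_ext TEXT,"
        " CHECK (prop_value IS NULL OR prop_value_ext IS NULL))"
    )


def set_property(conn, name, value):
    # Short values go to prop_value, long ones to prop_value_ext.
    if len(value) <= MAX_SHORT:
        short, ext = value, None
    else:
        short, ext = None, value
    conn.execute(
        "INSERT OR REPLACE INTO JChemProperties "
        "(prop_name, prop_value, prop_value_ext) VALUES (?, ?, ?)",
        (name, short, ext),
    )


def get_property(conn, name):
    row = conn.execute(
        "SELECT prop_value, prop_value_ext FROM JChemProperties "
        "WHERE prop_name = ?",
        (name,),
    ).fetchone()
    if row is None:
        return None
    return row[0] if row[0] is not None else row[1]
```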
Relevant methods for creating a property table, checking for its existence, and adding, setting, or deleting properties can be found in the DatabaseProperties API.
There can be one or more property tables for a database, located under the same or different schemas, if the database supports it. This can be used to create a multiuser database environment.
Search types
One major purpose of JChem tables is chemical structure search, which can be combined with data search and is highly customizable. The following search types are available in JChem databases. Please click the links in the titles for more information.
- Duplicate search
This search type can be used to decide equality of molecules. It is used during duplicate filter import. All structural features (atom types, isotopes, stereochemistry, query features, etc.) must be the same for matching two chemical structures, but for example coordinates and dimensionality are usually ignored.
Nomenclature: In JChem versions prior to 5.2, this search type was called "perfect search". In other cheminformatics toolkits or cartridges this functionality may be called exact structure search.
- Substructure search
Chemists are most often interested in this search type, which decides whether a molecular structure contains a specific subgraph. Sometimes not only the chemical subgraph is provided, but also certain query features that further restrict the structures to find. If special molecular features are present on the query (e.g. stereochemistry, charge, etc.), only those targets match which also contain the feature. However, if a feature is missing from the query, it is not required to be missing from the target (by default). For more information, see the JChem Query Guide.
- Full structure search
A full structure search finds molecules that are equal (in size) to the query structure. (No additional fragments (e.g. salt) or heavy atoms are allowed.) Molecular features (by default) are evaluated the same way as described above for substructure search.
Nomenclature: This search type was called exact search in JChem versions prior to 5.2, but was renamed to reduce confusion. (Note that this search type is NOT the same as the exact search of several other cheminformatics tools or cartridges, where it is used for finding duplicates. This latter functionality is called duplicate search in our terminology.)
- Full fragment search
Full fragment search is a combination of substructure and full structure search: the query must fully match a fragment of the target. Other fragments may be present in the target, but they are ignored. This search type is useful for performing a "full structure search" that ignores salts or solvents stored with the main structure in the target.
- Similarity search
This search type is used to retrieve structurally similar chemical structures. By default, it uses the Tanimoto metric of chemical hashed fingerprints, but other screening configurations are also available through the JChem Screen integration. In this latter case, additional descriptor tables that link to the JChem table need to be added to the database.
The JChem Query Guide describes each search type in more detail.
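The Tanimoto metric mentioned under similarity search is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below computes it for fingerprints represented as Python integers; this representation and the convention of returning 1.0 for two empty fingerprints are choices made for the example, not a description of JChem internals.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two bit fingerprints given as integers."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both
    total = bin(fp_a | fp_b).count("1")   # bits set in either
    # Assumed convention: two empty fingerprints count as identical.
    return common / total if total else 1.0


# 0b1101 and 0b1001 share 2 set bits out of 3 distinct set bits.
similarity = tanimoto(0b1101, 0b1001)
dissimilarity = 1.0 - similarity
```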
Search options
In addition to the above search types, there are many search options that modify structure search behavior. The most important options are listed below; the full list can be found in the JChem Query Guide. Please click the links in the titles for more information.
- Tautomer search
This search option can instruct the search engine to look for all tautomer forms of the query, as generated by the Marvin plugin Tautomers.
- Vague bond search levels
These search options allow a choice between several levels of strictness in matching bond types, especially regarding aromaticity. The higher the level is, the more tolerant the bond matching becomes. The table below summarizes the vague bond levels.
| Vague bond level | Description |
| Level 0 (off) | Does not perform vague bond matching. |
| Level 1 (default) | Handling of 5-membered rings with ambiguous aromaticity |
| Level 2 | All query ring bonds become "or aromatic" |
| Level 3 | All query bonds (ring and chain) become "or aromatic" |
| Level 4 | Ignore all bond types |
- Stereo search
This search option specifies how stereochemistry should be evaluated:
- On (default): When the query does not contain stereo information, the hits will include results both with and without stereo information. Otherwise, the stereo information is taken into account during the search.
- Exact: All stereo information is tested for equality, meaning that a non-stereo query only matches non-stereo targets.
- Diastereomer: Retrieves stereoisomers where tetrahedral stereo information is present on the same stereo centers, but their configuration (parity) is arbitrary.
- Off: All stereo information is ignored.
- Charge, isotopes, radical, valence settings
These search options specify how different atomic properties should be evaluated. Each of them has three settings. The charge option is described below; the other options work the same way:
- By default, an uncharged atom matches both charged and uncharged atoms and a charged atom only matches charged ones.
- In exact charge mode, an uncharged atom only matches the uncharged atoms and a charged atom only charged ones.
- In ignore charge mode, the charge is not checked during searching.
- Chemical Terms filter
Searches can include extra conditions formulated in the Chemical Terms language. Chemical Terms is a chemistry language which allows users to formulate complex chemical questions, expressions and rules. Chemical Terms can contain references to functional groups, other structural elements and physico-chemical properties. The filter expressions are evaluated on the fly, but the Chemical Terms calculated columns are used if the column definition is part of the filter expression.
- Combining structure search with other (non-structure) conditions
Non-structural conditions can be added to the database search by specifying an SQL statement through the filterQuery property. In case of the Cartridge, another solution is to combine JChem operations with other conditions in the WHERE clause of the SQL SELECT statement.
- Other search options
The JChem Query Guide summarizes all available search options.
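The three settings described above for the charge option (default, exact, ignore) can be written down as a small matching rule. The sketch below is an interpretation, not JChem code: the function name and the mode strings are hypothetical, and it assumes that a charged query atom matches only target atoms carrying the same formal charge, which the text above does not state explicitly.

```python
def charge_matches(query_charge, target_charge, mode="default"):
    """Decide whether a query atom's charge is compatible with a target atom."""
    if mode == "ignore":
        # Charge is not checked at all.
        return True
    if mode == "exact":
        # Uncharged matches only uncharged, charged only the same charge.
        return query_charge == target_charge
    if mode == "default":
        # An uncharged query atom matches anything; a charged one
        # (assumption) requires the same charge on the target.
        return query_charge == 0 or query_charge == target_charge
    raise ValueError(f"unknown mode: {mode}")
```

According to the text, the isotope, radical and valence options follow the same three-way pattern, so the same shape of rule would apply with the respective atomic property in place of the charge.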
The structure cache
To boost the speed of searching, JChem caches fingerprints and structures in the application's memory space. (In the case of a web application, the application is usually an application server. In the case of the Cartridge, it is the JChem server. In rich client applications, including Instant JChem, the structure cache is created on the client machine.)
The structure cache is stored in a static pool, therefore a structure table is only cached once within the same Java Virtual Machine (JVM). When structure tables change between search operations, the structure cache is incrementally updated to ensure minimum overhead. Introduced in JChem 5.3.2, cache registration helps the load and update process.
The build-up of the cache can take a considerable amount of time and normally occurs once, when the first search is started.
Depending on the number of molecules in the database, the size of the fingerprints, and the average molecule size, structure caching can have significant memory needs. Typically, one million drug-like structures consume around 100 MB of memory in the structure cache. JChem Base Performance Information contains more information about this subject.
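For rough capacity planning, the figure of about 100 MB per million drug-like structures can be scaled to other table sizes. The helper below is a back-of-the-envelope sketch that assumes memory use grows linearly with the number of structures; the actual requirement depends on fingerprint size and average molecule size, as noted above.

```python
def estimate_cache_mb(n_structures, mb_per_million=100.0):
    """Linear estimate of structure cache size in MB (rule of thumb only)."""
    return n_structures / 1_000_000 * mb_per_million


# Example: a table with five million drug-like structures.
five_million_mb = estimate_cache_mb(5_000_000)
```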
Fingerprints, optimization of fingerprint parameters
JChem Base uses different kinds of fingerprints for speeding up structural searches (via an initial fingerprint screening phase) and performing similarity searches. Fingerprints are bit strings that encode structural features present in the molecule. Different fingerprint types are used:
- Chemical hashed fingerprints are used for most table types. These fingerprints are created by enumerating all linear patterns and rings (up to a predefined size) in the chemical structure, and the fingerprint bits are set using a hashing function.
- Reaction fingerprints are used for reaction tables. These contain different chemical hashed fingerprint sections, to allow different reaction similarity methods.
- Structural keys are optional additional bits appended to the fingerprints, relating to static patterns. A fixed set of structures, to be used as structural keys, can be specified in a file. The chemical hashed fingerprints will be extended with the appropriate number of integer columns to provide 1 bit for each structure. Important considerations related to structural keys:
- If a substructure search is run against the structure table and the query structure is identical to one of the structural keys, the time of the search will be close to zero. This is because the substructure search was already performed at import, and JChem only has to check whether the specified bit is set to 1. This is useful if you frequently run substructure searches on the table using the same set of query structures.
- If the query is not part of the structural key set, these keys are also considered for substructure and superstructure searches. Do not expect a major improvement in the effectiveness of screening in this case though, since the chemical hashed fingerprints are already very effective for most query structures.
- During similarity search the structural key part of the fingerprint is not considered (dissimilarity is only calculated from the chemical hashed fingerprint part).
- The speed of the import will slow down depending on the number of specified keys.
- The required memory for the structure cache will increase with the increased number of fingerprint columns.
- Some query features may cause hits to be lost when they are used as features in structural keys.
Problematic features are:
- charge (when ignoring charges in the search)
- isotope (when ignoring isotopes in the search)
- aliphatic (A - does not have aromatic bond)
- not member of a ring (R0)
The section about Chemical Hashed Fingerprints also describes how fingerprints can be optimized for good search performance.
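The principle of hashed fingerprints and fingerprint screening can be shown on a toy scale. The sketch below is not JChem's algorithm: it enumerates only linear atom paths (no rings, no bond types), represents a molecule as a list of atom symbols plus a list of bonded index pairs, and uses CRC32 as a stand-in hashing function. All function names and parameters are made up for the example. The screening test relies on the fact that every path of a substructure also occurs in the full structure, so every bit set for the query must also be set for a matching target.

```python
import zlib


def linear_paths(atoms, bonds, max_atoms=4):
    """Enumerate linear atom paths (as strings) up to max_atoms atoms."""
    neighbors = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    paths = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        forward = "-".join(symbols)
        backward = "-".join(reversed(symbols))
        # Store one canonical direction so C-O and O-C are the same path.
        paths.add(min(forward, backward))
        if len(path) < max_atoms:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths


def fingerprint(atoms, bonds, n_bits=512, max_atoms=4):
    """Hash every path to one bit of an n_bits-wide integer fingerprint."""
    fp = 0
    for path in linear_paths(atoms, bonds, max_atoms):
        fp |= 1 << (zlib.crc32(path.encode("ascii")) % n_bits)
    return fp


def passes_screen(query_fp, target_fp):
    """A target can only contain the query if it has all of the query's bits."""
    return (query_fp & target_fp) == query_fp


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
c_o_query = (["C", "O"], [(0, 1)])
```

Screening can produce false positives (different paths may hash to the same bit) but no false negatives, which is why an atom-by-atom search still has to follow on the structures that pass the screen.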
Standardization
To ensure that structure search results are correct, the query and the database molecules must share a similar representation. This is achieved automatically through table standardization in JChem databases.
The database molecules are standardized during structure import into a JChem table (and also during structure update). First, the original source of the chemical structure is stored in the cd_structure field, which can then be used for display and export purposes. The standardized form is then stored in the cd_smiles field in a compact format. This representation is used by the search process. All additional structure-dependent data (fingerprints, molecular weight and formula, Chemical Terms calculated columns) are also calculated from the standardized form. In the case of a JChem index in the Cartridge, this process is done during index creation (and during structure insert/update in an indexed structure column), and the standardized form is stored within the index.
Query structures are standardized automatically before the search.
There are two types of standardization in the database:
- Default standardization: By default, the bonds of aromatic systems are replaced with aromatic bonds and explicit hydrogen atoms are transformed to implicit ones when possible. This standardization is adequate in most simple cases.
- Custom standardization: In some cases custom standardization
is necessary, e.g. if nitro groups in the input structures are
represented in two different forms. One can define custom
standardization rules with a
Standardizer configuration
(XML or action string). The custom configuration can be specified at
table or index creation. Custom standardization requires a Standardizer
license.
This demo animation shows standardization setup in Instant JChem, and the following figure illustrates the Standardizer configuration builder and an example transformation that can be achieved using Standardizer:

For more information, see the standardization section of the JChem Query Guide.
Chemical Terms calculated columns
JChem database products uniquely allow the storage of a wide range of automatically calculated chemical properties in JChem tables and JChem indices. These properties are stored in Chemical Terms calculated columns that can be added at table creation or at any later time.
Calculated columns are automatically computed when a structure is inserted into the structure table or updated. The data to be calculated is defined by a Chemical Terms expression for each calculated column. This language contains many structure-related functions, including the whole range of ChemAxon property calculations.
Calculated columns can be created using Instant JChem, JChem Manager and JChem Cartridge, for both JChem tables and JChem indices.
Handling of tautomers
There are various solutions for handling tautomers in JChem:
1. Tautomer duplicate table or JChem index option
If the tautomer duplicate table option (or the corresponding JChem index option) is set, JChem uses a generic tautomer for duplicate search. This ensures that all theoretically possible tautomers will be found as duplicates. All other search types (substructure, similarity, etc.) will use the standardized version of the originally inserted tautomer. This approach is the fastest for duplicate search, but there is a little overhead at structure import.
When searching in memory or in files, the same approach can be accessed through the tautomer duplicate filtering search option.
Stereo notes
- Double bond stereo information within tautomerizable groups is ignored, but double bond stereo information independent of tautomerizable groups is checked.
- Tetrahedral stereo (e.g. wedge) information is protected from tautomerization by default.
This protection can be switched off in the following ways:
- in JChem Cartridge by index creation;
- using the jcman command line tool by setting the --set-switchoff-prots true option;
- in the API by setting StructureTableOption;
- in JChem Manager GUI by modifying table settings.
Implementation and architecture notes for tautomer duplicate tables/index tables
- cd_smiles contains the standardized version of the molecule, used by substructure, similarity and full structure search.
- cd_hash is calculated from the generic tautomer, however the generic tautomer itself is not stored.
- Duplicate search workflow in case of tautomer duplicate tables:
- The query is standardized, and then its generic tautomer is created. Hash code is calculated.
- Screening with hash code.
- On the remaining records: read cd_structure, standardize, and then create generic tautomer.
- The two generic tautomers (query and target) are checked with a duplicate atom-by-atom search. Extra settings are also used here, e.g. data S-groups of the generic tautomer are checked.
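The duplicate search workflow listed above follows a common pattern: compute a hash of a normalized form, screen candidates by hash, then verify the survivors with a full comparison. The sketch below shows only this pattern. It uses plain strings as stand-ins for structures, a lookup table (TOY_GENERIC) as a stand-in for "standardize, then create the generic tautomer", string equality in place of the atom-by-atom search, and CRC32 in place of JChem's hash code. All names are hypothetical.

```python
import zlib

# Toy stand-in: both tautomer labels map to the same "generic tautomer".
TOY_GENERIC = {"keto-form": "generic-A", "enol-form": "generic-A"}


def toy_normalize(structure):
    return TOY_GENERIC.get(structure, structure)


def structure_hash(normalized):
    return zlib.crc32(normalized.encode("utf-8"))


def register(structures, normalize):
    """Build rows of (cd_id, cd_hash, cd_structure), hashing at import time."""
    return [
        (cd_id, structure_hash(normalize(s)), s)
        for cd_id, s in enumerate(structures, start=1)
    ]


def find_duplicates(query, rows, normalize):
    q_norm = normalize(query)
    q_hash = structure_hash(q_norm)
    hits = []
    for cd_id, cd_hash, cd_structure in rows:
        if cd_hash != q_hash:
            continue  # cheap screening step
        # Verification on the remaining records only.
        if normalize(cd_structure) == q_norm:
            hits.append(cd_id)
    return hits


rows = register(["keto-form", "enol-form", "other"], toy_normalize)
```

As in the workflow above, the normalized form itself is not stored in the rows; it is recreated from the stored structure for the few records that pass the hash screen.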
2. Tautomer search option
Tautomer option of structure searching (to be used with the duplicate search type, if duplicates are of interest): It simply enumerates all theoretically possible tautomers and searches them one by one.
3. Canonical tautomer standardization
Canonical tautomer generation can be included directly in the standardization configuration of the table or index (tautomerize action). In this case, all search types will use the canonical tautomer, not just duplicate search. (Warning: the tautomerize action dearomatizes the structure, so an additional aromatize action must follow it. See explanation here.)
This approach uses the dominant tautomer model, which includes an energy (pKa) filter. This filter removes the transformations that are unlikely in solution. Tautomerization also depends on other environmental factors: phase (solid / solution), solvent, temperature, etc., but these are not considered in either of our methods.
The following poster has more information about the generation of all, dominant and canonical tautomers: Tautomer generation. pKa based dominance conditions for generating dominant tautomers.
4. Custom tautomer transformations in standardization
It is also possible to add your own transformation rules for separate tautomerizable functional groups as part of the table/index standardization. A few examples can be found on this page.
Discussion of the methods
All four methods are suitable for duplicate search, but for substructure search there are different issues:
- Option 1. does not consider tautomers for substructure search.
- The canonical tautomer generation algorithm requires a full molecule to properly consider energetics and the local structural environment of tautomerizable functional groups. For this reason, option 3. is not ideal for substructure search.
Therefore, only solutions 2. and 4. are recommended for substructure searching.
Concerning search speed, solutions 1., 3. and 4. are the fastest to search, because all transformations are done at registration time. Solution 2. is much slower to search than all other options.
Registration (indexing) is fastest with solution 2. (no registration overhead). Second fastest is solution 1. (little registration overhead). Solutions 3. and 4. are the slowest to register (depending on the complexity of the Standardizer configuration).