JChem Developers Guide

Version 5.1.4

1. JChem chemical database concepts

ChemAxon's range of database products comprises JChem Base, JChem Cartridge for Oracle and Instant JChem. JChem Base provides the main chemical database intelligence and search engine, and is the basis of the other two products. The Cartridge offers an Oracle SQL interface for JChem Base and other ChemAxon products, and Instant JChem is an all-in-one desktop chemical database application. This chapter describes the main concepts of JChem Base, which are therefore also relevant for understanding JChem Cartridge and Instant JChem.

JChem Base architecture

Web architecture: A typical interaction between a client and the database

  1. Using a web browser, the user enters a structure into the MarvinSketch applet.
  2. A custom script (or servlet) for substructure/similarity searching is activated, which
    • Connects to a database through JDBC.
    • Searches in a table containing structures.
    • Creates a list containing the ID numbers of found structures.
  3. The script retrieves mixed structural and non-structural data by SQL SELECT statements, using the hit ID numbers and tables or views in the database.
  4. The script creates the page that displays the retrieved data in the client's browser using the MarvinView applet.
  5. The user manipulates the data, etc.
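
Step 3 above can be sketched in plain JDBC terms. The following is a minimal illustration, not JChem API: the class name, the table name "compounds" and the "supplier" column are hypothetical, and only the cd_id column comes from this guide. It builds a parameterized SELECT that fetches mixed structural and non-structural columns for a list of hit IDs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: builds the SELECT used to fetch data for search hits. */
final class HitQueryBuilder {

    /**
     * Returns "SELECT <columns> FROM <table> WHERE cd_id IN (?, ?, ...)"
     * with one placeholder per hit ID.
     */
    static String buildSelect(String table, List<String> columns, int hitCount) {
        if (hitCount < 1) {
            throw new IllegalArgumentException("at least one hit ID is required");
        }
        String placeholders = String.join(", ", Collections.nCopies(hitCount, "?"));
        return "SELECT " + String.join(", ", columns)
                + " FROM " + table
                + " WHERE cd_id IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // "compounds" and "supplier" are made-up names for illustration only.
        System.out.println(buildSelect("compounds",
                Arrays.asList("cd_id", "cd_formula", "supplier"), 3));
    }
}
```

The resulting string would typically be passed to java.sql.Connection.prepareStatement, with each hit ID bound through PreparedStatement.setInt. Databases limit the number of elements allowed in an IN list, so long hit lists may need to be processed in batches.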

[Figure: web architecture model]

Rich client architecture: A typical interaction between a client and the database

Another solution is a two-tier architecture, where the client Java or .NET application uses the JChem Base and JDBC APIs to interact with the database. In this case, chemical structure input and output may use the MarvinSketch and MarvinView bean components embedded in the client application.

[Figure: rich client architecture model]

There are other possibilities for invoking substructure searching, which might better suit your demands. For example, JChem Cartridge offers an alternative three-tier architecture.

JChem Cartridge architecture

In the case of the Cartridge, the client application or application server communicates through SQL only, and all internal JChem Base operations are hidden. For efficiency reasons, the JChem Cartridge itself uses a JChem computation server that may reside on a dedicated server. More details can be found in the JChem Cartridge Developers Guide.

[Figure: JChem Cartridge architecture model]

Instant JChem architecture is described in the Instant JChem documentation.

Table types

There are different structure table types available in JChem, depending on the desired structure content. The table type determines the checks at table import and influences certain searching operations on the table.

Compatibility notes: Tables created before JChem version 3.2 will be treated as "Any structures" to maintain previous behavior. The default type for new tables is "Molecules".

The table type can be specified at table or index creation. (See, for example: JChem Manager or index creation in JChem Cartridge.)

JChem table structure

Structure tables contain chemical structures and associated data, including both data used internally by the JChem system and custom, user-defined data. User-defined data may be any information related to the chemical structure: name, external ID, physico-chemical properties, etc. Any number and type of user-defined columns can be added to JChem tables, within the limits of the underlying RDBMS, and these can be standard (static) or calculated columns. The following columns are used by JChem internally and are added at table creation. User-defined columns can be added at table creation or at any later time.

cd_id (JDBC type: INTEGER)

Provides a unique identifier of the compound. If no value is specified for cd_id during the insertion of new structures, then the value is incremented automatically. A database index is automatically created for this column at table creation.

cd_structure (JDBC type: LONGVARBINARY)

Stores the structure in the original input format. It is used for displaying the structure and, in some cases, for searching (only when cd_smiles is not available). MDL Molfiles and SDfiles are stored in compressed Molfile (csmol) form. This compression can be disabled so that the stored structures are directly readable by non-ChemAxon tools. See Setting options in the Administration Guide.

cd_smiles (JDBC type: VARCHAR(1000)) or
cd_smarts (JDBC type: LONGVARBINARY) or
cd_markush (JDBC type: LONGVARBINARY)

These columns store the standardized structure in a compact format, allowing efficient caching and hence fast structure searching. (If this representation of the structure is larger than the maximum length of the column or cannot be represented for any other reason, then NULL is stored and the cd_structure field is used during the search.)

  • cd_smiles is used for the Molecule, Any and Reaction table types, and contains structures in ChemAxon Extended SMILES format.
  • cd_smarts is used for the Query table type, and contains structures in ChemAxon Extended SMARTS format.
  • cd_markush is used for the combinatorial Markush table type, and contains compressed Marvin documents of the internal Markush representation.

cd_formula (JDBC type: VARCHAR(100))

The molecular formula of the molecule, e.g. C7H6O2. The atomic symbols are in Hill order: C is listed first, followed by H, followed by the remaining elements in alphabetical order. If the molecular formula is often used for searching, it is advisable to create a database index on this column.
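
As an illustration of Hill order, the sketch below formats a map of element counts. It is not JChem code: the class and method names are hypothetical. It follows the rule stated above when carbon is present, and it assumes the general Hill convention for carbon-free formulas (all elements, including H, in alphabetical order), which this guide does not spell out. It also omits a count of 1, as is conventional.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: formats element counts as a Hill-order formula. */
final class HillFormula {

    static String format(Map<String, Integer> counts) {
        // TreeMap keeps the remaining element symbols in alphabetical order.
        TreeMap<String, Integer> rest = new TreeMap<>(counts);
        StringBuilder sb = new StringBuilder();
        if (rest.containsKey("C")) {
            // With carbon present: C first, then H, then the others.
            append(sb, "C", rest.remove("C"));
            Integer hydrogens = rest.remove("H");
            if (hydrogens != null) {
                append(sb, "H", hydrogens);
            }
        }
        for (Map.Entry<String, Integer> e : rest.entrySet()) {
            append(sb, e.getKey(), e.getValue());
        }
        return sb.toString();
    }

    private static void append(StringBuilder sb, String symbol, int count) {
        sb.append(symbol);
        if (count > 1) {
            sb.append(count); // a count of 1 is left implicit
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> benzoicAcid = new TreeMap<>();
        benzoicAcid.put("O", 2);
        benzoicAcid.put("H", 6);
        benzoicAcid.put("C", 7);
        System.out.println(format(benzoicAcid)); // prints C7H6O2
    }
}
```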

cd_sortable_formula (JDBC type: VARCHAR(255))

A transformed version of cd_formula (see above), which allows correct alphanumerical sorting of formulas. (For example, C4H10 should precede C12H26 since 4 is smaller than 12, but simple alphanumerical ordering of the strings would result in the opposite order.) In the sortable formula column, all numbers in the formula are left-padded with zeros to a constant length of 5.
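
The padding rule can be sketched as follows. This is an illustration only, with a hypothetical class name; it pads the numbers that appear in the formula string and makes no assumption about how JChem treats implicit counts of 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: left-pads every number in a formula to 5 digits. */
final class SortableFormula {

    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String toSortable(String formula) {
        Matcher m = NUMBER.matcher(formula);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Replace each run of digits with its zero-padded form.
            String padded = String.format("%05d", Integer.parseInt(m.group()));
            m.appendReplacement(out, padded);
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSortable("C4H10"));  // prints C00004H00010
        System.out.println(toSortable("C12H26")); // prints C00012H00026
    }
}
```

With the padded form, plain string comparison puts C4H10 before C12H26, whereas comparing the unpadded strings puts C12H26 first because the character '1' sorts before '4'.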

cd_molweight (JDBC type: DOUBLE or FLOAT)

The molecular weight. If the molecular weight is often used for searching, it is advised to create a database index on this column.

cd_timestamp (JDBC type: TIMESTAMP)

The date and time of the insertion or the last update of the chemical structure in the row.

cd_hash (JDBC type: INTEGER)

A hash code of the chemical structure. It is used for PERFECT (duplicate) search, and for EXACT search when the query contains no query features. It allows rapid pre-filtering before the atom-by-atom search. A database index is automatically created for this column at table creation.

cd_fp1, cd_fp2, cd_fp3, ...cd_fpn (JDBC type: INTEGER)

The fingerprints of the chemical structures, stored in several INTEGER columns. They contain chemical hashed fingerprints and, if the table is configured that way, structural keys. Fingerprints are used during substructure and similarity searching in the fast screening phase. For reaction tables, the reaction fingerprint of the reaction structure is stored instead, to allow different reaction similarity search types.
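
This guide does not describe the screening step itself. A common way to screen with hashed fingerprints, shown here purely as an assumption about the general technique and not as JChem's implementation, is a bit-subset test: a target can only contain the query as a substructure if every bit set in the query fingerprint is also set in the target fingerprint. The class name is hypothetical, and the int arrays stand in for the values of cd_fp1 ... cd_fpn.

```java
/** Hypothetical sketch of a fingerprint pre-screen for substructure search. */
final class FingerprintScreen {

    /**
     * Returns false if the target certainly does not contain the query.
     * A true result only means the target must still be checked atom by atom.
     */
    static boolean mayContain(int[] target, int[] query) {
        if (target.length != query.length) {
            throw new IllegalArgumentException("fingerprints must have equal length");
        }
        for (int i = 0; i < query.length; i++) {
            // Any query bit missing from the target rules the target out.
            if ((query[i] & ~target[i]) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] target = {0b1011, 0b0100};
        System.out.println(mayContain(target, new int[] {0b0011, 0b0100})); // prints true
        System.out.println(mayContain(target, new int[] {0b0100, 0b0000})); // prints false
    }
}
```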

Search types

One main purpose of JChem tables is chemical structure search that can be combined with data search and is highly customizable. The following search types are available in JChem databases.

Perfect search

This search type can be used to decide whether two molecules are identical. It is used for duplicate filtering during import. All structural features (atom types, isotopes, stereochemistry, query features, etc.) must be the same for two chemical structures to match, but coordinates and dimensionality, for example, are usually ignored.

Substructure search

This is the search type chemists use most often: it determines whether a molecular structure contains a specific subgraph. Sometimes the query contains not only the chemical subgraph but also query features that further restrict the structures to be found. If special molecular features are present on the query (e.g. stereochemistry, charge, etc.), only those targets match which also contain the feature. However, if a feature is missing from the query, the target is not required to lack it (by default). For more information, see the JChem Query Guide.

Exact structure search

An exact structure search finds molecules that are equal (in size) to the query structure. (No additional fragments (e.g. salt) or heavy atoms are allowed.) Molecular features (by default) are evaluated the same way as described above for substructure search.

Note that this search type is NOT the same as the exact search of several other cheminformatics tools or cartridges, where it is used for finding duplicates. This latter functionality is called perfect search in our terminology.

Exact fragment search

Exact fragment search is a combination of substructure and exact structure search: the query must exactly match a full fragment of the target. Other fragments may be present in the target, but they are ignored. This search type is useful for performing an "Exact structure search" that ignores salts or solvents stored with the main structure in the target.

Similarity search

This search type is used to retrieve structurally similar chemical structures. By default, it uses the Tanimoto metric of chemical hashed fingerprints, but other screening configurations are also available through the JChem Screen integration. In the latter case, additional descriptor tables that link to the JChem table need to be added to the database.
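
For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below computes it over int arrays such as the values of cd_fp1 ... cd_fpn. The class name is hypothetical, and returning 0.0 for two empty fingerprints is an arbitrary choice made here to avoid a division by zero.

```java
/** Hypothetical sketch: Tanimoto coefficient of two binary fingerprints. */
final class Tanimoto {

    static double similarity(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("fingerprints must have equal length");
        }
        int common = 0; // bits set in both fingerprints
        int union = 0;  // bits set in at least one fingerprint
        for (int i = 0; i < a.length; i++) {
            common += Integer.bitCount(a[i] & b[i]);
            union += Integer.bitCount(a[i] | b[i]);
        }
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        // 0b1110 and 0b0111 share 2 bits out of 4 set overall.
        System.out.println(similarity(new int[] {0b1110}, new int[] {0b0111})); // prints 0.5
    }
}
```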

The JChem Query Guide describes each search type in more detail.

Search options

In addition to the above search types, there are many search options that modify structure search behavior. The most important options are listed below; the full list can be found in the JChem Query Guide.

Tautomer search

This search option instructs the search engine to look for all tautomer forms of the query, as generated by the Marvin plugin Tautomers.

Vague bond search levels

These search options allow a choice between several levels of strictness in matching bond types, especially regarding aromaticity. The higher the level is, the more tolerant the bond matching becomes.

The table below summarizes the vague bond levels.

Vague bond level     Description
Level 0 (off)        Does not perform vague bond matching.
Level 1 (default)    Handling of 5-membered rings with ambiguous aromaticity
Level 2              All query ring bonds become "or aromatic"
Level 3              All query bonds (ring and chain) become "or aromatic"
Level 4              Ignore all bond types

Stereo search

This search option specifies how stereochemistry should be evaluated:

  • On (default): When the query does not contain stereo information, the hits will include results both with and without stereo information. Otherwise, the stereo information is taken into account during the search.
  • Exact: All stereo information is tested for equality, meaning that a non-stereo query only matches non-stereo targets.
  • Diastereomer: Retrieves stereoisomers where tetrahedral stereo information is present on the same stereo centers, but their configuration (parity) is arbitrary.
  • Off: All stereo information is ignored.

Charge, isotopes, radical, valence settings

These search options specify how different atomic properties should be evaluated. Each of them has three settings. The charge option is described below; the other options work the same way:

  • By default, an uncharged atom matches both charged and uncharged atoms and a charged atom only matches charged ones.
  • In exact charge mode, an uncharged atom only matches the uncharged atoms and a charged atom only charged ones.
  • In ignore charge mode, the charge is not checked during searching.
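
The three settings can be written down as a small matching function. This is a sketch of the rules listed above, not JChem code; the class, enum and method names are hypothetical. It also makes one assumption the text leaves open: a charged query atom is taken to match only a target atom with the same charge value.

```java
/** Hypothetical sketch of the three charge matching modes. */
final class ChargeMatching {

    enum Mode { DEFAULT, EXACT, IGNORE }

    static boolean matches(int queryCharge, int targetCharge, Mode mode) {
        switch (mode) {
            case IGNORE:
                // Charge is not checked at all.
                return true;
            case EXACT:
                // Uncharged matches only uncharged, charged only charged.
                return queryCharge == targetCharge;
            default:
                // An uncharged query atom matches anything; a charged one
                // needs a target with the same charge (assumption, see above).
                return queryCharge == 0 || queryCharge == targetCharge;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(0, 1, Mode.DEFAULT)); // prints true
        System.out.println(matches(0, 1, Mode.EXACT));   // prints false
        System.out.println(matches(1, -1, Mode.IGNORE)); // prints true
    }
}
```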

Chemical Terms filter

Searches can include extra conditions formulated in the Chemical Terms language. Chemical Terms is a chemistry language which allows users to formulate complex chemical questions, expressions and rules. Chemical Terms can contain references to functional groups, other structural elements and physico-chemical properties. The filter expressions are evaluated on the fly, but the Chemical Terms calculated columns are used if the column definition is part of the filter expression.

Combining structure search with other (non-structure) conditions

Non-structural conditions can be added to the database search by specifying an SQL statement through the filterQuery property. In the case of the Cartridge, another solution is to combine JChem operations with other conditions in the WHERE clause of the SQL SELECT statement.

Other search options

The JChem Query Guide summarizes all available search options.

The structure cache

To boost the speed of searching, JChem caches fingerprints and structures in the application's memory space. (In the case of a web application, the application is usually an application server. In the case of the Cartridge, it is the JChem server. In rich client applications, including Instant JChem, the structure cache is created on the client machine.)

The structure cache is stored in a static pool, therefore a structure table is only cached once within the same Java Virtual Machine (JVM).

The build-up of the cache can take a considerable amount of time and normally occurs once, when the first search is started.
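
The "static pool, built once per JVM, on first use" behavior can be illustrated with a generic Java pattern. The sketch below is not JChem's cache: the class name, the use of a table name as key and the int[][] payload are all assumptions made for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of a per-JVM cache pool that loads each table once. */
final class StructureCachePool {

    // Static, so every caller in the same JVM shares the same entries.
    private static final ConcurrentHashMap<String, int[][]> POOL = new ConcurrentHashMap<>();

    /**
     * Returns the cached data for the table, invoking the (slow) loader
     * only the first time the table is requested.
     */
    static int[][] get(String tableName, Function<String, int[][]> loader) {
        return POOL.computeIfAbsent(tableName, loader);
    }

    public static void main(String[] args) {
        Function<String, int[][]> loader = table -> {
            System.out.println("loading " + table); // printed once only
            return new int[][] {{1, 2}, {3, 4}};
        };
        get("compounds", loader); // first search: builds the cache
        get("compounds", loader); // later searches: reuse it
    }
}
```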

Depending on the number of molecules in the database, the size of the fingerprints and the average molecule size, structure caching can have significant memory needs. Typically one million drug-like structures consume around 100 MB memory in the structure cache. The JChem FAQ contains more information about this subject.
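
For rough capacity planning, the figure above can be scaled to other table sizes. The sketch below assumes that memory use grows linearly with the number of structures, which this guide does not state; actual needs depend on fingerprint size and average molecule size, as noted above. The class and method names are hypothetical.

```java
/** Hypothetical rule-of-thumb estimate of structure cache memory. */
final class CacheMemoryEstimate {

    // From this guide: about 100 MB per one million drug-like structures.
    static final double MB_PER_MILLION = 100.0;

    /** Linear extrapolation (an assumption) to the given row count. */
    static double estimateMb(long structureCount) {
        return structureCount / 1_000_000.0 * MB_PER_MILLION;
    }

    public static void main(String[] args) {
        System.out.println(estimateMb(5_000_000L) + " MB");
    }
}
```

Under this assumption, a table of five million drug-like structures would need on the order of 500 MB of heap for the cache alone.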

When structure tables change between search operations, the structure cache is incrementally updated to ensure minimum overhead.

Fingerprints, optimization of fingerprint parameters

JChem Base uses different kinds of fingerprints for speeding up structural searches (via an initial fingerprint screening phase) and performing similarity searches. Fingerprints are bit strings that encode structural features present in the molecule. Different fingerprint types are used:

The section about Chemical Hashed Fingerprints also describes how fingerprints can be optimized for good search performance.

Standardization

To ensure that structure search results are correct, the query and the database molecules must share a similar representation. This is achieved automatically through table standardization in JChem databases.

The database molecules are standardized during structure import into a JChem table (and also during structure update). First the original source of the chemical structure is stored in the cd_structure field, which can then be used for display and export purposes. The standardized form is then stored in the cd_smiles field in a compact format. This representation is used by the search process. All additional structure-dependent data (fingerprints, molecular weight and formula, Chemical Terms calculated columns) are also calculated from the standardized form. In the case of a JChem index in the Cartridge, this process is done during index creation (and during structure insert/update in an indexed structure column), and the standardized form is stored within the index.

Query structures are standardized automatically before the search.

There are two types of standardization in the database:

Chemical Terms calculated columns

JChem database products uniquely allow the storage of a wide range of automatically calculated chemical properties in JChem tables and JChem indices. These properties are stored in Chemical Terms calculated columns that can be added at table creation or at any later time.

Calculated columns are automatically computed when a structure is inserted into the structure table or updated. The data to be calculated is defined by a Chemical Terms expression for each calculated column. This language contains many structure-related functions, including the whole range of ChemAxon property calculations.

Calculated columns can be created using Instant JChem, JChem Manager and JChem Cartridge, for both JChem tables and JChem indices.

Handling of tautomers

There are various solutions for handling tautomers in JChem:
  1. Tautomer duplicate filtering table option or JChem index option. If set, JChem uses a generic tautomer for duplicate (perfect) search. This ensures that all theoretically possible tautomers are found as duplicates. All other search types (substructure, similarity, etc.) use the standardized version of the originally inserted tautomer. This approach is the fastest for perfect search, but there is a little overhead at structure import.

    With this method, stereo information within tautomerizable groups is ignored, but stereo information independent of tautomerizable groups is checked.

  2. Tautomer option of structure searching (to be used with perfect search type, if duplicates are of interest): It simply enumerates all theoretically possible tautomers and searches them one by one.
  3. Canonical tautomer generation can be included directly in the standardization configuration of the table or index (tautomerize action). In this case, all search types will use the canonical tautomer, not just perfect search. (Warning: the tautomerize action dearomatizes the structure, so an additional aromatize action must follow it. See explanation here.)

    This approach uses the dominant tautomer model, which includes an energy (pKa) filter. This filter removes the transformations that are unlikely in solution. Tautomerization also depends on other environmental factors: phase (solid / solution), solvent, temperature, etc., but these are not considered in any of our methods.

    The following poster has more information about the generation of all, dominant and canonical tautomers: Tautomer generation. pKa based dominance conditions for generating dominant tautomers.

  4. It is also possible to add your own transformation rules for separate tautomerizable functional groups as part of the table/index standardization. A few examples can be found on this page.

All four methods are suitable for perfect (duplicate) search, but for substructure search there are different issues:

Therefore, only solutions 2. and 4. are recommended for substructure searching.

Concerning search speed, solutions 1., 3. and 4. are the fastest to search, because all transformations are done at registration time. Solution 2. is much slower to search than all other options, because it enumerates the tautomers at search time.

Registration (indexing) speed is fastest with solution 2. (No registration overhead.) Second fastest is solution 1. (Little registration overhead.) Solutions 3. and 4. are the slowest to register. (Depending on the complexity of the standardizer configuration.)

 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.