Administration Guide of JChem

Version 5.0.3

The Administration Guide of JChem provides background information about the administration of structure tables in relational databases using JChemManager. It contains information about table creation, importing and exporting structure files.

Contents

  1. About JChemManager
  2. Installing and Starting JChemManager
  3. Connecting to Databases
  4. Creating Structure Tables
  5. Importing Structure Files
  6. Exporting Structure Files
  7. Deleting Structure Tables
  8. Modifying Structure Tables
  9. Changing Table Settings
  10. Setting Options
  11. Regeneration of Fix Columns
  12. Cartridge

1. About JChemManager

JChemManager is a tool for creating and removing structure tables, importing and exporting structure files into structure tables. The program is a two-tier Java application that can be run in operation systems supplied with Java (see Installing and Starting JChemManager for more on the system requirements).

Storing settings:

prop_name

prop_value

table.CDUSER.STRUCTURES.fingerprint.numberOfBits

512

table.CDUSER.STRUCTURES.fingerprint.numberOfEdges

6

table.CDUSER.STRUCTURES.fingerprint.numberOfOnes

2

table.CDUSER.STRUCTURES.updateCounter

5

You can find more information on the property table in the description of multiuser options.

2. Installing and Starting JChemManager

Requirements

In order to run JChemManager the following software need to be installed:

*For Microsoft Access JChem uses the ODBC driver included in Java, no external driver needed. ODBC connection to other database engines is not supported.

Running JChemManager

Prepare the usage of the jcman script as described in Preparing the Usage of JChem Batch Files and Shell Scripts. Run JChemManager by entering

    jcman

Warning:
To avoid the conflict of different versions of classes, jchem.jar should not be included in the system's class path (CLASSPATH). See jcman or jcman.bat in the bin subdirectory as examples. Avoid using directory names with spaces (e.g. use PROGRA~1 instead of Program Files in Win32) For example, such problem might occur when you view a HTML page using a browser, which contains a Marvin applet and, at the same time, the system's class path consists of jchem.jar.

Command line usage

Many operations of jcman can also be invoked without graphic interface. The command line usage makes the administration easier from a remote machine

In the above cases connect to the server using telnet or ssh and invoke jcman. The list of options (as listed with jcman -h):

Usage of GUI program:
        jcman
Usage of command line program:
        jcman <command> [options]

Commands:
c <table>                  create table in database
t                          list structure tables
t <table>                  show information on a structure table
t <table> <row>            show the specified row
a <table> <file>           import (add molecules) from a file into a table
x <table> <format>         export table to standard output.
                           format:
                               sdf (MDL SDfile)
                               mol (MDL Molfile)
                               rdf (MDL RDMolfile)
                               smi (Daylight smiles)
                               mrv (Marvin document)
x <table> <file>           export a table into a file. (The format will be
                           determined from the extension:.sdf/.mol/.smi
                           /.rdf/.mrv)
x <table> <file> <field>[:<field>]
                           exports specific fields from a table into a file
d <table>                  drop (remove) a structure table
u                          updates database structure and regenerates old
                           tables if necessary (typically after upgrade)
r                          regenerate all structure tables
r <table>                  regenerate a structure table
s <table>                  calculates and prints statistics for table
m <table>                  miscellaneous operations on the specified table
g <global-option>          set options affecting all tables

Options (general):
-h --help                  this help message
   --driver <JDBC driver>  the JDBC driver to use
   --dburl <url>           the database URL to connect
   --proptable <table>     the name of property table
-l --login <login>         login name
-p --password <password>   password
-s --saveconf              save settings into
"<an actual directory in your file system appears here>"

Options for table creation:
   --fplength <n>          fingerprint size in bits    (default: 512)
   --bits <n>              bits to be set for patterns (default: 2)
   --bonds <n>             pattern length              (default: 6)
   --coldefs <column defs> column definitions. If not empty, then syntax is
                               ", name1 type1, name2 type2, ..."
                           see doc of CREATE TABLE on how to define columns
   --stconfig <file>       standardizer configuration. If not given, default
                           standardization is used.
   --relative              only treats as absolute stereo if chiral flag set.
   --ctcolcfg <cols-exprs> a semicolon separated list of pairs of
                           column names and Chemical Terms expressions.
                           Each pair specifies that the values of the given
                           column should be automatically calculated using
                           the given Chemical Terms expression.
                           In each pair, the column name and the
                           Chemical Terms expression is separated with
                           an equal sign ('=').
   --t:<type>              the type of the structure table
          --t:molecules    Specific structures, like single molecules,
                           mixtures, salts, polymers
          --t:any          All types of structures are allowed, but no
                           structure type-specific searching
          --t:reactions    single step reactions
          --t:markush      for the storage of Markush structures (this table type
                           is not allowed for Ms Access dbms)
          --t:query        query structures
   --tdf:[y/n]             specify "y" to consider tautomers during duplicate
                           filtering (default is "n")

Option for regeneration :
   --stconfig <file>       standardizer configuration. If not given, there is
                           no change in the standardizer configuration.
                           Specify "reset_to_default" to change to default
                           standardization.

Options for import:
   --connect <connections> non-default SDFile and table field connections
                           (identical field names paired otherwise)
   --skip <n>              skip the first n molecules in SD file
   --lines <n>             check only the first n lines for field names
   --nodup                 does not import molecules already in database
   --noempty               does not import empty molecules
   --setchiralflag         sets chiral flag for MDL file formats.
   --diff                  does not import ANY molecules, only output
   --duplicates            write duplicate molecules to output
   --nonduplicates         write non-duplicate molecules to output
   --output                redirects output to a file (otherwise stdout)

Options for export:
   --where <condition>     where clause. Example: --where "cd_id<1000"

Options for miscellaneous table management operations (command 'm'):
   --add-ctcolcfg <opt>    adds Chemical Terms expressions to existing user-
                           defined columns. <opt> has the same format as the
                           argument to the --ctcolcfg parameter of the 'c'
                           (create table) command. The columns specified
                           must have no Chemical Terms expression
                           assigned to them.
   --set-ctcolcfg <opt>    sets Chemical Terms expressions to existing user-
                           defined columns. <opt> has the same format as the
                           argument to the --ctcolcfg parameter of the 'c'
                           (create table) command. The columns specified may
                           have Chemical Terms expressions already assigned
                           to them.
   --del-ctcolcfg <opt>    removes the assignment of Chemical Terms expressions
                           from existing user-defined columns. <opt> is
                           a semi-colon-separated list of column names.


Examples:
$ jcman c strdata --coldefs ", name CHAR(200), stock INTEGER"\
   --fplength 16 -l joe --driver org.gjt.mm.mysql.Driver\
   --dburl jdbc:mysql://localhost/mydb -s
$ jcman c ctcols --coldefs ", logp numeric(18,9), rotbl_bnd_cnt numeric(1,0)"\
   --ctcolcfg "logp=logp();rotbl_bnd_cnt=rotatableBondCount()>4"
...
$ jcman a str3d str3d.sdf.gz
$ jcman a str3d str3d.sdf.gz --connect "MOLNAME=NAME;PH=PH_VAL"
$ jcman t
    Table name              Rows
  1 str3d                 198556
  2 testdata                4260
$ jcman t ncidata
    Column name     Type name
  1 cd_id           LONG
  2 cd_structure    BLOB
...
 18 cd_fp16         LONG
$ jcman m strtable --add-ctcolcfg "logp=logp();rotbl_bnd_cnt=rotatableBondCount()>4"

3. Connecting to Databases

After starting JChemManager the "Connecting to a Database" dialog box appears. The dialog box can also be displayed by selecting the Connect icon on the tool bar of JChemManager.

After filling the form and selecting Ok, the system attempts to connect to a database using the settings entered. JChem uses the JDBC protocol to connect to relational databases. Before trying to connect, make sure that the appropriate JDBC driver is included in the CLASSPATH environmental variable. To establish a JDBC connection the following parameters have to be set:

JDBC driver:

A Java class name, the entry point of the JDBC driver has to be specified here. If a JDBC driver is available for the database, consult the documentation of the driver for the proper name. If you would like to use an ODBC driver, enter sun.jdbc.odbc.JdbcOdbcDriver. See FAQ for more details on JDBC and ODBC drivers. You can find the most common driver names here.

URL of database: 

A JDBC URL provides a way of identifying a database so that the appropriate driver will recognize it and establishes a connection with it. Please check the documentation of the driver to determine what the JDBC URL that identifies the particular driver will be.

In the case of an ODBC driver, the full syntax is

jdbc:odbc:<data-source-name>[<attributes>]
where each attribute has the following form
;<attribute-name>=<attribute-value>

For other common URL formats please click here.

Property table:

Enter the name of the property table. The default value is JChemProperties. Since version 1.6 this can be changed by the user to support flexible multiuser capabilities.

Login name:

Enter a user ID needed to enter the database. If a login name is not needed, then leave the field empty

Password:

Enter the password for the login name. If you want the system to save the password in the .jchem file for later use, check Remember password.

The above settings are stored in the .jchem file located under "chemaxon" or ".chemaxon" in your home directory. At the first connection, a new table called JChemProperties (or the name you specify) is generated in the database, which contains parameters of structure tables.

4. Creating Structure Tables

Structure tables may contain several types of data, but the following columns must always be present for JChem:

cd_id (JDBC type: INTEGER)

Provides a unique identifier of the compound. If no value is specified for cd_id during the insertion of new structures, then the value is incremented automatically.

cd_structure (JDBC type: LONGVARBINARY)

Stores the structure in the original input format. MDL Molfiles and SDfiles are stored in compressed Molfile (csmol) form. Used for displaying the structure and, in some cases, for searching (only when cd_smiles is not available).

cd_smiles (JDBC type: VARCHAR(1000))

Stores the standardized structure in ChemAxon Extended SMILES format. Since this storage form is very compact, it enables fast structure searching. (If the SMILES code of a molecule is larger than the maximum length of the cd_smiles column, then NULL is stored and the cd_structure field is used during the search.)

>cd_formula (JDBC type: VARCHAR(100))

The molecular formula of the molecule. The atomic symbols are in Hill Order: C is listed first, followed by H, followed by the remaining elements in alphabetical order.

cd_molweight (JDBC type: DOUBLE or FLOAT)

The molecular weight. If an application uses cd_formula or cd_molweight often for SQL queries, it is suggested to create indexes on these columns (using the SQL CREATE INDEX statement).

cd_fp1, cd_fp2, cd_fp3, ...cd_fpn (JDBC type: INTEGER)

The chemical hashed fingerprint of a compound is stored in several INTEGER columns.

For reaction tables the reaction fingerprint of the molecule is stored instead to improve reaction similarity search.

A chemical hashed fingerprint is a bit string that contains structural information on the molecule. It enables fast substructure and similarity searching. Such binary fingerprint is generated when a new molecule is added or a structure is updated. The first phase of substructure search is screening, when the fingerprints of the molecules and the query structure are compared. If

    fingerprint(molecule) & fingerprint(query) != fingerprint(query) 
then the molecule doesn't contain the query. (In the above formula "&" is the bitwise AND operator and "!=" means not equal.) Since screening may provide false hits, an atom-by-atom search follows this phase. The performance of the search is better if the number of false hits is lower.

Usually the performance of screening improves by increasing the size of the fingerprint (that is the number of cd_fpi columns). However, especially if the structure table is large, too many fingerprint columns may even cause slower searching, because of the higher demand of memory space during the search process. As a consequence, the fingerprints of only a part of the molecules may be cached in memory by the RDBMS. Maximum pattern length and the number of bits set for patterns also affect the quality of screening. These settings should depend on the average size of the structures in a structure table.

See more details in the section Parameters for Generating Chemical Hashed Fingerprints.

Other columns, like compound name or physico-chemical constants, may also be added to the table.

To create a structure table select the Create icon. The Create a Table dialog box appears.

Parameters to be specified:

Name of new table:

The name of the table to be created in the database that was specified at the Connecting to a Database dialog box.

Fingerprint length*:

Specify the number of INTEGER columns that will contain the chemical hashed fingerprints of the molecules. Higher number provides better screening performance for substructure searching, but too many columns may significantly increase the size of the structure cache.

Bits to be set for patterns*:

The number of bits to be set for each pattern in the structure. In the case of 16 fingerprint columns, 1 or 2 bits for a pattern are the best choice. Too high or too low value causes too many or too few 1 bits, respectively. The performance of the screening is optimal when the average number of 1 bits is about 50%.

Maximum pattern length*: 

The number of edges in a pattern.

Custom standardization: 

You may specify a custom standardization XML for the structure table. You have to regenerate the table, if you want to change the standardization.

Assume absolute stereo flag: 

If checked, all query and target structures are treated as absolute stereo. This setting can be changed later without regenerating the table.

Structural keys: 

You can specify here a fix set of structures in a file that will be used as structural keys. The fingerprint will be extended with the appropriate number of integer columns to provide 1 bit for each structure. If the imported structure contains the key as a substructure the bit will set to 1, otherwise to 0. A new fingerprint column will be added for every 32 keys. Important features of structural keys:

  • If you run a substructure search against the structure table and the query structure is identical to one of the structural keys, the time of the search will be close to zero (in cached mode). This is because the substructure search was already performed at import, and JChem only has to check whether the specified bit is set to 1. This is useful if you frequently run substructure searches on the table using the same set of query structures.
  • If the query is not part of the structural key set, these keys are also considered for substructure and superstructure searches. Do not expect a major improvement in the effectiveness of screening in this case though, since the chemical hashed fingerprints are already very effective for most query structures.
  • During similarity search the structural key part of the fingerprint is not considered (dissimilarity is only calculated from the chemical hashed fingerprint part).
  • The speed of the import will slow down depending on the number of specified keys.
  • The required memory for the structure cache will increase with the increased number of fingerprint columns.

Duplicate filtering uses tautomers: 

If checked tautomers are considered for duplicate filtering during import. Enabling this feature increases import time.

Table type: 

You can select from the following table types according to the desired scope of use:

  • Molecules (default): For the storage of specific structures, like single molecules, mixtures, salts and polymers.
  • Reactions: Table for storing single step reactions.
  • Any structures: All types of structures are allowed, but no structure type-specific searching takes place (e.g. similarity values for reactions will not distinguish reactants, products and reaction centers).
  • Markush libraries: Table for storing markush structures. (this table type is not allowed for Ms Access dbms)
  • Query structures: Table for storing query structures. Typically used for superstructure search. Note: SMILES string imported into this table will be interpreted as SMARTS.

Compatibility notes: Tables created before JChem version 3.2 will be treated as "Any structures" to maintain previous behaviour. The default type for new tables is "Molecules".

* The GenerateMD application can help determine the parameters providing the best performance. Statistics about fingerprint darkness of existing tables can also be obtained via running jcman s <table_name>. If you haven't made any previous testing, use the defaults that are optimized for typical compounds of pharmaceutical interest. The default values differ according to table type.

After pressing Ok in the Create a Table dialog box, the SQL statement for creating the structure table is displayed in the Create Table Statement dialog box.

If you would like to add more columns, then modify the SQL statement (though this can also be done later in most RDBMS-s).

If you would like to have Chemical-Terms-based columns (additional columns, the values of which are automatically calculated based on Chemical Terms expressions), you can specify them at the bottom of the dialog along with their respective Chemical Terms expressions. (The columns appearing in the bottom of the dialog as Chemical-Terms-based columns, must be also specified in the CREATE TABLE statement above. If you add columns later using your RDMBS and decide to make them Chemical-Terms-based columns, you can configure the Chemical Terms expressions for the new columns using the command line version of JChem Manager.)

NOTE:

  1. The definitions of additional columns should start after the definitions of fix columns (highlighted in the screen-shot).
  2. The name of extra columns should not start with "cd_".

There is a default limit on the length of the field cd_smiles for most RDBMS-s. If the majority of your molecules' SMILES representation is longer than this limit (in case of HUGE molecules), the search process can become slower. In this case you may try to increase the limit.

After selecting Ok, the SQL statement is executed.

For each structure table, the fingerprint properties are stored in the JChemProperties table. If the RDBMS supports schemata, username is also attached to the table name in the name column of the property table, like in the following example:

prop_name

prop_value

table.CDUSER.STRUCTURES.fingerprint.numberOfBits

512

table.CDUSER.STRUCTURES.fingerprint.numberOfEdges

6

table.CDUSER.STRUCTURES.fingerprint.numberOfOnes

2

5. Importing Structure Files

Import formats in JChemManager:

When the Import icon is selected, the "Import" dialog box appears.

Specify the database table and the input file. The data fields in the file can be imported into the columns of the table.

To support connecting the corresponding field names and column names, the program will detect field names in the file. If the file is too big, checking may be time consuming. Use the Check whole file for field names in selected file check box and the Number of lines to check input box to decide whether you want the whole file or just a given number of lines to be searched for field names.

If an error occurs during import, the error message and the corresponding stack trace information is written to the standard error. Check the Halt if an error occurs box if you would like the system to stop if a molecule can not be imported.

When Allow duplicate structures is unchecked, JChemManager will not import a structure if the database contains another structure with the same topology.

If Allow empty structures is unchecked, JChemManager will not import empty structures (structures where the atom count is zero).

If Set chiral flag for MDF formats is checked, JChemManager will set the chiral flag (absolute stereo flag) for the imported structures.

In case of SMILES*, Molfiles, RDfiles and Marvin Documents importing starts after selecting the Ok button.

For SDfiles and JTF, the "Connecting Fields" window appears, where you can connect the corresponding field names and column names.

* Some SMILES may contain additional data separated by whitespace from the structure string. The additional data columns are separated by tabulators ("\t"). In this case, the "Connecting Fields" dialog will also appear, the fields will be named as "field_0", "field_1" etc.

Clicking on a cell in the Field in file column of the window, a list box appears with the alternative field names. In the case of the cd_id column, Auto-incrementing can also be selected, which means that the value is increased by one after each new record. This works even if the RDBMS does not support auto-incrementing, because in this case JChemManager will take care of incrementing the value. (Some RDBMS-s that have the auto-incrementing feature for columns do not allow the explicit setting of cd_id.)

Importing starts after selecting the Ok button. A progress window displays the progress of the import.

6. Exporting Structure Files

Export formats in JChemManager:

When the Export icon is selected, the "Export" dialog box appears.

Specify the database table and the output file.
File format is determined from the extension:

Format Extension Examples
MDL SDfile extension starts with "sdf" .sdf, .sdfile
MDL Molfile extension starts with "mol" .mol, .molfile
MDL RDfile extension starts with "rdf" .rd, .rdf
MDL Rxnfile extension starts with "rxn" .rxn, .rxnfile
SMILES extension starts with "smi" .smi, .smiles
Marvin Document extension is "mrv" .mrv
JTF extension starts with "jtf" .jtf, .jtfile

After pressing the Ok button, the next dialog appears.

On the left panel, you can specify which fields to export in the case of formats that support additional data. By default, cd_id and the additional fields are selected (fields not beginning with "cd_"). You can add more fields by pressing the Add button. To remove one ore more fields, select them and press the Remove button. You can restore the default setting by pressing Reset. The Sort button arranges fields according to the original order in the database rows.

On the right panel you can specify a criteria for molecules to be exported in the form of an SQL WHERE statement.

The last dialog follows, where you can specify

Exporting starts after selecting the Ok button. A progress window displays the progress of the export.

7. Deleting Structure Tables

When the Delete icon is selected, the "Delete" dialog box appears.

In this dialog you can

8. Modifying Structure Tables

In this version of JChem updating and inserting rows in structure tables is not yet supported in JChemManager. However, custom applications using the JChem Class Library can perform these operations. See the examples included with the package.

9. Changing Table Settings

The table options dialog can be reached from the File -> Table Options dialog of JChemManager.

Change the table name in the combo box on the top to view / edit settings for other tables.

Changing some settings (e.g. standardization, tautomer duplicate filtering) requires the regeneration of the table. This will be performed after pressing the "Ok" button. The regeneration can take considerable amount of time depending on the size of the structure tables and other factors.

NOTE: you can make changes for multiple tables, your changes will be stored when selecting other tables. The actual changes in the database and the regeneration (if needed) will take place for all tables after pressing the "OK" button.

If "Assume absolute stereo flag" is set for a table, all query and target structures are treated as absolute stereo ("chiral flag" in MDL files).
Changing this setting does not require regeneration.

One can also specify if tautomers are considered during duplicate filtered import. Enabling this feature may slow down import significantly.
Changing this setting requires regeneration.

You can also change the standardization of the tables.
Changing this setting requires regeneration.

10. Setting Options

When the Options icon is selected, the "Options" dialog box appears. The options set here are stored in the property table.

NOTE: If you are using multiple property tables these options should be set for each property table individually.

Advanced:

You can set advanced options in the second tab.

  • Compression: Certain file formats are compressed before storing the structure source in the cd_structure field in the database. This has the following benefits:
    • Saves disk space in the database.
    • The compressed files can be directly sent to Marvin applets for display (text-based compression is used). This can reduce the download time for generated web pages.
    The compression is enabled by default. Sometimes you may need to disable this compression, e.g. third-party software accesses the structures which cannot interpret compressed ChemAxon format. After disabling the compression all new structures are inserted without compression.
  • Commit interval: A database option, which defines the batch size for import. Your compounds will be committed in accordingly sized batches to the database during the import process. The value of setting may effect the speed of import and the efficiency of duplicate filtering in case of parallel import. If this value is increased the speed of import will increase but there will be a slightly bigger chance of importing duplicates in spite filtering (if a duplicate comes in process A which was not committed in process B yet).
  • Number of update logs to keep: To enable the incremental update of the structure cache, the changes of each structure table are logged into a log table. When the next search is performed after changes, the cache is updated, and most of the old logs are deleted. If two or more applications cache the table separately, it can be useful to keep some of the latest logs for the other application(s): this number specifies how many logs should be kept.

11. Regeneration of Fix Columns

When a new version of JChem is released usually the calculated data in the structure tables have to be refreshed to be consistent with the new version. Normally this is offered by JChem Manager when connecting, but in some cases one may want to initiate regeneration by hand.

To regenerate fix columns, select the File -> Regenerate menu option, the "Regenerate" dialog will appear:

You can select one table, or regenerate all tables.

You can also regenerate the tables from command-line.



 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.