Compr compares two sets of objects (like compound libraries) using diversity and dissimilarity calculations.
The algorithm applies nearest neighbor searching that finds molecules similar to the query object. The calculation applies the Tanimoto (or Jaccard) coefficient that is calculated by the following formula in the case of binary fingerprint (bit string) input:
T(A,B) = NA&B/(NA+NB-NA&B)
where NA and NB are the number of bits set in the bit strings of molecules A and B, respectively, NA&B is the number of bits that are set in both.
When only binary fingerprints are used for the calculation of the dissimilarity between molecules, then the formula of the dissimilarity of molecule A and B is
D(A,B) = 1-T(A,B)
where T(A,B) is the Tanimoto coefficient for molecule A and B.
When other columns are also used, a weighted Euclidean distance calculation is applied:
D(A,B) = sqrt{[1-T(A,B)] + w1[C1(A)-C1(B)]2 + w2[C2(A)-C2(B)]2 + ...}
where
Instead of the brute force method, Compr applies heuristics to avoid calculating all pairwise dissimilarity calculations and neighbor list comparisons.
compr [<options>]
Prepare the usage of the compr script or batch file
as described in
Preparing the Usage of JChem Batch Files and Shell
Scripts.
Or call the Compare class directly:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.Compare [<options>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.clustering.Compare [<options>]
Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.
General options:
-h --help this help message
-d --driver <JDBC driver> JDBC driver
-u --dburl <url> URL of database
-l --login <login> login name
-p --password <password> password
-P --proptable <tablename> name of property table
-s --saveconf save settings into ~/.jchem
Input options (default: standard input):
-i --input <path1> <path2> input files to compare (text file input)
-q --query <sql1> <sql2> SQL query strings for reading input
(database input)
Combined input (table vs. file comparison):
-i <path> -q <sql>
Output options (default: standard output):
-o --output <filepath> output file path (text file output)
-a --statement <sql> SQL statement for inserting results
(database output)
-z --stat print statistics on overall dissimilarity
-Z --only-stat statistics only on overall dissimilarity
-y --only-dissimilar print only dissimilar objects
-L --list-similar list similar objects from the first set
-v --verbose verbose output
Data properties
-m --dimensions <dim> number of floating-point descriptors
-f --fingerprint-size <bits> binary fingerprint size in bits.
fpsize should be a multiple of 32
-w --weights <w1> <w2> ... the weights of the floating-point descriptors
-g --generate-id generate id for each compound.
Conditions of calculation
-t --threshold <threshold> maximum dissimilarity of two compounds
compounds (default: 1)
-D --different-ids don't compare compounds with the same id
-c --max-count <count> maximum number of similar objects listed
-r --order sort similar objects by distance (closest first)
Warning! Without a valid license key, the software is in demo mode and maximum 1000 structures can be retrieved from the database.
compr script at each run.
To overcome this problem, it is possible to save
some of the settings that are not changed frequently
in the .jchem file stored in the user's
home directory. Use the --saveconf option to
store the following settings:
--driver)
--dburl)
--login)
--password)
--fingerprint-size)
The settings needed for the database connection are also modified and saved by JChemManager. If you successfully entered into the database using JChemManager, then you don't need to set connection for Compr manually.
For more information on setting connection parameters:
--driver)
--dburl)
--login)
--password)
--proptable)
Two data sources are retrieved, which contain
the data of molecule libraries to be compared.
The software may import data from either text files
(--input) or database tables
(--query).
Each input data source must contain the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | Id of compounds (Optional in text files) |
| fp1, fp2, fp3 ... | Integer numbers |
Binary fingerprints in integer number blocks The number of fp. columns is fp. length / 32(Optional) |
| d1, d2, d3, ... | Floating point numbers | Other descriptors (Optional) |
Comments:
--generate-id option if the id column
is missing from the input data.
generatemd c structures.smi -k CF -c cfp.xml -D -o fingerprints.txtA sample XML configuration file (
cfp.xml) can be found in the
config directory under the examples directory.
cd_id and
cd_fpi
columns in JChem's structure tables are appropriate as input.
compr -q "SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6 FROM structures" ...
(For the sake of readability only 6 fp. columns is applied in the above example,
but usually this number is much higher.)
The software can write the results of the calculation
into either a text file
(--output) or a database table
(--statement).
Each row of the output belong to a compound in the second set. The following symbols are used in the description of columns:
| L1, L2 | The compound libraries specified in this order in the call of Compr |
| C | The compound from L2, which belongs to the specific row | D(C,Ai) | The dissimilarity between C and compound i from L1 |
The exported data contains a subset of the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | The identifier of C |
| minD | Floating point numbers | The dissimilarity of C and its nearest neighbor from L1: min(D(C,Ai)) |
| nneib | Integer numbers | The nearest neighbor of C in L1 |
| simcnt | Integer numbers | The number of objects in L1, which are similar to C. (Their dissimilarity is lower than or equal to the specified threshold value.) |
| avgD | Floating point numbers | The average dissimilarity of C and all compounds from L1: sum(D(C,Ai))/N(L1) |
| maxD | Floating point numbers | The maximum dissimilarity between C and all compounds from L1: max(D(C,Ai)) |
| list_of_similar_objects | Integer numbers in several columns | The list of molecules with a dissimilarity below the threshold (neighbors) in one row. |
Only the column "Id" is printed if the
--different-ids option is specified.
"avgD" and "maxD" columns are written only if either the
--statistics or the --only-statistics
option is given. The "list_of_similar_objects" columns
are printed if the --list-similar
option is specified.
Comments for database output:
--different-ids option is specified)
CREATE TABLE compr_result (
cd_id INTEGER NOT NULL PRIMARY KEY)
--different-ids, --statistics,
and --only-statistics is specified)
CREATE TABLE compr_result (
cd_id INTEGER NOT NULL PRIMARY KEY,
minD FLOAT,
nneib INTEGER,
simcnt INTEGER)
--statistics is specified)
CREATE TABLE compr_result (
cd_id INTEGER NOT NULL PRIMARY KEY,
minD FLOAT,
nneib INTEGER,
simcnt INTEGER,
avgD FLOAT,
maxD FLOAT)
DELETE FROM compr_results;
-a option),
which inserts the rows containing the results.
compr -a "INSERT INTO compr_results(cd_id, minD, nneib, simcnt) VALUES(?,?,?,?)" ...
The "?" symbols will be substituted with the corresponding
values.
SELECT * FROM compr_results WHERE minD > 0.05
--list-similar option, because in then the number of
columns is not fix.
Optionally, Compr can print diversity statistics into
the standard output or the given output file.
The parameters that enable statistics printing are --statistics
or --only-statistics. (The latter one doesn't allow to
print information on individual compounds.)
The following data will be printed:
The calculation is significantly slower if statistics is enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)
--fingerprint-size
--dimensions
--weights
--threshold
--different-ids
By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.
In the examples it is supposed that, when needed, all connection parameters are set and stored by JChemManager (or a previous saving by Compr)
commercial_library table
that are similar to the structures in the
home_library table to the standard output:
set QUERY1="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM home_library" set QUERY2="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM commercial_library" compr -q %QUERY1% %QUERY2% -t 0.1 -f 512 -D
home_library and commercial_library)
and insert the results into another table.
It collects id-s and calculates dissimilarity information on
structures in the commercial_library,
which are similar to some structures in the
home_library:
QUERY1="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM home_library" QUERY2="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM commercial_library" INSERT="INSERT INTO simil(cd_id, minD, nneib, simcnt) VALUES(?,?,?,?)" compr -q "$QUERY1" "$QUERY2" -a "$INSERT" -t 0.1 -f 512
Make sure that the simil table exists and is empty
before running the script.
generatemd c home_library.smi -k CF -c cfp.xml -D -o home_library.fp generatemd c commercial_library.smi -k CF -c cfp.xml -D -o commercial_library.fp compr -f 512 -t 0.1 -g -z -i home_library.fp commercial_library.fp >compr_result.tabExample for the result file:
id minD nneib simcnt avgD maxD 1 0.165 40092 0 0.531 0.770 2 0.056 8998 1 0.546 0.766 .... 999 0.095 365 1 0.521 0.804 1000 0.103 221 0 0.527 0.821 STATISTICS Number of objects in set 1 = 90000 Number of objects in set 2 = 1000 Minimum dissimilarity between sets = 0.0057803392 Average dissimilarity between sets = 0.543819 Maximum dissimilarity between sets = 0.82962966
compr -f 512 -t 0.1 -g -z -i home_library.fp home_library.fp >compr_result.tab
crview -i id -d "minD:avgD:simcnt" -s commercial_library.sdf -t compr_result.tab > compr_result.sdf
mview -r 3 -f "CAS_RN:minD:avgD:simcnt" compr_result.sdf