Jarp performs variable-length Jarvis-Patrick clustering based on fingerprints and/or other data stored in a database table or a file. The software can also be used for calculating diversity measures, like average and minimum dissimilarity of a compound library. This document mentions molecules as the entities to be clustered, but the software can also be used for other types of objects.
The original Jarvis-Patrick algorithm proceeds as follows:
Nearest neighbor searching finds molecules that are similar to the query object. The calculation applies the Tanimoto (or Jaccard) coefficient that is calculated by the following formula in the case of binary fingerprint (bit string) input:
$$T(A,B) = \frac{N_{A\&B}}{N_A + N_B - N_{A\&B}}$$
where $N_A$ and $N_B$ are the numbers of bits set in the bit strings of molecules A and B, respectively, and $N_{A\&B}$ is the number of bits that are set in both.
When only binary fingerprints are used for the calculation of the dissimilarity between molecules, then the formula of the dissimilarity of molecule A and B is
D(A,B) = 1-T(A,B)
where T(A,B) is the Tanimoto coefficient for molecule A and B.
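The two formulas above can be illustrated with a minimal sketch. It assumes that each fingerprint is held as a plain Python integer whose set bits are the fingerprint bits; the function names and the handling of two empty fingerprints are illustrative assumptions, not part of Jarp.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit strings stored as Python ints."""
    n_common = bin(a & b).count("1")   # bits set in both A and B
    n_a = bin(a).count("1")            # bits set in A
    n_b = bin(b).count("1")            # bits set in B
    denominator = n_a + n_b - n_common
    # Assumption: two empty fingerprints are treated as identical.
    return 1.0 if denominator == 0 else n_common / denominator


def dissimilarity(a: int, b: int) -> float:
    """Dissimilarity D(A,B) = 1 - T(A,B) for fingerprint-only input."""
    return 1.0 - tanimoto(a, b)


fp_a = 0b11110000
fp_b = 0b00111100
# 2 common bits, 4 bits set in each: T = 2 / (4 + 4 - 2)
print(tanimoto(fp_a, fp_b), dissimilarity(fp_a, fp_b))
```

Identical fingerprints give a dissimilarity of 0, and fingerprints with no common bits give 1.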
When other types of columns are (also) used, a weighted Euclidean distance calculation is applied:
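$$D(A,B) = \sqrt{[1-T(A,B)] + w_1[C_1(A)-C_1(B)]^2 + w_2[C_2(A)-C_2(B)]^2 + \dots}$$
where $w_1, w_2, \dots$ are the weights of the additional columns (see --weights) and $C_i(A)$ is the value of the i-th additional column for molecule A.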
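A small sketch of the weighted distance, following the formula exactly as it is printed above (the fingerprint term enters unsquared). The function and parameter names are hypothetical and the numbers are made up for illustration.

```python
import math


def weighted_distance(tanimoto_ab, cols_a, cols_b, weights):
    """Distance combining the fingerprint term with weighted column terms."""
    total = 1.0 - tanimoto_ab                 # fingerprint contribution
    for w, c_a, c_b in zip(weights, cols_a, cols_b):
        total += w * (c_a - c_b) ** 2         # w_i * [C_i(A) - C_i(B)]^2
    return math.sqrt(total)


# Made-up values: T(A,B) = 0.75, two extra descriptor columns.
d = weighted_distance(0.75, [1.0, 2.0], [2.0, 4.0], [0.25, 0.125])
print(d)  # 0.25 + 0.25*1 + 0.125*4 = 1.0, so the distance is 1.0
```

Larger weights make differences in the corresponding column count more heavily relative to the fingerprint term.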
Instead of the brute-force method, Jarp applies heuristics to avoid computing all pairwise dissimilarities and neighbor list comparisons. According to our measurements, the running time of clustering scales as $O(n^{1.5})$.
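For orientation, the brute-force method that the heuristics avoid can be sketched as follows. This is not Jarp's implementation. The merge rule used here (two objects join when they are within the threshold of each other and share at least the given ratio of the shorter neighbor list) is an assumption made for illustration, as are all names.

```python
from itertools import combinations


def jarvis_patrick(dist, n, threshold, common_ratio):
    """Brute-force variable-length Jarvis-Patrick sketch.

    dist(i, j) returns the dissimilarity of objects i and j.
    Returns a list that maps each object index to a cluster label.
    """
    # Variable-length neighbor lists: every object within the threshold.
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):    # all pairs: O(n^2)
        if dist(i, j) <= threshold:
            neighbors[i].add(j)
            neighbors[j].add(i)

    parent = list(range(n))                   # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbors[i]:
            if j <= i:
                continue                      # handle each pair once
            # Assumed rule: ratio of shared neighbors to the shorter list.
            shared = len(neighbors[i] & neighbors[j])
            shorter = min(len(neighbors[i]), len(neighbors[j]))
            if shared / shorter >= common_ratio:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0]
labels = jarvis_patrick(lambda i, j: abs(points[i] - points[j]),
                        len(points), threshold=0.25, common_ratio=0.5)
print(labels)
```

With these made-up points the first three and the next three objects form two clusters, and the last object remains a singleton.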
jarp [<options>]
Prepare the jarp script or batch file for use as described in Preparing the Usage of JChem Batch Files and Shell Scripts.
Or call the JarvisPatrick class directly:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.JarvisPatrick [<options>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.clustering.JarvisPatrick [<options>]
Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.
General options:
-h --help this help message
-d --driver <JDBC driver> JDBC driver
-u --dburl <url> URL of database
-l --login <login> login name
-p --password <password> password
-P --proptable <tablename> name of property table
-s --saveconf save settings into ~/.jchem
Input options (default: standard input):
-i --input <filepath> input file to cluster (text file input)
-q --query <sql> SQL query for reading input
(database input)
Output options (default: standard output):
-o --output <filepath> output file path (text file output)
-a --statement <sql> SQL statement for inserting results
(database output)
-x --central calculate and sign central objects
-y --singlet singletons get negative cluster ids
-z --statistics print statistics
-Z --only-statistics print only statistics
-v --verbose verbose output
Data properties
-m --dimensions <dim> number of floating-point descriptors
-f --fingerprint-size <bits> binary fingerprint size in bits.
fpsize should be a multiple of 32
-w --weights <w1> <w2> ... the weights of the floating-point descriptors
-g --generate-id generate id for each compound.
Clustering conditions
-t --threshold <threshold> maximum dissimilarity of two compounds
-c --common <ratio> minimum ratio of common neighbors of two
compounds.
Warning! Without a valid license key, the software runs in demo mode and at most 1000 structures can be retrieved from the database.
Specifying every setting for the jarp script at each run is inconvenient.
To overcome this problem, settings that do not change frequently
can be saved in the .jchem file stored in the user's
home directory. Use the --saveconf option to
store the following settings:
JDBC driver (--driver)
URL of database (--dburl)
login name (--login)
password (--password)
binary fingerprint size (--fingerprint-size)
The settings needed for the database connection are also modified and saved by JChemManager. If you successfully entered into the database using JChemManager, then you don't need to set connection for Jarp manually.
For more information on setting connection parameters:
JDBC driver (--driver)
URL of database (--dburl)
login name (--login)
password (--password)
The software may import data from either a text file
(--input) or a database (--query).
The input data must contain the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | Id of compounds (Optional in text files) |
| fp1, fp2, fp3 ... | Integer numbers | Binary fingerprints in integer number blocks. The number of fp columns is the fingerprint length / 32. (Optional) |
| d1, d2, d3, ... | Floating point numbers | Other descriptors (Optional) |
Comments:
Use the --generate-id option if the id column
is missing from the input data.
generatemd c structures.smi -k CF -c cfp.xml -D -o fingerprints.txt
The cd_id and cd_fpi columns in JChem's structure tables are appropriate as input.
jarp -q "SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6 FROM structures" ...
(For the sake of readability, only 6 fp columns are used in the above example,
but usually this number is much higher.)
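The relationship between the fingerprint length and the number of fp columns in the input table can be sketched as below. The bit order within a block and the use of signed 32-bit values are assumptions made for illustration; the layout that JChem actually stores may differ.

```python
def pack_fingerprint(bits):
    """Split a list of 0/1 bits into 32-bit integer blocks (fp1, fp2, ...)."""
    if len(bits) % 32 != 0:
        raise ValueError("fingerprint size should be a multiple of 32")
    blocks = []
    for start in range(0, len(bits), 32):
        value = 0
        for bit in bits[start:start + 32]:
            value = (value << 1) | bit        # assumed: first bit is most significant
        if value >= 2 ** 31:
            value -= 2 ** 32                  # assumed: signed 32-bit storage
        blocks.append(value)
    return blocks


def bits_set(blocks):
    """Count the set bits across all blocks, ignoring the sign convention."""
    return sum(bin(block & 0xFFFFFFFF).count("1") for block in blocks)


fingerprint = [0] * 512
fingerprint[0] = 1
fingerprint[511] = 1
columns = pack_fingerprint(fingerprint)
print(len(columns), bits_set(columns))        # 512 / 32 = 16 columns, 2 bits set
```

A 512-bit fingerprint therefore occupies 16 fp columns, which matches the --fingerprint-size value of 512 used with 16 columns elsewhere in this document.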
The software can write the results of clustering into either a text file
(--output) or a database table
(--statement).
The exported data contains the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | Identifier of compounds |
| Clid | Integer numbers | Cluster identifier |
| Centr | Integer numbers | Displays whether the object is central |
The last column is written only if the
--central option is specified.
A central object has the smallest sum of dissimilarities
to the other objects in the cluster.
Central object calculation slows down the application
significantly.
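The definition of a central object given above can be sketched as follows. The function name and the sample data are hypothetical. The sketch also shows why the calculation is costly: every member is compared with every other member of its cluster.

```python
def central_object(members, dist):
    """Return the member with the smallest sum of dissimilarities to the others."""
    return min(
        members,
        key=lambda m: sum(dist(m, other) for other in members if other != m),
    )


# Made-up cluster: compound ids mapped to a one-dimensional descriptor value.
values = {10: 0.0, 11: 0.3, 12: 0.5, 13: 0.6, 14: 2.0}
center = central_object(list(values), lambda a, b: abs(values[a] - values[b]))
print(center)  # 12
```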
Comments for text output:
Comments for database output:
CREATE TABLE clusters (
cd_id INTEGER NOT NULL PRIMARY KEY,
cluster_id INTEGER)
CREATE TABLE clusters (
cd_id INTEGER NOT NULL PRIMARY KEY,
cluster_id INTEGER,
central SMALLINT)
DELETE FROM clusters;
Specify an SQL statement (-a option),
which inserts the rows containing the results.
jarp -a "INSERT INTO clusters(cd_id, cluster_id, central) VALUES(?,?,?)" ...
The "?" symbols will be substituted with the corresponding
values.
SELECT * FROM clusters WHERE cluster_id = 1
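The way the "?" symbols are filled in can be tried out with a small stand-in. The sketch below uses Python's built-in sqlite3 module, which happens to use the same placeholder syntax; Jarp itself connects through JDBC, and the result rows here are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")            # throwaway in-memory database
conn.execute(
    """CREATE TABLE clusters (
           cd_id INTEGER NOT NULL PRIMARY KEY,
           cluster_id INTEGER,
           central SMALLINT)"""
)

# Made-up clustering results: (cd_id, cluster_id, central)
results = [(1, 1, 1), (2, 1, 0), (3, 2, 1)]
insert_sql = "INSERT INTO clusters(cd_id, cluster_id, central) VALUES(?,?,?)"
conn.executemany(insert_sql, results)         # each "?" takes one value per row
conn.commit()

rows = conn.execute(
    "SELECT * FROM clusters WHERE cluster_id = 1 ORDER BY cd_id"
).fetchall()
print(rows)  # [(1, 1, 1), (2, 1, 0)]
```

The order of the columns named in the INSERT statement determines which result value each "?" receives.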
Optionally, Jarp can print clustering statistics to
the standard output or the given output file.
The options that enable statistics printing are --statistics
and --only-statistics. (The latter suppresses the
information on individual compounds.)
The following data will be printed:
The calculation is significantly slower if statistics are enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)
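The cost of the statistics comes from visiting every pair of objects. A minimal sketch of two diversity measures, the average and the minimum pairwise dissimilarity, is given below. It assumes fingerprint-only input held as Python integers and takes the minimum over all pairs; the exact set of values that Jarp prints is not reproduced here.

```python
from itertools import combinations


def tanimoto_dissimilarity(a: int, b: int) -> float:
    """1 - Tanimoto coefficient for two bit strings stored as ints."""
    common = bin(a & b).count("1")
    union = bin(a).count("1") + bin(b).count("1") - common
    return 0.0 if union == 0 else 1.0 - common / union


def diversity(fingerprints):
    """Average and minimum dissimilarity over all pairs: n*(n-1)/2 evaluations."""
    values = [tanimoto_dissimilarity(a, b)
              for a, b in combinations(fingerprints, 2)]
    return sum(values) / len(values), min(values)


library = [0b1100, 0b0110, 0b0011]            # made-up 4-bit fingerprints
average, minimum = diversity(library)
print(average, minimum)
```

For the three made-up fingerprints the pairwise dissimilarities are 2/3, 1 and 2/3, so the average is 7/9 and the minimum is 2/3.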
--fingerprint-size
--dimensions
--weights
--threshold
--common
By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.
Setting the minimum ratio of common neighbors (--common option) correctly is important
in fine-tuning the clustering process.
Since nearest neighbor searching is much more time consuming
than the clustering stage, it is reasonable to separate
the two processes.
In that case clustering can be run several times with different
--common settings.
NNeib is a command line application that collects and stores the
nearest neighbor list in a text file. If this file is fed into Jarp,
the nearest neighbor search is omitted.
NNeib has the same command line parameters as Jarp; however, the
--common
--statistics
--only-statistics
--singlet
options are not available. Run
nneib -h
for help on input parameters.
If the threshold value (--threshold) is not specified
for Jarp, then
with the --statistics and --only-statistics
options, diversity statistics are not printed
central object calculation (--central) is not available
--query, --weights, --generate-id, --dimensions, --fingerprint-size
The examples assume that all connection parameters have been set and stored by JChemManager (or saved by a previous Jarp run).
set QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
jarp -q %QUERY% -t 0.1 -c 0.3 -f 512
QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
INSERT="INSERT INTO clusters(cd_id, cluster_id) VALUES(?,?)"
jarp -q "$QUERY" -a "$INSERT" -t 0.1 -c 0.3 -f 512
Make sure that the clusters table exists and is empty
before running the script.
generatemd c input.smi -k CF -c cfp.xml -D | jarp -f 512 -t 0.1 -c 0.3 -g
generatemd c input.smi -k PF -c pharma-frag.xml -D | jarp -f 0 -m 210 -t 0.1 -c 0.3 -g
Clustering with different -c parameters,
using the output of NNeib.
Singletons get negative cluster ids.
generatemd c input.smi -k CF -c cfp.xml -D -o fingerprints.txt
nneib -f 512 -t 0.1 -g <fingerprints.txt >neighborlists.txt
jarp -c 0.2 -y <neighborlists.txt >clusters.0.2.txt
jarp -c 0.3 -y <neighborlists.txt >clusters.0.3.txt
jarp -c 0.4 -y <neighborlists.txt >clusters.0.4.txt
generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
jarp -g -t 0.1 -c 0.3 -f 512 < fingerprints.txt > clusters.txt
crview -i id -c "clid=1" -s input.sdf -t clusters.txt >jarp_result1.sdf
mview -c 3 -r 3 -f NSC jarp_result1.sdf
generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
jarp -g -t 0.1 -c 0.3 -f 512 -x -z < fingerprints.txt > clusters.txt
crview -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt >jarp_result2.sdf
mview -c 3 -r 3 -f "NSC:clid:size" jarp_result2.sdf