Jarp

Version 5.0.3

Contents

 

Introduction

Jarp performs variable-length Jarvis-Patrick clustering based on fingerprints and/or other data stored in a database table or a file. The software can also be used for calculating diversity measures, like average and minimum dissimilarity of a compound library. This document mentions molecules as the entities to be clustered, but the software can also be used for other types of objects.

The original Jarvis-Patrick algorithm proceeds as follows:

  1. Calculate the set of K nearest neighbors for each structure.
  2. Two structures cluster together if
    • they are in each others list of nearest neighbors
    • they have at least Kmin of their J nearest neighbors in common.
Kmin and K are adjustable parameters. Several variations of the Jarvis-Patrick algorithm exist. The one applied in Jarp is based on variable-length lists of nearest neighbors:
  1. For each structure, collect the set of nearest neighbors that has a dissimilarity (distance) less than a T threshold value.
  2. Two structures cluster together if
    • they are in each others list of nearest neighbors
    • they have at least Rmin of their nearest neighbors in common, where Rmin is a ratio of the length of the shorter list
T and Rmin are adjustable parameters. The advantages of the latter method are explained by Brown and Martin and Barnard and Downs. The original algorithm has the following shortcomings: The application of variable-length lists solve the above problems.

Nearest neighbor searching finds molecules that are similar to the query object. The calculation applies the Tanimoto (or Jaccard) coefficient that is calculated by the following formula in the case of binary fingerprint (bit string) input:

T(A,B) = NA&B/(NA+NB-NA&B)

where NA and NB are the number of bits set in the bit strings of molecules A and B, respectively, NA&B is the number of bits that are set in both.

When only binary fingerprints are used for the calculation of the dissimilarity between molecules, then the formula of the dissimilarity of molecule A and B is

D(A,B) = 1-T(A,B)

where T(A,B) is the Tanimoto coefficient for molecule A and B.

When other types of columns are (also) used, a weighted Euclidean distance calculation is applied:

D(A,B) = sqrt{[1-T(A,B)] + w1[C1(A)-C1(B)]2 + w2[C2(A)-C2(B)]2 + ...}

where

Instead of the brute force method, Jarp applies heuristics to avoid calculating all pairwise dissimilarity calculations and neighbor list comparisons. According to our measurements, the speed of clustering is O(n1.5).

 

Usage

    jarp [<options>]

Prepare the usage of the jarp script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Or call the JarvisPatrick class directly:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.JarvisPatrick [<options>]

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
    chemaxon.clustering.JarvisPatrick [<options>]

Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.

Options

General options:
  -h  --help                    this help message
  -d  --driver <JDBC driver>    JDBC driver
  -u  --dburl <url>             URL of database
  -l  --login <login>           login name
  -p  --password <password>     password
  -P  --proptable <tablename>   name of property table
  -s  --saveconf                save settings into ~/.jchem

Input options (default: standard input):
  -i  --input <filepath>        input file to cluster (text file input)
  -q  --query <sql>             SQL query for reading input
                                (database input)

Output options (default: standard output):
  -o  --output <filepath>       output file path (text file output)
  -a  --statement <sql>         SQL statement for inserting results
                                (database output)
  -x  --central                 calculate and sign central objects
  -y  --singlet                 singletons get negative cluster ids 
  -z  --statistics              print statistics
  -Z  --only-statistics         print only statistics
  -v  --verbose                 verbose output

Data properties
  -m  --dimensions <dim>        number of floating-point descriptors
  -f  --fingerprint-size <bits> binary fingerprint size in bits.
                                fpsize should be a multiple of 32
  -w  --weights <w1> <w2> ...   the weights of the floating-point descriptors	
  -g  --generate-id             generate id for each compound.

Clustering conditions
  -t  --threshold <threshold>   maximum dissimilarity of two compounds
  -c  --common <ratio>          minimum ratio of common neighbors of two
                                compounds.

Warning! Without a valid license key, the software is in demo mode and maximum 1000 structures can be retrieved from the database.

 

Loading/Saving of Settings

It would be inconvenient to enter all of the parameters of the jarp script at each run. To overcome this problem, it is possible to save some of the settings that are not changed frequently in the .jchem file stored in the user's home directory. Use the --saveconf option to store the following settings:

The settings needed for the database connection are also modified and saved by JChemManager. If you successfully entered into the database using JChemManager, then you don't need to set connection for Jarp manually.

 

Database Connections

For more information on setting connection parameters:

please visit the Administration Guide of JChem.

 

Input

The software may import data from either a text file (--input) or a database (--query). The input data must contain the following columns:

Columns Type Content
Id Integer numbers Id of compounds
(Optional in text files)
fp1, fp2, fp3 ... Integer numbers Binary fingerprints in integer number blocks
The number of fp. columns is
   fp. length / 32
(Optional)
d1, d2, d3, ... Floating point numbers Other descriptors
(Optional)

Comments:

 

Output

The software can write the results of clustering into either a text file (--output) or a database table (--statement).

The exported data contains the following columns:

Columns Type Content
Id Integer numbers Identifier of compounds
Clid Integer numbers Cluster identifier
Centr Integer numbers Displays whether the object is central

The last column is written only if the --central option is specified. A central object has the smallest sum of dissimilarities to the other objects in the cluster. Central object calculation slows down the application significantly.

Comments for text output:

Comments for database output:

 

Clustering Statistics

Optionally, Jarp can print clustering statistics into the standard output or the given output file. The parameters that enable statistics printing are --statistics or --only-statistics. (The latter one doesn't allow to print information on individual compounds.) The following data will be printed:

The calculation is significantly slower if statistics is enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)

 

Comments on Some Parameters

--fingerprint-size
The number of binary fingerprint columns multiplied by 32 (because the bit-length of integer numbers is 32 in Java)
--dimensions
Specifies the number of other columns. If only binary fingerprints are used in the clustering process, then this parameter doesn't have to be set.
--weights
When other columns are used, a weighted Euclidean distance calculation may be applied. If there are also binary fingerprint columns, weights are relative to the Tanimoto coefficient calculated from the binary fingerprints (the Tanimoto coefficient has a weight of 1.0).
--threshold
Same as T in the Introduction. The higher the threshold the longer the lists of neighbors that has to be stored in memory for each compound during clustering. Too long lists of neighbors can cause memory problems.
--common
Same as Rmin in the Introduction.

By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.

 

NNeib: Running Nearest Neighbor Search and Clustering Separately

Setting the minimum proportion of common neighbors (--common option) correctly, is important in fine tuning the clustering process. Since nearest neighbor searching is much more time consuming than the clustering stage, it is reasonable to separate the two processes. In that case clustering can be run several times with different --common settings.

NNeib is a command line application that collects and stores the nearest neighbor list in a text file. If this file is fed into Jarp, the nearest neighbor search is omitted. NNeib has the same command line parameters than Jarp, however, the
 --common
 --statistics
 --only-statistics
 --singlet
options are not available. Run

    nneib -h
for help on input parameters.

If the threshold value (--threshold) is not specified for Jarp, then

 

Examples

In the examples it is supposed that all connection parameters are set and stored by JChemManager (or a previous saving by Jarp)

  1. A batch file (Windows) for reading from a database and writing to the standard output:
    set QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
    jarp -q %QUERY% -t 0.1 -c 0.3 -f 512
    
  2. A UNIX shell script for reading from a database and writing to another table:
    QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
    INSERT="INSERT INTO clusters(cd_id, cluster_id) VALUES(?,?)"
    jarp -q "$QUERY" -a "$INSERT" -t 0.1 -c 0.3 -f 512
    

    Make sure that the clusters table exists and is empty before running the script.

  3. Clustering using the output of GenerateMD (in Unix)
    generatemd c input.smi -c CF -k cfp.xml -D | jarp -f 512 -t 0.1 -c 0.3 -g
    
  4. Clustering compounds using their pharmacophore fingerprints (in Unix)
    generatemd c input.smi -c PF -k pharma-frag.xml -D | jarp -f 0 -m 210 -t 0.1 -c 0.3 -g
    
  5. Testing different -c parameters. Using the output of NNeib. Singletons get negative cluster ids.
    generatemd c input.smi -k CF -c cfp.xml -D -o fingerprints.txt
    nneib -f 512 -t 0.1 -g <fingerprints.txt >neighborlists.txt
    jarp -c 0.2 -y <neighborlists.txt >clusters.0.2.txt
    jarp -c 0.3 -y <neighborlists.txt >clusters.0.3.txt
    jarp -c 0.4 -y <neighborlists.txt >clusters.0.4.txt
    
  6. Displaying the structures of the first cluster using the CreateView and MarvinView applications:
    • Clustering:
      generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
      jarp -g -t 0.1 -c 0.3 -f 512 < fingerprints.txt > clusters.txt
      
    • Creating an SDfile containing the structures from the first cluster (clid=1):
       
      crview -i id -c "clid=1" -s input.sdf -t clusters.txt >jarp_result1.sdf
      
    • Displaying the structures and the NSC field (it comes from the original SDfile):
      mview -c 3 -r 3 -f NSC jarp_result1.sdf
      
    A screenshot of MarvinView showing the cluster:

  7. Displaying the central objects of clusters that contain at least 20 compounds (size>=20) using the CreateView and MarvinView applications:
    • Clustering:
      generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
      jarp -g -t 0.1 -c 0.3 -f 512 -x -z < fingerprints.txt > clusters.txt
      
    • Creating an SDfile containing central objects of the clusters satisfying the condition:
      crview  -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt >jarp_result1.sdf
      
    • Displaying the structures, the NSC field (comes from the original SDfile), and the cluster size (only for the central compounds):
      mview -c 3 -r 3 -f "NSC:clid:size" jarp_result2.sdf
      
    A screenshot of MarvinView showing the central objects:

 

References

  1. Jarvis, R. A.; Patrick, E. A. Clustering Using a Similarity Measure Based on Shared Nearest Neighbors IEEE Trans. Comput. 1973, C22, 1025-1034
  2. Brown, R. D.; Martin. Y. C. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection J. Chem. Inf. Comput. Sci. 1996, 36, 572-584
  3. Barnard, J. M.; Downs, G. M. Chemical Fragment Generation and Clustering Software J. Chem. Inf. Comput. Sci. 1997, 37, 141-142
 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.