ward [<options>]
Set up the ward script or batch file as described in
Preparing the Usage of JChem Batch Files and Shell Scripts.
Or call the Ward class directly:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" chemaxon.clustering.Ward [<options>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.clustering.Ward [<options>]
Because the utility has many parameters, it may be reasonable to create a shell script or a batch file for calling the software.
General options:
-h --help this help message
-d --driver <JDBC driver> JDBC driver
-u --dburl <url> URL of database
-l --login <login> login name
-p --password <password> password
-P --proptable <tablename> name of property table
-s --saveconf save settings into ~/.jchem
Input options (default: standard input):
-i --input <filepath> input file path (text file input)
-q --query <sql> SQL query for reading input
(database input)
Output options (default: standard output):
-o --output <filepath> output file path (text file output)
-a --statement <sql> SQL statement for inserting results
(database output)
-x --central calculate and sign central objects
-y --singlet singletons get negative cluster ids
-z --statistics print statistics
-Z --only-statistics print only statistics
-K --Kelley <filepath> print Kelley statistics into text file
-v --verbose verbose output
Data properties:
-m --dimensions <dim> number of floating-point descriptors
-f --fingerprint-size <bits> binary fingerprint size in bits
fpsize should be a multiple of 32
-w --weights <w1> <w2> ... the weights of the floating-point descriptors
-g --generate-id generate id for each compound
Clustering parameters:
-c --cluster-count <count> number of clusters to be generated
-C --only-clustering clusters are generated using input RNN list
If --cluster-count is not set, then RNN list is generated on output.
Warning! Without a valid license key, the software runs in demo mode and at most 1000 structures can be retrieved from the database.
Specifying every parameter for the ward script at each run is inconvenient.
To overcome this problem, settings that are not changed frequently
can be saved in the .jchem file stored in the user's
home directory. Use the --saveconf option to
store the following settings:
- JDBC driver (--driver)
- URL of database (--dburl)
- login name (--login)
- password (--password)
- binary fingerprint size (--fingerprint-size)
The settings needed for the database connection are also modified and saved by JChemManager. If you have already connected to the database successfully using JChemManager, you do not need to set the connection parameters for Ward manually.
For more information on setting connection parameters:
- JDBC driver (--driver)
- URL of database (--dburl)
- login name (--login)
- password (--password)
The software may import data from either a text file
(--input) or a database (--query).
The input data must contain the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | Id of compounds (Optional in text files) |
| fp1, fp2, fp3, ... | Integer numbers | Fingerprints in integer number blocks. The number of fp columns is the fingerprint length / 32. (Optional) |
| d1, d2, d3, ... | Floating point numbers | Other descriptors (Optional) |
Comments:
- Use the --generate-id option if the id column
  is missing from the input data.
- Fingerprints for text file input can be generated with generatemd, for example:
  generatemd c -k CF -c cfp.xml -D <structures.smi >fingerprints.txt
  An example for the XML configuration file can be found in the
  examples/config directory (examples\config for Windows
  users).
- The cd_id and cd_fpi
  columns in JChem's structure tables are appropriate as input. For example:
  ward -q "SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6 FROM structures" ...
  (For the sake of readability, only 6 fp columns are used in the above example,
  but usually this number is much higher.)
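Because the number of fp columns follows from the fingerprint length (length / 32), the column list of the query can be generated rather than typed. The following is a minimal Python sketch under stated assumptions: the helper name `build_ward_query` and its parameters are hypothetical, and the table name `structures` and the `cd_fp<i>` column naming are simply taken from the example above.

```python
def build_ward_query(fingerprint_bits, table="structures", where=None):
    """Build a SELECT listing cd_id and every cd_fp<i> column.

    Hypothetical helper: it only assembles the query text that would be
    passed to ward with -q; it does not talk to any database.
    """
    if fingerprint_bits <= 0 or fingerprint_bits % 32 != 0:
        raise ValueError("fingerprint size must be a positive multiple of 32")
    n_columns = fingerprint_bits // 32  # one integer column per 32 bits
    columns = ["cd_id"] + [f"cd_fp{i}" for i in range(1, n_columns + 1)]
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        query += f" WHERE {where}"
    return query


if __name__ == "__main__":
    # A 512-bit fingerprint needs 512 / 32 = 16 fp columns.
    print(build_ward_query(512, where="cd_id < 10000"))
```

The resulting string can be passed to ward with the -q option together with a matching -f value.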
The software can write the results of clustering into either a text file
(--output) or a database table
(--statement).
The exported data contains the following columns:
| Columns | Type | Content |
|---|---|---|
| Id | Integer numbers | Identifier of compounds |
| Clid | Integer numbers | Cluster identifier |
| Centr | Integer numbers | Displays whether the object is central |
The last column is written only if the
--central option is specified.
A central object has the smallest sum of dissimilarities
to the other objects in the cluster.
Central object calculation slows down the application
significantly.
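The definition of a central object can be illustrated with a few lines of Python. This is only a sketch of the definition, not Ward's implementation: the function name `central_object` is hypothetical, and Euclidean distance (`math.dist`, Python 3.8+) is an assumed stand-in for whatever dissimilarity is actually used.

```python
import math


def central_object(members, dist=math.dist):
    """Return the member with the smallest sum of dissimilarities
    to the other members of the cluster.

    Assumes dist(a, a) == 0, so including the member itself in the
    sum does not change the result.
    """
    if not members:
        raise ValueError("cluster is empty")
    return min(members, key=lambda a: sum(dist(a, b) for b in members))


if __name__ == "__main__":
    cluster = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    # Sums of distances: 6.0, 5.0 and 9.0, so (1.0, 0.0) is central.
    print(central_object(cluster))
```

In this naive form every pair of members in a cluster is compared, which gives an idea of why the calculation is expensive for large clusters.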
Comments for text output:
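Comments for database output:
- A table for the results must exist before Ward is run, for example:
  CREATE TABLE clusters (
  cd_id INTEGER NOT NULL PRIMARY KEY,
  cluster_id INTEGER)
  or, if the --central option is used:
  CREATE TABLE clusters (
  cd_id INTEGER NOT NULL PRIMARY KEY,
  cluster_id INTEGER,
  central SMALLINT)
- The table should be empty before the results are inserted:
  DELETE FROM clusters;
- Specify an SQL statement (-a option),
  which inserts the rows containing the results.
  ward -a "INSERT INTO clusters(cd_id, cluster_id, central) VALUES(?,?,?)" ...
  The "?" symbols will be substituted with the corresponding
  values.
- The members of a cluster can then be retrieved with an ordinary query, for example:
  SELECT * FROM clusters WHERE cluster_id = 1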
central column is
Optionally, Ward can print clustering statistics to
the standard output or to the given output file.
Statistics printing is enabled by the --statistics
or the --only-statistics option. (With the latter, no
information on individual compounds is printed.)
The following data will be printed:
The calculation is significantly slower if statistics printing is enabled, since all pairwise dissimilarity values have to be calculated. (Heuristics cannot be applied.)
Hierarchical clustering techniques, like the Ward method, can
cluster the set at any chosen hierarchy level.
However, in most cases, there is no obvious way to select the
optimal number of clusters.
Using the --Kelley <filepath> option,
an optimized hierarchy level can be calculated using the
Kelley method, and the resulting statistics
are written into the specified file.
The Kelley measure balances the normalized "spread" of the clusters at a particular level with the number of clusters at that level. For a given cluster level l, it is defined as:

$$\mathrm{Kelley}_l = (n-2)\,\frac{\mathrm{AvSpr}_l - \min(\mathrm{AvSpr})}{\max(\mathrm{AvSpr}) - \min(\mathrm{AvSpr})} + 1 + k_l$$

where n is the number of elements in all clusters, k_l is the number of clusters, AvSpr_l is the average spread of the clusters at level l, and min(AvSpr) and max(AvSpr) are the minimum and maximum of this value across all of the cluster levels.
The spread of a cluster m is given by:

$$\mathrm{Spr}_m = \frac{\sum_{i<j} \mathrm{dist}(i,j)}{N(N-1)/2}$$

where N is the number of members in the cluster, i and j are members of cluster m, and dist(i,j) is the Euclidean distance between the two members i and j.
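The two definitions can be tried out on small data with the Python sketch below. It is an illustration of the definitions as described in the text (spread as the mean pairwise Euclidean distance, Kelley measure as the normalized average spread plus the number of clusters), not Ward's code. The function names are hypothetical, and treating a single-member cluster as having zero spread is an assumption of this sketch.

```python
import math
from itertools import combinations


def spread(members, dist=math.dist):
    """Mean pairwise distance between the members of one cluster."""
    n = len(members)
    if n < 2:
        return 0.0  # assumption: a singleton has no spread
    total = sum(dist(a, b) for a, b in combinations(members, 2))
    return total / (n * (n - 1) / 2)


def kelley_measures(levels, n):
    """Compute the Kelley measure for every level.

    levels: list of (k_l, AvSpr_l) pairs, i.e. the number of clusters
            and the average spread at each level.
    n:      number of elements in all clusters.
    Returns a dict mapping k_l to the Kelley measure.
    """
    spreads = [s for _, s in levels]
    lo, hi = min(spreads), max(spreads)
    span = hi - lo
    result = {}
    for k, s in levels:
        # Normalized spread lies between 1 and n - 1.
        normalized = (n - 2) * (s - lo) / span + 1 if span > 0 else 1.0
        result[k] = normalized + k
    return result


if __name__ == "__main__":
    levels = [(1, 4.0), (2, 1.0), (3, 0.5), (4, 0.0)]
    measures = kelley_measures(levels, n=5)
    best = min(measures, key=measures.get)
    print(measures, "suggested number of clusters:", best)
```

The level with the smallest measure is the suggested one; at the two extremes (one cluster, or almost as many clusters as elements) the measure approaches n, which matches the shape of the sample kelley.txt output in the examples.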
--fingerprint-size
--dimensions
--weights
--cluster-count
By default, the heap size in some Java runtime environments is limited to 64MB, so you may run out of memory easily. See the FAQ on increasing the heap size.
Setting the --cluster-count option correctly is important
in fine-tuning the clustering process.
Since reciprocal nearest neighbor searching is much more time consuming
than the clustering stage, it is reasonable to separate
the two processes.
In that case clustering can be run several times with different
--cluster-count settings.
If --cluster-count is not specified,
Ward collects and stores the list of
RNN pairs and their distances in a text file. If this file is fed into Ward,
the RNN searching is omitted.
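For readers unfamiliar with the term, the Python sketch below shows what a reciprocal nearest neighbor (RNN) pair is: two objects that are each other's nearest neighbor. It is a brute-force illustration on a handful of points only. The function name is hypothetical, Euclidean distance is an assumption, and neither the search strategy nor the output resembles the RNN list file that Ward writes.

```python
import math


def reciprocal_nearest_neighbors(points, dist=math.dist):
    """Return (i, j, distance) for every pair of indices where
    j is the nearest neighbor of i and i is the nearest neighbor of j.

    Brute force: every point is compared with every other point.
    """
    n = len(points)
    if n < 2:
        return []
    nearest = []
    for i in range(n):
        others = (k for k in range(n) if k != i)
        nearest.append(min(others, key=lambda k: dist(points[i], points[k])))
    return [
        (i, j, dist(points[i], points[j]))
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i  # report each pair once
    ]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]
    print(reciprocal_nearest_neighbors(pts))
```

Even this toy version needs a nearest neighbor search for every object, which hints at why the search stage dominates the running time and why storing its result for reuse pays off.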
When creating the RNN list without clustering, the
--central, --statistics and --only-statistics
options are not available.
If the --only-clustering option is specified
for Ward, then
central object calculation (--central) is not available
--query, --weights, --generate-id, --dimensions, --fingerprint-size
The examples assume that all connection parameters have been set and stored by JChemManager (or saved by a previous run of Ward).
set QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
ward -q %QUERY% -c 100 -f 512
QUERY="SELECT cd_id, cd_fp1, cd_fp2, cd_fp3, cd_fp4, cd_fp5, cd_fp6, cd_fp7, cd_fp8, cd_fp9, cd_fp10, cd_fp11, cd_fp12, cd_fp13, cd_fp14, cd_fp15, cd_fp16 FROM structures WHERE cd_id < 10000"
INSERT="INSERT INTO clusters(cd_id, cluster_id) VALUES(?,?)"
ward -q "$QUERY" -a "$INSERT" -c 100 -f 512
Make sure that the clusters table exists and is empty
before running the script.
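The preparation of the clusters table and the role of the "?" placeholders can be tried out in isolation. The Python sketch below uses an in-memory SQLite database purely as a stand-in (an assumption; Ward itself connects through JDBC and performs the inserts on its own), and the cluster assignments are made-up sample rows. The function name `store_clusters` is hypothetical.

```python
import sqlite3


def store_clusters(conn, rows):
    """Create the clusters table if needed, empty it, and insert
    (cd_id, cluster_id) rows through '?' placeholders."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clusters ("
        "cd_id INTEGER NOT NULL PRIMARY KEY, "
        "cluster_id INTEGER)"
    )
    conn.execute("DELETE FROM clusters")  # make sure the table is empty
    # Each '?' is replaced by the corresponding value of a row.
    conn.executemany(
        "INSERT INTO clusters(cd_id, cluster_id) VALUES(?,?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_clusters(conn, [(1, 1), (2, 1), (3, 2)])  # made-up sample rows
members = [
    row[0]
    for row in conn.execute(
        "SELECT cd_id FROM clusters WHERE cluster_id = 1 ORDER BY cd_id"
    )
]
print(members)
```

Emptying the table first matters because cd_id is the primary key: rows left over from an earlier run would make the inserts fail.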
generatemd c -k CF -c cfp.xml -D <input.smi | ward -f 512 -c 100 -g
generatemd c -k PF -c pharma-frag.xml -D <input.smi | ward -f 0 -m 210 -c 100 -g
Clustering with different -c parameters,
using the output of an RNN list generation.
Singletons get negative cluster ids.
generatemd c -k CF -c cfp.xml -D <input.smi >fingerprints.txt
ward -f 512 -g <fingerprints.txt >neighborlists.txt
ward -C -c 10 -y <neighborlists.txt >clusters.10.txt
ward -C -c 50 -y <neighborlists.txt >clusters.50.txt
ward -C -c 100 -y <neighborlists.txt >clusters.100.txt
generatemd c input.smi -k CF -c cfp.xml -D -o fingerprints.txt
ward -f 512 -g -K kelley.txt <fingerprints.txt >neighborlists.txt
An example for the generated text file (kelley.txt):
Kelley Indexes for All Cluster Levels
level index
1 500.000
2 261.018
...
18 32.038
...
498 499.000
499 500.000
Optimal number of clusters: 18
Clustering using the suggested number of clusters and the generated RNN list. Singletons get negative cluster ids.
ward -C -c 18 -y <neighborlists.txt >clusters.18.txt
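The suggested cluster count can also be read from kelley.txt by a script instead of by eye. The following Python sketch assumes that the file holds one "level index" pair per line and ends with an "Optimal number of clusters:" line, as in the sample above; the exact file layout is an assumption, and the function name `parse_kelley` is hypothetical.

```python
import re

# Sample content modelled on the kelley.txt example (abbreviated).
SAMPLE = """Kelley Indexes for All Cluster Levels
level index
1 500.000
2 261.018
18 32.038
498 499.000
499 500.000
Optimal number of clusters: 18
"""


def parse_kelley(text):
    """Return ({level: index}, optimal) parsed from Kelley statistics text.

    optimal is None if no 'Optimal number of clusters' line is found.
    """
    indexes = {}
    optimal = None
    for line in text.splitlines():
        found = re.match(r"\s*Optimal number of clusters:\s*(\d+)", line)
        if found:
            optimal = int(found.group(1))
            continue
        found = re.match(r"\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$", line)
        if found:
            indexes[int(found.group(1))] = float(found.group(2))
    return indexes, optimal


indexes, optimal = parse_kelley(SAMPLE)
# Command line for the follow-up clustering run, as in the example above.
command = f"ward -C -c {optimal} -y <neighborlists.txt >clusters.{optimal}.txt"
print(command)
```

The parsed indexes can also be used to inspect near-optimal levels, for instance when the minimum is shallow and a slightly different cluster count is preferred.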
generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
ward -g -c 10 -f 512 < fingerprints.txt > clusters.txt
crview -i id -c "clid=1" -s input.sdf -t clusters.txt >ward_result1.sdf
mview -c 3 -r 3 -f NSC ward_result1.sdf
generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
ward -g -c 10 -f 512 -x -z < fingerprints.txt > clusters.txt
crview -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt >ward_result2.sdf
mview -c 3 -r 3 -f "NSC:clid:size" ward_result2.sdf