HitStatistics is a tool that helps set up the screening (with ScreenMD) of a large database of molecules against a class of actives. It performs the screening on a smaller, specially prepared set consisting of molecules from a potential target set and some specifically added molecules, then evaluates the screening and reports how well it performed. Screening with HitStatistics can be performed with different molecular descriptors and parametrized metrics. Based on the results of this evaluation, the user can select the most promising settings and run the screening on large databases (by applying ScreenMD) using only these settings.
This application is intended for use after the parameters of parametrized metrics have been optimized with OptimizeMetrics, to evaluate the performance of the parametrized metrics generated there, and before the final screening is performed with ScreenMD. These three applications are closely related, so this document assumes knowledge of the definitions and functions described in ScreenMD and OptimizeMetrics.
A separate tool, ScreeningOptimizer, unifies the functions of OptimizeMetrics and HitStatistics with simplified usage. It also prepares the random molecule sets required by these applications from the target set of molecules and the actives.
Usage of HitStatistics is very similar to that of ScreenMD, since it performs the same screening, with two exceptions. First, as mentioned above, the given sets of molecules must be specifically prepared, in the same way as for the metric optimizer. Three molecule sets should be provided (as for OptimizeMetrics): a potential target set, which is assumed to contain predominantly inactive molecules; a test set of molecules that are known to exhibit the same properties as the actives; and the actives (queries) to which molecules from the other two sets are compared. A random subset of the large database whose screening is the final goal is a good choice for the target set of the statistical calculations. The similar test molecules can be obtained by selecting some of the actives themselves, but in this case they must not be included in the set of query molecules. The subset of the final target set can be selected randomly by using a small application [1].
A significant difference from OptimizeMetrics is that the target subset can be much bigger, since only one screening is performed. In our tests this random subset of targets consisted of about 10000 molecules. This number can be increased, but then the program should be run in memory safe mode (in this mode not all dissimilarity values are stored in memory, which slows down execution, especially if several molecular descriptors are used).
The second difference from the usage of ScreenMD is that database connection is not supported, since the number of processed molecules can be managed with files.
The result of HitStatistics is a text file that contains the names of the input files (target, similar test set, queries) and the total number of molecules in these files. Statistics about the performance of the screen are gathered for each molecular descriptor-metric pair.
Note! It is planned that in the future the same statistics will also be available for each descriptor, or for all descriptors at the same time (in these cases descriptors and metrics are not treated independently of each other; the overall setup is evaluated). At the present stage, only separate statistics for each metric of each molecular descriptor are supported!
The following statistical data are supplied in a table format for each molecular descriptor-metric pair: number of target hits, number of similar test set hits, the achieved enrichment ratio and selectivity effectiveness, and the threshold, with which these results were obtained.
An additional feature of HitStatistics is the calculation of the distribution of dissimilarity values for the selected metrics. The distribution is given in a histogram-like format: the given range is divided into equal-width intervals, and the number of dissimilarity values falling into each interval is reported. The numbers of dissimilarity values falling below the lower end of the range or above the upper end of the range are gathered in two separate histograms.
hitstatistics <target file> <test file> <query file> [<options>]
hitstatistics config <configuration file> [<general options>]
These two modes are not strictly exclusive; they can be mixed in various ways. Command line parameters can extend the settings provided in the configuration file. File names can be specified in the command line even when parameters are defined in the configuration file; in this case the files given in the command line are processed. However, this kind of usage is recommended only for expert users. The exact specification of the command line syntax is as follows:
hitstatistics config <configuration file> <target file> \
<test file> <query file> [<options>]
Note that, when specified, the configuration file must be the first argument
after the hitstatistics command in the command line. Similarly,
file names are positional: if input is taken from files, the file names must
follow either the command name or the name of the configuration file. Also note
that the order of the file names is fixed: first the target file is specified,
followed by the name of the test file, then the query file.
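The positional rules above can be summarized in a short Python sketch. It is only an illustration of the stated rules, not the parser used by HitStatistics; the function name and the assumption that every option starts with a dash are ours.

```python
from typing import List, NamedTuple, Optional, Tuple


class ParsedArgs(NamedTuple):
    config_file: Optional[str]
    input_files: Optional[Tuple[str, ...]]  # (target, test, query) or None
    options: List[str]


def split_hitstatistics_args(argv):
    """Split the arguments that follow the `hitstatistics` command name."""
    args = list(argv)
    config_file = None
    # Rule 1: a configuration file, if given, comes first, after `config`.
    if args and args[0] == "config":
        if len(args) < 2:
            raise ValueError("'config' must be followed by a configuration file")
        config_file = args[1]
        args = args[2:]
    # Rule 2: file names are positional and directly follow the command name
    # or the configuration file, in the order target, test, query.
    # Assumption: anything starting with '-' is an option, not a file name.
    positional = []
    while args and not args[0].startswith("-") and len(positional) < 3:
        positional.append(args.pop(0))
    if len(positional) not in (0, 3):
        raise ValueError("target, test and query files must be given together")
    if not positional and config_file is None:
        raise ValueError("input files are required without a configuration file")
    input_files = tuple(positional) if positional else None
    return ParsedArgs(config_file, input_files, args)
```

For example, `split_hitstatistics_args(["config", "hs.xml", "-v"])` yields the configuration file `hs.xml`, no input files, and the remaining option `-v` (the file name `hs.xml` is hypothetical).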
Prepare the usage of the hitstatistics script or batch file
as described in Preparing the Usage of JChem
Batch Files and Shell Scripts.
Alternatively, the HitStatistics class can be directly
invoked:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
chemaxon.descriptors.HitStatistics ^
<target file> <test file> <query file> ^
[<options>]
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
chemaxon.descriptors.HitStatistics ^
config <configuration file> [<general options>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.descriptors.HitStatistics \
<target file> <test file> <query file> \
[<options>]
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.descriptors.HitStatistics \
config <configuration file> [<general options>]
Options and parameters can either be defined in the command line or be specified in an XML configuration file. The command line mode is more suitable for smaller experiments. In contrast, configuring HitStatistics from XML is convenient even for much larger virtual screening exercises. Although an example configuration file is available, users are not encouraged to write such configuration files manually; the use of the XML configuration editor is highly recommended instead.
General options:
  -h, --help                   this help message
  -x, --expert-help            advanced options for expert users
  -v, --verbose                verbose
Output options:
  -o, --output <filepath>      statistics output file name with full path
  -g, --generate-id [<first>]
                               generate unique structure identifiers
                               an optional value for the first ID can be given
  -e, --precision <prec>       number of decimal places after the decimal point
Descriptor options:
  -k, --descriptor <type> <common descriptor options> <type specific flags>
                               use/generate descriptors of the given type
                               according to the type specific flags (see below)
                               Supported types are: CF, PF
Common descriptor options:
  -c, --config <configfile>
                               path and name of the XML configuration file
  -t, --use-tag [<name>]       use existing descriptor data
  -M, --metric {<name>}        use the metric <name> as specified in the config file
Descriptor type specific options:
2D pharmacophore fingerprint options:
  -z, --fuzzy <smoothing factor>
                               generate fuzzy fingerprints with the given
                               fuzzy smoothing factor
Similarity options:
  -Q, --compare-queries        compare against query pharmacophores
  -H, --compare-hypothesis [<name> [C]]
                               compare against hypothesis <name>
                               Valid names are: Minimum, Average, Median.
                               Default hypothesis type is Minimum.
                               'C' indicates consensus fingerprint.
                               This flag may occur more than once with different
                               hypothesis types.
Statistics options:
  -b, --by-metrics             calculate statistics by each descriptor and metric
                               separately
  -d, --by-descriptors         calculate statistics by each descriptor separately
  -a, --by-all                 calculate overall statistics
  -s, --metric-distribution <lower bound> <upper bound> <histogram count>
                               generate information about the distribution of
                               metric values. Histogram count determines the
                               number of histograms to be generated (resolution).
                               From these histograms two gather the number of
                               dissimilarity values below and above the given
                               lower and upper bounds.
  -f, --memory-safe-mode       for cases when large number of molecules are to be
                               processed, in this case dissimilarity values are
                               not stored, but recalculated, if necessary.
Advanced options for expert users:
SDfile options:
  -I, --id-tag <name>          name of the tag storing unique molecule identifiers
2D pharmacophore fingerprint options:
  -P, --PMAP-tag [<name>]      use existing PMAP data
  -G, --Gaussian-cutoff <value>
                               not smoothing outside <value>*sigma
  -B, --smoothing-bound <value>
                               lower bound for fuzzy smoothing
  -A, --asymmetrical-smoothing
                               do not smooth left side
  -R, --ignore-rotomers        do not take into account the effect of rotatable
                               bonds
Similarity options:
  -r, --descriptors-and        thresholds for all descriptors, default is any
  -m, --metrics-and            thresholds for all metrics, default is any
  -Z, --zero-threshold         percentage threshold for zero limit in median
                               hypothesis
Statistics options:
  -l, --selectivity-asymmetry <value>
                               asymmetry factor for evaluator function selectivity
                               effectiveness. Default is 0.5 (no asymmetry).
Most of the options are the same as in ScreenMD; for a detailed description see ScreenMD options. There are fewer input/output options here, as database and descriptor file input is not supported yet. The options specific to HitStatistics are listed in the 'Statistics options' section.
Of the --by-metrics, --by-descriptors, and
--by-all flags only the first is supported so far, which means
that statistical data can be gathered separately for each metric of each
molecular descriptor. In this case the similarity related flags
--metrics-and and --descriptors-and are not
applicable, since each metric and each descriptor is treated separately.
Note! Flags --by-descriptors and --by-all
will be implemented in the future. When these flags are set, statistics
will be provided about a screening performed separately for each descriptor, but
with all of its metrics at the same time (option --by-descriptors;
in this case the flag --metrics-and is also meaningful), or about
a screening with all descriptors and all metrics at the same time
(option --by-all; each similarity flag is applicable).
At the present stage, only separate statistics for each metric of
each molecular descriptor are supported, as this is probably the most common
usage.
The distribution of dissimilarity values can be retrieved for each metric. It is given for equal-size intervals between the lower and upper bounds; the size of the intervals is determined by the histogram count. The first interval gathers the dissimilarity values below the given lower bound, and the last gathers the values above the upper bound. The lower end of the first interval is the lowest calculated dissimilarity value (if it is lower than the given lower bound), and the upper end of the last interval is the greatest calculated dissimilarity value (if it is greater than the given upper bound).
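The binning described here can be sketched in a few lines of Python. This is our reading of the description, not the HitStatistics implementation: we assume that the histogram count includes the two outer intervals, so that count minus two equal-width intervals lie between the bounds, and that a value equal to the upper bound is placed in the last inner interval. The function name is hypothetical.

```python
def dissimilarity_distribution(values, lower, upper, histogram_count):
    """Count dissimilarity values per interval.

    Index 0 collects values below `lower`, the last index collects values
    above `upper`, and the indices in between are equal-width intervals
    covering [lower, upper].
    """
    if histogram_count < 3:
        raise ValueError("need at least one inner interval plus two outer ones")
    if upper <= lower:
        raise ValueError("upper bound must be greater than lower bound")
    inner = histogram_count - 2
    width = (upper - lower) / inner
    counts = [0] * histogram_count
    for v in values:
        if v < lower:
            counts[0] += 1
        elif v > upper:
            counts[-1] += 1
        else:
            # Inner intervals are treated as half-open [a, b); the upper
            # bound itself is assigned to the last inner interval.
            k = min(int((v - lower) / width), inner - 1)
            counts[1 + k] += 1
    return counts
```

With bounds 0 and 1 and a histogram count of 6, the four inner intervals are each 0.25 wide, and every value is counted exactly once.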
If a larger number of molecules is to be processed, memory safe mode should be set. In this mode only the dissimilarity values related to one target structure are stored in memory at a time, not all of them. This may slow down execution if several statistical queries are performed, as dissimilarity values must be calculated in consecutive steps instead of in one go, or must be recalculated several times.
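The trade-off can be illustrated with a small Python sketch that counts hits for several thresholds in both ways. It is a toy model under stated assumptions, not the HitStatistics code: fingerprints are sets of on-bits, the dissimilarity is one minus the Tanimoto coefficient, a target counts as a hit if it is within the threshold of at least one query, and the query list is not empty. All function names are hypothetical.

```python
def tanimoto_dissimilarity(a, b):
    """Toy dissimilarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def count_hits_stored(targets, queries, thresholds):
    """Default mode: compute every value once and keep the whole matrix."""
    matrix = [[tanimoto_dissimilarity(t, q) for q in queries] for t in targets]
    evaluations = len(targets) * len(queries)
    hits = {th: sum(1 for row in matrix if min(row) <= th) for th in thresholds}
    return hits, evaluations


def count_hits_memory_safe(targets, queries, thresholds):
    """Memory safe mode: keep only the values of one target at a time and
    recalculate them for every statistical query (here: every threshold)."""
    hits = {}
    evaluations = 0
    for th in thresholds:
        n = 0
        for t in targets:
            row = [tanimoto_dissimilarity(t, q) for q in queries]  # one target only
            evaluations += len(row)
            if min(row) <= th:
                n += 1
        hits[th] = n
    return hits, evaluations
```

Both functions return the same hit counts; the memory safe variant never holds more than one row but performs as many times more dissimilarity evaluations as there are thresholds.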
An advanced option is setting the asymmetry factor in the formula of selectivity effectiveness. It is reasonable to use the same setting as in the preceding metric parameter optimization.
The same parameters can be defined in the XML configuration file. These options
appear in the configuration editor labeled with the long forms of the
command line parameter names given above, with only small differences. In
most cases only the first letters of the words are capitalized.
For example, --compare-queries is displayed as
CompareQueries. In other cases, especially when the option has
parameters, a frame has to be filled in instead of a single edit field. For example,
--compare-hypothesis is replaced by a frame in which all
the hypotheses can be specified with their type, and a consensus can be selected.
The use of the configuration editor is straightforward.
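The naming pattern mentioned for the simple case can be written down as a one-line Python helper. It is a sketch of the general pattern only; as the text says, the rule holds in most but not all cases, and the function name is hypothetical.

```python
def editor_label(long_option):
    """Turn a long command line option into the capitalized label form,
    e.g. '--compare-queries' becomes 'CompareQueries'.

    Only the first letter of each word is upper-cased; the rest of the
    word is left unchanged.
    """
    words = long_option.lstrip("-").split("-")
    return "".join(w[:1].upper() + w[1:] for w in words)
```

Options that take parameters may be shown as frames rather than single fields, so the result of this helper should be treated as a hint for locating an option in the editor, not as a guaranteed label.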
Merging short forms of command line options is not supported (that is,
instead of -vQ the form -v -Q should be used).
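A wrapper script that assembles the command line could expand merged short options before calling the tool. The following Python sketch is a hypothetical helper, not part of JChem; it assumes that a merged token consists of a single dash followed by two or more letters and that none of the merged flags takes a value.

```python
import re

# A single dash followed by at least two letters, e.g. "-vQ".
_MERGED = re.compile(r"^-[A-Za-z]{2,}$")


def split_merged_short_options(argv):
    """Expand tokens such as '-vQ' into '-v', '-Q'; leave the rest as is."""
    expanded = []
    for arg in argv:
        if _MERGED.match(arg):
            expanded.extend("-" + letter for letter in arg[1:])
        else:
            expanded.append(arg)
    return expanded
```

Long options such as `--by-metrics`, single short options, file names and numeric values are passed through unchanged.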
Warning! A valid license key is needed to use HitStatistics. When no valid license key is found in the home directory, HitStatistics runs in demo mode, where the number of molecular descriptors to be processed is limited to 2000 (thus, if several types of molecular descriptors are generated, the number of structures may be limited to a few hundred).
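As a rough guide, the effect of the demo limit on the number of structures can be estimated as follows. The sketch assumes that one descriptor is generated per structure for each descriptor type and that the limit of 2000 applies to the total; how the limit is actually enforced is not specified here, and the function name is hypothetical.

```python
DEMO_DESCRIPTOR_LIMIT = 2000  # limit stated in the warning above


def demo_structure_budget(descriptor_types, limit=DEMO_DESCRIPTOR_LIMIT):
    """Estimate how many structures fit into the demo mode limit.

    Assumption: each structure needs one descriptor per descriptor type.
    """
    if descriptor_types < 1:
        raise ValueError("at least one descriptor type is required")
    return limit // descriptor_types
```

Under these assumptions, four descriptor types leave room for about 500 structures.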
Since HitStatistics does not process a large amount of data, database connection is not supported yet.
Most molecular file formats are accepted (MDL molfile, Compressed molfile, SDfile, Compressed SDfile, SMILES, etc.).
If the input file is an SDfile, it may already contain descriptors of
molecules. This information can either be used or ignored. The default
behavior of HitStatistics is to ignore such information, in which case
descriptors are generated from the original molecular structures. This can be
overridden with the --use-tag flag, then descriptors stored in the
SDfile tags are used. The default SDfile tags for storing molecular descriptor
and related data are:
Tag names other than the defaults can be specified with the --tag-name
option.
SDfiles containing descriptors can be generated with
GenerateMD. Existing descriptors are
worth being reused as doing so can reduce running times.
HitStatistics writes the statistical information detailed
earlier into a text file. Its name should be specified with
the --output flag. If the name is not
specified, then results are written to the standard output.
Besides the XML configuration file that can optionally be used to specify parameter settings (see Usage), HitStatistics takes mandatory configuration files too. These files correspond to the molecular descriptors used for screening; there should be one file per descriptor.
Molecular descriptor settings are also defined in external text (XML) files; these settings are described in PMapper configuration and ScreenMD configuration. These configuration files can be edited with the Configuration Editor GUI, which eases the setup of the required parameters. Sample configuration files are available in the 'examples/config' directory (see pharmacophore fingerprint configuration, chemical fingerprint configuration and HitStatistics configuration).
cd example
./bin/hitstatistics config config/hitstatistics.xml -v
This example is available as an executable example in
the 'examples/bin' directory (see hitstatistics_example).
It calculates hit statistics about the parametrized metrics generated by
optimizemetrics_example, which should be executed prior to
calling hitstatistics_example.
A Unix [2] script (prescreen), used in-house to test OptimizeMetrics and HitStatistics, is detailed below. We feel that going through this example is the easiest way to understand the typical usage of these tools. The script optimizes parametrized metrics with different settings and different goals; statistics are then obtained by HitStatistics for each setting. Three parameters are to be provided: the file of target molecules ($1), the file of actives ($2), and the number of times each test is repeated ($3).
This example is for advanced users; similar functionality with much simpler usage is available in the tool ScreeningOptimizer.
Test results obtained by applying this script to 10000 structures from the NCI database and the active class of alpha1-adrenoceptor agonists (consisting of 7 structures) are provided for each run of the programs. Each test was performed three times with three random divisions of the target set and the set of actives to validate the results: they should be similar independently of the actual random division.
#!/bin/bash
for (( i = 1; i <= $3; i++ ));
Perform each test $3 times.
do
############################
#Generate random test files
############################
randomms $1 opt-$1 hit-$1 -n 300 -v;
randomms $2 hit-$2 opt-$2 -p 35 -v;
randomms opt-$2 opt-test-$2 opt-query-$2 -p 50 -v;
Cut the target set into two disjoint subsets using RandomMS: put 300 structures into the target set used for optimization, and the rest into the set used in the statistical calculations. Cut the active set into three: put 35% of the original set into the test set of the statistical calculations (3 molecules in the example) and 50% of the rest into the test set of the optimization (2 molecules in the example); the remaining molecules are used as queries (from which the hypothesis will be generated, 2 molecules in the example). Disjoint sets are generated to obtain realistic validation results: molecules used for training the parameters in OptimizeMetrics should not be reused as tests or targets in HitStatistics.
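The arithmetic of this three-way split can be checked with a short Python sketch. It only mimics how the script uses the -n and -p options of RandomMS and is not the RandomMS implementation; in particular, rounding the percentage up is our assumption, chosen because it reproduces the 3, 2 and 2 molecules quoted for the 7 actives. The function name is hypothetical.

```python
import math
import random


def random_split(items, n=None, percent=None, rng=random):
    """Return (selected, rest): a random subset and its complement.

    Give either an absolute size `n` or a percentage `percent`.
    Assumption: a percentage is rounded up to a whole number of items.
    """
    items = list(items)
    if (n is None) == (percent is None):
        raise ValueError("give exactly one of n or percent")
    if percent is not None:
        n = math.ceil(len(items) * percent / 100)
    n = min(n, len(items))
    chosen = set(rng.sample(range(len(items)), n))
    selected = [x for i, x in enumerate(items) if i in chosen]
    rest = [x for i, x in enumerate(items) if i not in chosen]
    return selected, rest


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the split is repeatable
    targets = [f"T{i}" for i in range(10000)]
    actives = [f"A{i}" for i in range(7)]
    opt_targets, hit_targets = random_split(targets, n=300, rng=rng)
    hit_tests, opt_actives = random_split(actives, percent=35, rng=rng)
    opt_tests, queries = random_split(opt_actives, percent=50, rng=rng)
    print(len(opt_targets), len(hit_targets),
          len(hit_tests), len(opt_tests), len(queries))
```

Because each call returns a subset and its complement, the resulting sets are disjoint by construction, which is the property the validation relies on.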
#######################################
#Optimize for SelectivityEffectiveness
#######################################
optimizemetrics \
opt-$1 opt-test-$2 opt-query-$2 -e 3 -H -v -f SelectivityEffectiveness \
-k CFp -c CFp.xml -o opt-CFp-$1-$2.xml \
-M Tanimotot Tanimoto \
-M Tanimotoa Tanimoto -a \
-M Euclideant Euclidean \
-M Euclideann Euclidean -n \
-M Euclideana Euclidean -a \
-M Euclideanw Euclidean -w \
-M Euclideanwan Euclidean -w -a -n \
-k PFp2D -o opt-$1-$2.xml -c pharma-frag.xml \
-M Tanimotot Tanimoto \
-M Tanimotos Tanimoto -s \
-M Tanimotoa Tanimoto -a \
-M Tanimotosa Tanimoto -s -a \
-M Euclideant Euclidean \
-M Euclideann Euclidean -n \
-M Euclideana Euclidean -a \
-M Euclideanw Euclidean -w \
-M Euclideanwan Euclidean -w -a -n \
-k PFp2D -o opt-$1-$2-z0.3.xml -c pharma-frag.xml -z 0.3 \
-M Tanimotot0.3 Tanimoto \
-M Tanimotos0.3 Tanimoto -s \
-M Tanimotoa0.3 Tanimoto -a \
-M Tanimotosa0.3 Tanimoto -s -a \
-M Euclideant0.3 Euclidean \
-M Euclideann0.3 Euclidean -n \
-M Euclideana0.3 Euclidean -a \
-M Euclideanw0.3 Euclidean -w \
-M Euclideanwan0.3 Euclidean -w -a -n \
-k PFp2D -o opt-$1-$2-z0.7.xml -c pharma-frag.xml -z 0.7 \
-M Tanimotot0.7 Tanimoto \
-M Tanimotos0.7 Tanimoto -s \
-M Tanimotoa0.7 Tanimoto -a \
-M Tanimotosa0.7 Tanimoto -s -a \
-M Euclideant0.7 Euclidean \
-M Euclideann0.7 Euclidean -n \
-M Euclideana0.7 Euclidean -a \
-M Euclideanw0.7 Euclidean -w \
-M Euclideanwan0.7 Euclidean -w -a -n;
Parametrized metrics are optimized for the basic pharmacophore fingerprint, for fuzzy pharmacophore fingerprints with fuzziness factors 0.3 and 0.7, and for chemical fingerprints. A minimum hypothesis is generated from the query set, and all comparisons are made against this hypothesis. Optimization maximizes the value of the evaluator function selectivity effectiveness. For each descriptor the following parametrized metrics are set:
In the case of fuzzy pharmacophore fingerprints the value of the fuzziness factor is also added to the name of the parametrized metric.
#
hitstatistics \
hit-$1 hit-$2 opt-query-$2 -o $i-SE-$1-$2.stat -e 3 -g -v -H -b \
-k PFp2D -c opt-$1-$2.xml \
-k PFp2D -c opt-$1-$2-z0.3.xml \
-k PFp2D -c opt-$1-$2-z0.7.xml \
-k CFp -c opt-CFp-$1-$2.xml;
All the parametrized metrics added to the configuration by OptimizeMetrics, as well as the ones that were already there before the optimization, are tested on a set of 9700 molecules. This is done for each molecular descriptor previously set by OptimizeMetrics. Again, comparison is made against the generated hypothesis (it is recommended to use the same settings as in OptimizeMetrics). See resulting statistics (with first, second, third random selection) for the example of alpha1-adrenoceptor agonists.
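When such runs are driven from a program rather than from a shell script, the same HitStatistics invocation can be assembled as an argument list. The Python sketch below reproduces only the selectivity effectiveness call shown above, with the same flags and file naming scheme; the function name is hypothetical, and so are the file names used in the example.

```python
def hitstatistics_command(run, targets, actives):
    """Build the argument list of the HitStatistics call in the script.

    `run` plays the role of $i, `targets` of $1 and `actives` of $2.
    """
    tag = f"{targets}-{actives}"
    cmd = [
        "hitstatistics",
        f"hit-{targets}", f"hit-{actives}", f"opt-query-{actives}",
        "-o", f"{run}-SE-{tag}.stat",
        "-e", "3", "-g", "-v", "-H", "-b",
    ]
    # One -k/-c pair per descriptor configuration written by OptimizeMetrics.
    configs = [
        ("PFp2D", f"opt-{tag}.xml"),
        ("PFp2D", f"opt-{tag}-z0.3.xml"),
        ("PFp2D", f"opt-{tag}-z0.7.xml"),
        ("CFp", f"opt-CFp-{tag}.xml"),
    ]
    for descriptor, config in configs:
        cmd += ["-k", descriptor, "-c", config]
    return cmd


if __name__ == "__main__":
    # Hypothetical input names; print the command instead of running it.
    print(" ".join(hitstatistics_command(1, "nci.sdf", "alpha1.sdf")))
```

If the hitstatistics script is on the path, the returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list avoids shell quoting issues with file names.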
##################################################
#Optimize for Asymmetric SelectivityEffectiveness
##################################################
optimizemetrics \
opt-$1 opt-test-$2 opt-query-$2 -e 3 -H -v -f SelectivityEffectiveness 0.3 \
...
#
hitstatistics \
hit-$1 hit-$2 opt-query-$2 -o $i-SE-asymmetric-$1-$2.stat -e 3 -g -v -H -b -l 0.3 \
...
The same optimization and statistics as above, only optimization is done for asymmetric selectivity effectiveness, with asymmetry factor 0.3. See resulting statistics (with first, second, third random selection) for the example of alpha1-adrenoceptor agonists.
##########################
# Optimize for Enrichment
##########################
optimizemetrics \
opt-$1 opt-test-$2 opt-query-$2 -e 3 -H -v -f Enrichment \
...
#
hitstatistics \
hit-$1 hit-$2 opt-query-$2 -o $i-E-$1-$2.stat -e 3 -g -v -H -b \
...
################################################
done
The same optimization and statistics as above, only optimization is done for enrichment ratio. See resulting statistics (with first, second, third random selection) for the example of alpha1-adrenoceptor agonists.
hitstatistics target.sdf test.sdf query.sdf -v -Q -H Minimum -H Average \
-s 0 100 20 -f -k PFp2D -c pharma-frag.xml -M Tanimoto Euclidean
Dissimilarities are calculated between two sets. One consists of all structures in target.sdf and test.sdf; the other contains all query structures in query.sdf, together with the average and minimum hypotheses generated from these queries. The distribution is given for the metrics Tanimoto and Euclidean of 2D pharmacophore fingerprints in 20 histograms, focusing on the values between 0 and 100. Memory safe mode is used. Output is written to standard output.
-h. It can be found in the 'examples/bin' directory.