Virtual Screening Examples

Content

Introduction

The purpose of this document is to demonstrate the usage of applications in the Screen package through examples. These examples range from simple to more complex ones and cover many aspects of virtual screening using ChemAxon's tools. Pharmacophore perception, molecular descriptor generation, screening and its optimization are illustrated.

Prerequisites

In order to run these examples you need:

  1. Java Virtual Machine version 1.4 or higher and JChem installed on your computer.
  2. Some environment variables have to be set for proper operation, see description in Preparing and Running JChem's Batch Files and Shell Scripts manual for details.
  3. The applications explained here have to be run as commands in a terminal window (shell) in UNIX or Command Prompt in Windows. To launch a command window in Windows: Start menu -> All Programs -> Accessories -> Command Prompt or Start menu -> Run > cmd). Alternatively, you can use a UNIX-like shell, Cygwin, top.
  4. In the terminal/command prompt window change to your JChem installation directory. Then type
    cd examples/screen
    
    in UNIX, or
    cd examples\screen
    
    in Windows.
 

Pharmacophore point perception and mapping

This example demonstrates the perception of pharmacophore point types of compounds using the PMapper program.
Input structures are retrieved from an SDfile (nci1000.sdf). Results will be stored in nci1000-pmap.sdf. This output file will contain all original structures along with their pharmacophore point map stored in the custom PMAP-FRAG tag. Pharmacophoric point perception is based on a simple rule-base which defines six different pharmacophore atom types: anionic, cationic, hydrogen bond donor, hydrogen bond acceptor, hydrophobic and aromatic.

pmapper nci1000.sdf -c pharma-frag.xml -o nci1000-pmap.sdf -S -t PMAP-FRAG 

The use and meaning of command-line flags in the above command:

OptionDescriptionDefault
-c specifies the name of the XML configuration file -
-o specifies the name of the output file standard output (terminal windows where pmapper runs) when omitted
-S forces SDfile output -
-t specifies the name of the SDF tag where pharmacophore data is stored PMAP

The pharmacophore types are mapped to individual atoms. When displayed in MarvinView they can be colored according to the schema below:

pharmacophore
type
display
color
pharmacophore
type
display
color
cationic  anionic 
aromatic  hydrophobic 
donor  acceptor 
acceptor and donor  anionic and acceptor 
other combination  none 

This schema is defined in the XML configuration file, in the ScreeningConfiguration section. To display structures using pharmacophore type coloring execute:

mview nci1000-pmap.sdf -t PMAP-FRAG -p PF.colors 

This command will pop up a MarvinView application window and will display pharmacophore type colored structures.

The use and meaning of command-line flags in the above command:

OptionDescription
-t specifies the name of the SDF tag where pharmacophore data is stored
-p specifies the name of the field containing the coloring schema

The real point of pharmacophore mapping is not pure visualization but the generation of pharmacophore fingerprints that can be used in virtual screening.

Further examples demonstrating the use of the PMapper tool are found in the corresponding user documentation.

Molecular descriptor generation

Various molecular descriptors are generated by the GenerateMD program. The complete list of available descriptors can be obtained by:

generatemd -L

This command displays the name and abbreviation of all descriptors:

GenerateMD - Molecular Descriptor Generator 1.5.12, (C) 2002-2014 ChemAxon Ltd.
Built in descriptor types:
  Chemical fingerprint (CF)
  Pharmacophore fingerprint (PF)
  BCUT descriptors (BCUT)  
  Hydrogen bond donor/acceptor count (HDon/HAcc)
  octanol-water distribution coefficient (LogD)
  octanol-water partition coefficient (LogP)
  topological polar surface area (TPSA)
  mass of molecule (Mass)
  number of heavy atoms (Heavy)

Some examples are given below.

Pharmacophore fingerprint

This example demonstrates the generation of topological pharmacophore fingerprints for 1000 compounds taken from an SDfile (nci1000.sdf). Results will be stored in nci1000-PF.sdf. This output file will contain every original structures along with their pharmacophore point map (in the PMAP tag) as well as the pharmacophore fingerprints generated (in the PF tag).
The progress of the descriptor generation will be reported in the terminal window after each 100 molecular structures have been processed.

generatemd c nci1000.sdf -k PF -c pharma-frag.xml -o nci1000-pfp.sdf -S -v 100

The use and meaning of command-line flags in the above command:

OptionDescriptionDefault
c is a command abbreviation which stands for 'create' -
-k specifies the type of the molecular descriptor to be generated, PF stands for PharmacophoreFingerprint -
-c specifies the name of the XML configuration file -
-o specifies the name of the output file standard output, when -o is omitted
-S forces SDfile output
-v verbose mode, progress is reported after each 100 structures processed 1000

The output file contains two new custom tags after each molecular structure:

>  <PMAP>
h;h;h;h;a;h;h;h;a

>  <PF>
a a = | 0 0 0 0 1 0 0 0 0 0 |, h a = | 2 4 5 3 0 0 0 0 0 0 |, h h = | 7 8 5 1 0 0 0 0 0 0 |

Such fingerprints can directly be used in virtual screening.

Chemical fingerprint

This example demonstrates the generation of topological pharmacophore fingerprints for 1000 compounds taken from an SDfile (nci1000.sdf). Results will be stored in a descriptor file. This output file will not contain original structures only the descriptors generated. The program runs in quiet mode, no message is written about its progress.

generatemd c nci1000.sdf -k CF -c cfp.xml -o nci1000.cfp 

The use and meaning of command-line flags in the above command:

OptionDescription
c is a command abbreviation which stands for 'create'
-k specifies the type of the molecular descriptor to be generated, CF stands for ChemicalFingerprint
-c specifies the configuration file for the CF descriptor

Note, that the XML configuration file is optional for chemical fingerprint generation and it is omitted in the above example.

A portion of the output file is displayed below:

Such fingerprints can directly be used in virtual screening and also in clustering.

Scalar descriptors

This example demonstrates the generation of a scalar descriptor, the topological polar surface area. 1000 compounds are taken from an SDfile (nci1000.sdf). Results will be stored in a descriptor file. This output file will not contain original structures only the descriptors generated. The program runs in quiet mode, no message is written about its progress.

generatemd c nci1000.sdf -k TPSA -o nci1000.tpsa

The use and meaning of command-line flags in the above command:

OptionDescription
c is a command abbreviation which stands for 'create'
-k specifies the type of the molecular descriptor to be generated, TPSA stands for Topological Polar Surface Area.

Note, that an XML configuration file is not needed for the generation of most scalar descriptors.

Further examples demonstrating the use of the GenerateMD tool, including examples when structures are taken from database tables and descriptors are stored in the database too, are found in the corresponding user documentation.

Screening using molecular descriptors

ChemAxon's virtual screening application is called ScreenMD. This program recognizes a wide variety of molecular descriptors and can perform dissimilarity calculation using one or more descriptors at the same time.

Chemical similarity

This example demonstrates how chemical fingerprints generated in the above example and stored in a molecular descriptor file can be used in searching for compounds that are chemically (topologically) similar to a query compound (stored in a file as a smiles string):

screenmd ace-1.smiles -k nci1000.cfp

ace-1.smiles is the name of the query file which contains the above displayed structure.

The use and meaning of command-line flags in the above command:

OptionDescription
-k specifies the type of the molecular descriptor to be used, in this example it is not an abbreviation of a descriptor as listed above but the name of a descriptor file generated previously. Note, that such files store the original configuration used when they were generated. This assures that the same parametrization is used in screening too.

The output is written into the terminal window where the command is executed. Each row contains two number, the Tanimoto and the Euclidean dissimilarity ratios between the query and one target structure. Note, that not all dissimilarity values are displayed, only those that are lower than a predefined dissimilarity threshold (this is 0.2 for the Tanimoto dissimilarity, and 10 in the case of the Euclidean distance). If either values are below the threshold, both are displayed, that is, the corresponding target structure is a virtual hit.
A portion of the output of the above command is as follows:

        q1_CF_Tan       q1_CF_Euc
        0.66    9.70
        0.48    8.43
        0.58    9.49
        0.52    9.54
        0.52    9.75
        0.62    9.80
        0.56    9.00
        0.61    9.95
        0.64    9.54
        0.62    10.00
        0.61    9.75
        0.59    9.38
        0.51    9.33
        0.49    8.66
        0.45    8.72
        0.38    8.06
        0.49    9.64
        0.50    9.22
        0.50    9.22
        0.50    9.22
        0.50    9.22
        0.50    9.22
        0.50    9.22
        0.50    9.22
        0.35    7.87

To get a more useful output two more flags need to be used:

screenmd ace-1.smiles -k nci1000.cfp -M Euclidean -g

The use and meaning of these two command-line flags in the above command:

OptionDescriptionDefault
-M selects the specified metric for dissimilarity calculations all metrics in the configuration
-g generates unique structures id-s and writes them into the output, may take an optional starting value 1

The output written in the command window is:

id      q1_CF_Euc
90      9.70
185     8.43
200     9.49
205     9.54
220     9.75
226     9.80
228     9.00
236     9.95
239     9.54
293     10.00
309     9.75
315     9.38
316     9.33
317     8.66
318     8.72
319     8.06
323     9.64
324     9.22
325     9.22
326     9.22
327     9.22
328     9.22
329     9.22
330     9.22
331     7.87
332     9.75

Note, that missing id values correspond to structures that did not pass similarity screening, their dissimilarity was above the threshold for acceptance.

Pharmacophore similarity

This example demonstrates how pharmacophore fingerprints generated in the above example and stored in an SDfile can be used in searching for compounds that have similar pharmacophore as some given query compounds (stored in a file as a smiles strings). Four ACE inhibitors, displayed below, are used as query structures. (Note, that atoms are colored according to their pharmacophore type.)

screenmd nci1000-pfp.sdf ace-1-4.smiles -k PF -c pharma-frag.xml -t PF -g -o table hits.txt -o sdf hits.sdf

nci1000-pfp.sdf is the target file to be screened while ace-1-4.smiles is the name of the query file which contains the above displayed four structures.

The use and meaning of command-line flags in the above command:

OptionDescription
-k specifies the type of the molecular descriptor to be used, in this example it is PF that abbreviates pharmacophore fingerprint
-c specifies the XML configuration file associated with the current descriptor type
-t tells screenmd to take the value of the descriptor from the SDfile tag named here (note, that PF after -t is the name of the tag and could be anything)
-g generates unique structures id-s and writes them into the output
-o specifies the type and name of the output file, table stands for the table of dissimilarity values (as seen above) while sdf produces a standard SDfile which contains some custom tags to store the dissimilarity values

Virtual hits can be visualized with the MarvinView application. Dissimilarity values are displayed below structures. Note, that there are 8 values corresponding to each structure because four queries and two dissimilarity metrics were applied.
Hits can displayed by:

mview -t PMAP -p PF.colors -f q1_PF_Tan:q1_PF_Euc:q2_PF_Tan:q2_PF_Euc:q3_PF_Tan:q3_PF_Euc:q4_PF_Tan:q4_PF_Euc hits.sdf

This hit-set contains all target structures that passed the threshold in a comparison with any of the queries by any of the metrics applied. This is the less restrictive behavior of screening and there are command line flags available to change this default mode.

Screening for hypothesis

A hypothesis can be build from the above 4 individual queries. The next example illustrates screening against a pharmacophore hypothesis.

screenmd nci1000-pfp.sdf ace-1-4.smiles -k PF -c pharma-frag.xml -t PF -H Median -g -o table hits.txt -o sdf hits.sdf

In this example a target compound is only compared against the hypothesis, though with both metrics applied, thus there are two dissimilarity values stored per hit structures.

Also note, that the number of hits decreases from 143 to 26.

Molecular descriptor set

This example demonstrates how various descriptors can be combined in virtual screening. Unlike in the two examples above, where descriptors were generated in separate steps, in this case all descriptors are generated during the virtual screening procedure ("on-the-fly") and are not stored anywhere.

screenmd nci1000.sdf ace-1.smiles -k PF -c pharma-frag.xml -k TPSA -k LogP -o hits.txt -g

In this example three different molecular descriptors, pharmacophore fingerprint, topological polar surface area and octanol-water partition coefficient are combined in a molecular descriptor set. One dissimilarity value is obtained for each target structure-query structure pair which combines individual dissimilarity ratios of component descriptor using Euclidean metric.

The first id-s and dissimilarity scores of the first 25 hits in the output:

id	q1
9	23.19
21	7.19
110	6.71
145	6.13
210	25.43
218	1.40
233	0.96
242	25.96
248	6.92
249	11.22
259	13.17
270	6.52
284	2.50
297	11.14
337	24.87
365	2.80
372	2.35
400	1.06
403	28.48
416	4.32
446	12.14
467	2.59
472	15.73
473	4.53
491	9.53

Optimizing virtual screening

Examples given here illustrate the use of automatic optimization to enhance the efficiency of virtual screening. The optimization tunes various parameter values introduced in parametrized metrics. The available variants of the two standard metrics, Tanimoto and Euclidean are listed below:

  • asymmetric (or directed) Tanimoto
  • scaled Tanimoto
  • scaled asymmetric Tanimoto
  • normalized Euclidean
  • asymmetric Euclidean
  • weighted Euclidean
  • normalized asymmetric Euclidean
  • weighted asymmetric Euclidean
  • weighted normalized Euclidean
  • weighted asymmetric normalized Euclidean

Prior to optimization, it is not known which parametrized metric will exhibit the best performance, thus all metrics are tried to be optimized and the best ones are chosen for screening. In the examples below the active family is a set of ACE inhibitors. For training purposes a small subset of the target database is used, in these examples 300 structures (nci300.smiles). The known actives are divided into three parts, 4 structures are used as training set (ace-1-4.smiles) and four others as queries (ace-5-8.smiles). The third chunk is not used in optimization, it is preserved for validation purposes.

Optimization set-up in command line

optimizemetrics nci300.smiles ace-1-4.smiles ace-5-8.smiles -k PF -c pharma-frag.xml -M TanT Tanimoto -M TanA Tanimoto -a -M TanS Tanimoto -s -o pharma-frag-opti.xml -H Median -v

The use and meaning of command-line flags in the above command:

OptionDescription
-k specifies the type of the molecular descriptor to be used, in this example it is PF that abbreviates pharmacophore fingerprint
-c specifies the XML configuration file associated with the current descriptor type; in optimization it is the input configuration, and the result of the optimization process is written elsewhere, this file is not affected
-M specifies a parametrized metric, it takes two mandatory parameters: the name of the new parametrized metric (e.g. TanT and TanA in this example), the name of the base metric (e.g. Tanimoto) (note, that the base metric has to be defined in the input configuration file specified after the -c flag).
Five optional flags may follow the -M flag: -a, -w, -s and -n, and these can be combined (though not all combinations are valid, for instance Euclidean cannot be scaled, see detailed definition of parametrized metrics).
-a asymmetric metric
-w weighted (Euclidean) metric
-s scaled (Tanimoto) metric
-n normalized (Euclidean) metric
-o specifies the name of the output XML configuration file which contains the entire input configuration plus all optimized parametrized metrics are added
-H a hypothesis build form the individual queries is used in optimization, the default hypothesis type is Minimum

Note, that if no parameter specific flag (a, w, n, s) is given after the name and type of a parametrized metric, it is still valid (e.g. TanT in the example above). In this case parameters are not introduced and tuned, the basic metric is used but the dissimilarity threshold for acceptance/rejection is determined (and store in the XML configuration). Detailed explanation is found in the relevant documentation.

In the case of pharmacophore fingerprints, weighting can be applied in two distinct forms: per cell and per feature. The command line flags are -W and -w respectively.

The optimizemetrics command writes the specified output XML configuration file. Besides, some simple statistics about the performance of the metrics optimized are written in the terminal window. These results, however, have to be validated using an independent test set.

Validation of optimized metrics

An independent blind test is carried out to validate the selectivity of optimized metrics. For this purpose a yet unused subset of the target structures and active structures is needed, but the same query set can be used.

hitstatistics nci700.smiles ace-9-16.smiles ace-5-8.smiles -o hitstat.txt -e 3 -H Median -b -k PF -c pharma-frag-opti.xml

The use and meaning of command-line flags in the above command:

OptionDescription
-o specifies the name of the output file where the statistics are saved
-e specifies the numeric precision of decimal numbers
-H specifies the hypothesis type, this should match the one used in the optimization step
-b statistics are calculated for each descriptor and metric separately
-k specifies the descriptor type
-c specifies the name of the configuration file, this has to be the one generated by optimizemetrics

The output file contains some general information about the input files and detailed statistical data about the performance of basic and optimized parametrized metrics. This file is best viewed in a spreadsheet program.

Each row contains statistics for one parametrized metric. Columns are:

  • Descr: type of the descriptor
  • Metric: name of the parametrized metric
  • Enrichment: enrichment ratio, the higher the better
  • SelectivityEffectiveness: selectivity effectiveness
  • ActiveHitDistribution: Active Hit Distribution, value 1 is ideal
  • Threshold: dissimilarity threshold for the predefined percentage of active hits
  • Test Hits: number of hits from the set of spikes (known actives mixed into the random set), the higher the better
  • Target Hits: number of hits from the random target set, the smaller the better
  • Before actives: this column and the number following this column are for internal use only

Optimization set-up in batch programs

The command above is fairly long, yet it does not exploit the full functionality of OptimizeMetrics. Before optimization it is not know which descriptor and dissimilarity metric combination will provide the best performance. Therefore it is reasonable to optimize for various descriptors and all possible metrics at the same time. These optimization goals do not change too frequently, only does the target molecule set and the set of known active structures. It is very straightforward to create a batch program (in MS-Windows) or a shell script (in UNIX variants) that collects all these common settings and takes the name of the input structure files only.

optimize nci1000.smiles ace.smiles

This command does all necessary preprocessing steps: cuts target file into two (one part used in optimization, the other part kept for validation), similarly, it cuts the active file into three (equal) parts using random picking, then calls the optimizer for three types of molecular descriptors (chemical fingerprint, pharmacophore fingerprint and fuzzy pharmacophore fingerprint) with all available parameterized metrics. When optimization is accomplished, it writes three XML configuration files that contain the optimized parametrized metrics.
The output configurations are directly fed into the next stage, the validation of metrics. Appropriate input files are automatically created and selected for this purpose. Results are written in table format into the file example_statistics.stat. The most relevant part of this file is shown here (in a spreadsheet program, for easy viewing).

The highest enrichment value (and also the highest active hit distribution) is achieved by the Euclideanwa metric (weighted, asymmetric Euclidean) for pharmacophore fingerprint, thus this descriptor and this metric should be used in the screening of the entire library for compounds matching the ACE pharmacophore.

Do you have a question? Would you like to learn more?

Please browse among the related topics on our support forum or search the website. If you want to suggest modifications or improvements to our documentation email our support directly!