FragmentStatistics

Version 5.0.3

Introduction

FragmentStatistics creates statistical results from the output of Fragmenter. The simplest usage is to remove duplicate fragments and sort fragments by occurrence, but FragmentStatistics can also sort fragments by molecule activity or other data read from the input molecules and stored together with the generated fragments.

The input of FragmentStatistics is the output of Fragmenter in cxsmiles format with the following fields:

  1. SMILES string
  2. atom labels storing fragment cleavage data
  3. unique ID (used for fragment duplicate check)
  4. input molecule data read from SDFile tag (optional, e.g. molecule activity)

The output of FragmentStatistics is a sorted cxsmiles table with the following data:

  1. SMILES string
  2. atom labels storing fragment cleavage data
  3. atom count
  4. fragment counts per activity categories (number of identical fragments in each activity category, one field for each)
  5. score

Fragments are sorted by activity, which is calculated with a scoring function:

ac^x * (w1*c1 + w2*c2 + ... + wN*cN)
where:

  ac is the atom count of the fragment
  x is the exponent applied to the atom count
  ci is the number of identical fragments in activity category i
  wi is the weight of activity category i
  N is the number of activity categories

If there is no activity data then FragmentStatistics simply removes fragment duplicates and sorts fragments by ac^x*c1, where c1 is the fragment count. By default the exponent is 1, so the score is ac*c1.

If there are two activity categories then the default scoring function is ac*(c1 - c2); if there are three categories, it is ac*(c1 - c3).
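
The scoring function can be sketched in a few lines of Python. This is only an illustration of the formula, not the FragmentStatistics implementation; the function name fragment_score and the two sample fragments are hypothetical.

```python
def fragment_score(atom_count, counts, weights, exponent=1.0):
    """Return ac^x * (w1*c1 + ... + wN*cN) for one fragment."""
    if len(counts) != len(weights):
        raise ValueError("one weight is needed per activity category")
    weighted_sum = sum(w * c for w, c in zip(weights, counts))
    return atom_count ** exponent * weighted_sum


# No activity data: a single category with weight 1, so the score is ac^x * c1.
# Hypothetical fragments given as (atom count, occurrence) pairs.
fragments = {"big_rare": (12, 1), "small_frequent": (3, 3)}

for x in (1.0, 0.4, 0.0):
    scores = {
        name: fragment_score(ac, [count], [1], x)
        for name, (ac, count) in fragments.items()
    }
    # Print the exponent and the fragment that would be listed first.
    print(x, max(scores, key=scores.get))

# Two categories with weights +1 and -1: ac * (c1 - c2).
print(fragment_score(5, [4, 1], [1, -1]))
```

With exponent 1 the big fragment scores higher (12 against 9). With exponents 0.4 and 0 the small but frequent fragment comes first, because the atom count contributes less or nothing to the score.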

Examples

  1. Two activity ranges with cutoff value 0.5:

    command line: -c "0.5"

    scale:
    ------------------------|--------------------
    < 0.5: Inactive        0.5     >= 0.5: Active
    
    Name      Activity value  Weight
    Active    >= 0.5          +1
    Inactive  < 0.5           -1

    Scoring formula:

    ac * (#(Active) - #(Inactive))

  2. Discrete activity values:

    command line: -r "PIC NAN MIC MIL LESS INA"

    Name                  Activity value  Weight
    Picomolar inhibitor   PIC             +1
    Nanomolar inhibitor   NAN             +0.6
    Micromolar inhibitor  MIC             +0.2
    Millimolar inhibitor  MIL             -0.2
    Less than millimolar  LESS            -0.6
    Inactive              INA             -1

    Scoring formula:

    ac * (#(Picomolar) + 0.6* #(Nanomolar) + 0.2* #(Micromolar)
          - 0.2* #(Millimolar) - 0.6* #(Less than millimolar) - #(Inactive))
    

A set of working examples is also available.
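
The weights in the two examples above (+1, -1 for two categories; +1, +0.6, +0.2, -0.2, -0.6, -1 for six) are evenly spaced between +1 and -1. The sketch below reproduces that pattern. It is an inference from the examples only, so treat it as an assumption about the default weights; the function name linear_weights is hypothetical.

```python
def linear_weights(category_count):
    """Weights spaced evenly from +1 (first category) to -1 (last category)."""
    if category_count < 1:
        raise ValueError("at least one category is required")
    if category_count == 1:
        return [1.0]
    step = 2.0 / (category_count - 1)
    # Round to suppress floating-point noise such as 0.19999999999999996.
    return [round(1.0 - i * step, 6) for i in range(category_count)]


for n in (2, 3, 6):
    print(n, linear_weights(n))
```

For three categories this gives the weights +1, 0 and -1, which agrees with the default three-category scoring function ac*(c1 - c3).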

Usage

    fragstat [<options>] [<input file>] 

Set up the fragstat script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Alternatively, the FragmentStatistics class can be directly invoked:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
        chemaxon.reaction.FragmentStatistics ^
        [<options>] [<input files/strings>]

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
        chemaxon.reaction.FragmentStatistics \
        [<options>] [<input files/strings>]

Options

  -h, --help                      this help message
  -s, --help-scoring              help on fragment scoring
  -c, --cutoffs <cutoffs>         category classes with continuous range
                                  cutoffs: a list of cutoff values
                                  defining activity intervals
                                  (e.g. "0.5 2.3 5.3")
  -r, --range <range>             category classes with discrete range
                                  range: a list of all possible values
                                  in ascending activity order
                                  (e.g. "0 1 2 3")
  -t, --order-type <order>        category activity order
                                  a: ascending  (the smaller the more active)
                                  d: descending (the bigger the more active)
                                  (default: d)
  -a, --all-fragments             display all fragments in output
                                  (default: only actives in highest category)
  -m, --min-atom-count            minimum number of heavy atoms
                                  required in a fragment
                                  (default: 0 - no restriction)
  -e, --default-activity <value>  default activity value set for fragments
                                  with no activity value
                                  (default: skip these fragments)
  -d, --display-header            display table header in output
                                  (otherwise displays only cxsmiles)
  -o, --output <filepath>         output file path (default: stdout)

FragmentStatistics takes its input from the cxsmiles output of Fragmenter and sorts fragments with duplicate filtering. If the fragmented molecules contain activity data in an SDF field then Fragmenter can be run with the --statistics parameter to store this data in the created fragments. FragmentStatistics can then be used to sort fragments by activity, measured by a scoring function (for help on scoring and its parameters, type fragstat -s).

Activity categories are determined either by cutoff values specified in the --cutoffs parameter or by the complete activity range specified in the --range parameter. In the former case the cutoff values determine a finite set of activity intervals (the first and the last intervals are half-infinite). In the latter case the activity range is finite and the complete set of activity values is listed in the --range parameter.
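
As an illustration of how cutoff values split the activity scale into intervals, here is a minimal Python sketch. The function name interval_index is hypothetical, and the boundary rule (a value equal to a cutoff belongs to the upper interval) is assumed from the "< 0.5" / ">= 0.5" example in the introduction.

```python
from bisect import bisect_right


def interval_index(value, cutoffs):
    """Index of the activity interval containing value (0 = lowest values)."""
    # N cutoffs define N + 1 intervals; the first and last are half-infinite.
    return bisect_right(sorted(cutoffs), value)


cutoffs = [0.5, 2.3, 5.3]  # as in the --cutoffs example "0.5 2.3 5.3"
for value in (0.1, 0.5, 3.0, 9.9):
    print(value, interval_index(value, cutoffs))
```

Three cutoffs give four intervals, so the sample values fall into intervals 0, 1, 2 and 3 respectively.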

The command line parameter --order-type determines whether high activity corresponds to large or small numerical activity values. By default, FragmentStatistics takes bigger values as more active. This is appropriate when the activity data measures the activity itself. The opposite order should be specified if the activity is given as the minimal concentration needed to achieve some chemical effect.
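
The effect of the order type can be sketched as a mapping from an interval index (counted from the lowest values) to an activity rank. This is an assumed illustration, not the tool's code; the function name activity_rank and the convention that rank 0 is the most active category are hypothetical.

```python
def activity_rank(interval, interval_count, order_type="d"):
    """Rank of an interval, 0 being the most active category.

    interval is the 0-based interval index counted from the lowest values.
    """
    if order_type == "d":  # descending: the bigger the more active
        return interval_count - 1 - interval
    if order_type == "a":  # ascending: the smaller the more active
        return interval
    raise ValueError("order type must be 'a' or 'd'")


# Two intervals split at one cutoff: "< cutoff" is interval 0, ">= cutoff" is 1.
print(activity_rank(1, 2))       # default order: the upper interval is most active
print(activity_rank(1, 2, "a"))  # ascending order: the upper interval is least active
```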

If the command line parameter --all-fragments is specified then the output contains all fragments, otherwise only actives appearing in the highest activity category are included in the output. If there are no activity categories then all fragments are regarded as active.

If the command line parameter --min-atom-count is specified then fragments with fewer heavy atoms than this limit are excluded from the statistics.

If a default activity value is specified in the command line parameter --default-activity then this activity value is set for fragments with no activity value. If this parameter is omitted then fragments with no activity value are skipped.
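
The two filtering rules above can be summarized in a short Python sketch. It only mirrors the described behaviour; the function name select_fragments, the tuple layout and the sample records are hypothetical.

```python
def select_fragments(fragments, min_atom_count=0, default_activity=None):
    """Yield (smiles, heavy_atoms, activity) tuples that pass both filters."""
    for smiles, heavy_atoms, activity in fragments:
        if heavy_atoms < min_atom_count:
            continue  # too small: excluded from the statistics
        if activity is None:
            if default_activity is None:
                continue  # no activity value and no default: skipped
            activity = default_activity
        yield smiles, heavy_atoms, activity


# Hypothetical records: (SMILES, heavy atom count, activity value or None).
records = [("CCO", 3, "4"), ("C", 1, "5"), ("c1ccccc1", 6, None)]

print(list(select_fragments(records, min_atom_count=3)))
print(list(select_fragments(records, min_atom_count=3, default_activity="0")))
```

Without a default activity only the first record survives; with a default activity of "0" the benzene fragment is kept as well and carries that value.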

If the command line parameter --display-header is specified then the table header is included in the output. This is useful when the output is read as a data table, but in this case the output is not a cxsmiles molecule file and cannot be displayed with mview.

 

Input

The input is the cxsmiles output of Fragmenter with specific fields.

If no input file name or input string is specified in the command line then input is taken from the standard input.

 

Output

FragmentStatistics writes output molecules in cxsmiles format with specific fields and an optional header line. If the --output parameter is omitted, results are written to the standard output.

 

Examples

In the examples below, we first fragment the input molecules in mols.sdf where activity data is given in the ACTIVITY SDF field:

mols.sdf

We use Fragmenter.xml as Fragmenter configuration. Note that we generate all fragments: there is no limit on the number of fragment sets or fragments per fragment set.

  1. Fragment the structures in the mols.sdf file and write the molecule fragments, with activity values read from the ACTIVITY SDF field, to fragments.cxsmiles:
    fragment -c Fragmenter.xml -s ACTIVITY mols.sdf -o fragments.cxsmiles
    

    Some of the generated fragments are shown below:

    fragments.cxsmiles
  2. Sort fragments with duplicate filtering by running FragmentStatistics with no activity categories:
    fragstat fragments.cxsmiles -o sorted1.cxsmiles
    

    The first four fragments are shown below:

    sorted1.cxsmiles

    Observe that big fragments take precedence.

  3. Modify the order of fragments by adjusting the ac^x*c1 scoring function. Set the exponent x in the -x parameter:
    fragstat fragments.cxsmiles -x 0.4 -o sorted2.cxsmiles
    

    The first four fragments are shown below:

    sorted2.cxsmiles

    Observe that smaller fragments with large occurrence are taken first.

  4. Sort fragments by occurrence only (scoring function: c1) by setting this exponent to 0:
    fragstat fragments.cxsmiles -x 0 -o sorted3.cxsmiles
    

    The first four fragments are shown below:

    sorted3.cxsmiles

    Observe that fragments with large occurrence are taken first, irrespective of their heavy atom counts.

  5. Now skip small fragments from the previous output by setting the minimum heavy atom count in the -m parameter to 3. In this way we sort fragments by occurrence while skipping fragments with 1-2 heavy atoms:
    fragstat fragments.cxsmiles -x 0 -m 3 -o sorted4.cxsmiles
    

    The first four fragments are shown below:

    sorted4.cxsmiles
  6. Make statistics with only molecule 4 (bornaprolol) being in the inactive category:
    fragstat fragments.cxsmiles -c 1 -o stat1.cxsmiles
    

    The first four fragments are shown below:

    stat1.cxsmiles
  7. Make statistics with 3 categories, only molecule 3 (bopindolol) being in the active category, only molecule 4 (bornaprolol) being in the inactive category, others in between:
    fragstat fragments.cxsmiles -c "1 40" -o stat2.cxsmiles
    

    The first four fragments are shown below:

    stat2.cxsmiles

    Note that multiple cutoff values should be enclosed in quotes.

  8. Make statistics with 4 categories, each molecule being in a separate category, specify discrete range:
    fragstat fragments.cxsmiles -r "0.05 4 5 50" -o stat3.cxsmiles
    

    The first four fragments are shown below:

    stat3.cxsmiles

    Note that multiple range values should be enclosed in quotes. In the case of a discrete range, category values are matched by exact string matching, so activity values can also be letters or other non-numerical strings.
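
The exact string matching used for a discrete range can be illustrated with a minimal Python sketch. The function name discrete_category is hypothetical, and returning None for an unlisted value is a choice made for this illustration only.

```python
def discrete_category(value, value_range):
    """Position of value in the range list, or None if it is not listed."""
    values = value_range.split()
    # Exact string comparison: "4" matches "4", but "4.0" does not.
    return values.index(value) if value in values else None


print(discrete_category("4", "0.05 4 5 50"))
print(discrete_category("4.0", "0.05 4 5 50"))
print(discrete_category("NAN", "PIC NAN MIC MIL LESS INA"))
```

Because the comparison is done on strings, the activity values in the input must be written exactly as they appear in the range list.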

 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.