FragmentStatistics
Version 5.9.4
Introduction
FragmentStatistics creates statistical results from the output of Fragmenter. The simplest usage is to remove duplicate fragments and sort fragments by occurrence, but FragmentStatistics can also sort fragments by molecule activity or other data read from the input molecules and stored together with the generated fragments.
The input of FragmentStatistics is the output of Fragmenter in cxsmiles format with the following fields:
- SMILES string
- atom labels storing fragment cleavage data
- unique ID (used for fragment duplicate check)
- input molecule data read from SDFile tag (optional, e.g. molecule activity)
The output of FragmentStatistics is a sorted cxsmiles table with the following data:
- SMILES string
- atom labels storing fragment cleavage data
- atom count
- fragment counts per activity categories (number of identical fragments in each activity category, one field for each)
- score
Fragments are sorted by activity, which is calculated by a scoring function:
ac^x * (w1*c1 + w2*c2 + ... + wN*cN)
where:
- ac is the heavy atom count
- w1, w2, ..., wN are the category weights in descending order (default: from +1 to -1, equidistant)
- c1, c2, ..., cN are the fragment counts in each category, in descending activity order
- x is the exponent of the heavy atom count (default: 1)
If there is no activity data, FragmentStatistics simply removes fragment duplicates
and sorts fragments by ac^x*c1, where c1 is the
fragment count. By default the exponent is 1, so the score is
ac*c1.
If there are two activity categories, the default scoring function is
ac*(c1 - c2); if there are three categories, it is
ac*(c1 - c3), because the middle category has the default weight 0.
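The scoring function and its default weights can be sketched in a few lines of Python. This is only an illustration of the formula described above, not the FragmentStatistics implementation; the function names `default_weights` and `fragment_score` are hypothetical, and treating the single-category case as weight 1 is an assumption taken from the ac^x*c1 formula.

```python
def default_weights(n):
    """Equidistant weights from +1 down to -1 for n activity categories."""
    if n < 1:
        raise ValueError("at least one category is required")
    if n == 1:
        # Assumption: with no activity data the single count has weight 1,
        # which gives the ac^x * c1 score described in the text.
        return [1.0]
    step = 2.0 / (n - 1)
    return [1.0 - i * step for i in range(n)]


def fragment_score(ac, counts, x=1.0, weights=None):
    """Compute ac^x * (w1*c1 + ... + wN*cN).

    ac      -- heavy atom count of the fragment
    counts  -- fragment counts per category, in descending activity order
    x       -- exponent of the heavy atom count (default 1)
    weights -- category weights; equidistant defaults if omitted
    """
    if weights is None:
        weights = default_weights(len(counts))
    if len(weights) != len(counts):
        raise ValueError("weights and counts must have the same length")
    return ac ** x * sum(w * c for w, c in zip(weights, counts))


if __name__ == "__main__":
    print(fragment_score(5, [3]))        # no activity data: ac * c1
    print(fragment_score(5, [3, 1]))     # two categories: ac * (c1 - c2)
    print(fragment_score(5, [3, 7, 1]))  # three categories: ac * (c1 - c3)
```

With three categories the middle count does not influence the score, since its default weight is 0.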
Examples
- Two activity ranges with cutoff value 0.5:
command line:
-c "0.5"
scale:
------------------------|--------------------
< 0.5: Inactive        0.5        >= 0.5: Active
| Name | Activity value | Weight |
|---|---|---|
| Active | >= 0.5 | +1 |
| Inactive | < 0.5 | -1 |
Scoring formula:
ac * (#(Active) - #(Inactive))
- Discrete activity values:
command line:
-r "PIC NAN MIC MIL LESS INA"
| Name | Activity value | Weight |
|---|---|---|
| Picomolar inhibitor | PIC | +1 |
| Nanomolar inhibitor | NAN | +0.6 |
| Micromolar inhibitor | MIC | +0.2 |
| Millimolar inhibitor | MIL | -0.2 |
| Less than millimolar | LESS | -0.6 |
| Inactive | INA | -1 |
Scoring formula:
ac * (#(Picomolar) + 0.6 * #(Nanomolar) + 0.2 * #(Micromolar) - 0.2 * #(Millimolar) - 0.6 * #(Less than millimolar) - #(Inactive))
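The discrete example can be reproduced with a short Python sketch that assigns a weight to each label and scores a fragment from the labels of the molecules it occurs in. The helper names `weights_by_label` and `score_from_labels` are hypothetical, and the label list passed in the usage lines is made-up data. The sketch assumes that the first label in the list receives weight +1 and the last receives -1, as in the example above.

```python
from collections import Counter


def weights_by_label(range_spec):
    """Map each label of a space-separated range string to an equidistant weight.

    Assumption: the first label gets +1 and the last gets -1,
    as in the discrete example above.
    """
    labels = range_spec.split()
    n = len(labels)
    if n == 0:
        raise ValueError("empty range")
    if n == 1:
        return {labels[0]: 1.0}
    return {label: 1.0 - 2.0 * i / (n - 1) for i, label in enumerate(labels)}


def score_from_labels(ac, labels_seen, range_spec, x=1.0):
    """Score one fragment from the activity labels of its occurrences."""
    weights = weights_by_label(range_spec)
    counts = Counter(labels_seen)  # identical fragments per category
    unknown = set(counts) - set(weights)
    if unknown:
        # Labels are compared by exact string match.
        raise ValueError(f"labels not in range: {sorted(unknown)}")
    return ac ** x * sum(weights[label] * c for label, c in counts.items())


if __name__ == "__main__":
    spec = "PIC NAN MIC MIL LESS INA"
    # Hypothetical fragment with 10 heavy atoms seen in four molecules.
    print(score_from_labels(10, ["PIC", "PIC", "NAN", "INA"], spec))
```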
A set of working examples is also available.
Usage
fragstat [<options>] [<input file>]
Prepare the usage of the fragstat script or batch file
as described in Preparing the Usage of JChem
Batch Files and Shell Scripts.
Alternatively, the FragmentStatistics class can be directly invoked:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
chemaxon.fragmenter.FragmentStatistics ^
[<options>] [<input files/strings>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.fragmenter.FragmentStatistics \
[<options>] [<input files/strings>]
Options
-h, --help this help message
-s, --help-scoring help on fragment scoring
-c, --cutoffs <cutoffs> category classes with continuous range
cutoffs: a list of cutoff values
defining activity intervals
(e.g. "0.5 2.3 5.3")
-r, --range <range> category classes with discrete range
range: a list of all possible values
in ascending activity order
(e.g. "0 1 2 3")
-t, --order-type <order> category activity order
a: ascending (the smaller the more active)
d: descending (the bigger the more active)
(default: d)
-p, --attachment-point <N|S|A> fragment attachment points:
N: no marker atoms
S: denoted by any-atom (star) markers
A: denoted by Al and Ar atom markers
(Al: aliphatic, Ar: aromatic)
(default: N)
-a, --all-fragments display all fragments in output
(default: only actives in highest category)
-m, --min-atom-count minimum number of heavy atoms
required in a fragment
(default: 0 - no restriction)
-e, --default-activity <value> default activity value set for fragments
with no activity value
(default: skip these fragments)
-d, --display-header display table header in output
(otherwise displays only cxsmiles)
-o, --output <filepath> output file path (default: stdout)
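When fragstat is driven from a script, the options listed above can be assembled programmatically. The following Python sketch only builds the argument list; the helper name `build_fragstat_command` is hypothetical, it covers only a subset of the documented options, and treating -c and -r as mutually exclusive is an assumption.

```python
def build_fragstat_command(input_path, output_path, cutoffs=None,
                           activity_range=None, order_type=None,
                           min_atom_count=None, all_fragments=False):
    """Return an argument list for fragstat using only documented options."""
    if cutoffs is not None and activity_range is not None:
        # Assumption: categories come from either cutoffs or a range, not both.
        raise ValueError("specify either cutoffs or activity_range")
    cmd = ["fragstat"]
    if cutoffs is not None:
        # Multiple values are passed as one space-separated argument.
        cmd += ["-c", " ".join(str(c) for c in cutoffs)]
    if activity_range is not None:
        cmd += ["-r", " ".join(str(v) for v in activity_range)]
    if order_type is not None:
        if order_type not in ("a", "d"):
            raise ValueError("order_type must be 'a' or 'd'")
        cmd += ["-t", order_type]
    if min_atom_count is not None:
        cmd += ["-m", str(min_atom_count)]
    if all_fragments:
        cmd.append("-a")
    cmd += ["-o", output_path, input_path]
    return cmd


if __name__ == "__main__":
    print(build_fragstat_command("fragments.cxsmiles", "stat2.cxsmiles",
                                 cutoffs=[1, 40]))
```

The resulting list can be passed to `subprocess.run`, which avoids shell quoting problems with multi-value arguments such as "1 40".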
FragmentStatistics takes its input from the cxsmiles output of
Fragmenter and sorts fragments
with duplicate filtering. If the fragmented molecules contain activity data
in an SDF field then Fragmenter can be run with
the --statistics parameter
to store this data in the created fragments. FragmentStatistics can then be used
to sort fragments by activity, as measured by a scoring function
(for help on scoring and its parameters, type fragstat -s).
Activity categories are determined either by cutoff values specified in the
--cutoffs parameter or else by the complete activity range specified
in the --range parameter.
In the former case the cutoff values determine a finite set of activity intervals
(the first and the last intervals are half-infinite).
In the latter case the activity range is finite and the complete set of activity
values is listed in the --range parameter.
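The two ways of determining categories can be sketched in Python as follows. The function names are hypothetical. For cutoffs, the sketch assumes that a value equal to a cutoff belongs to the upper interval, as in the "< 0.5: Inactive, >= 0.5: Active" example, and that the same boundary rule applies when the order type is a. For a discrete range, it returns only the position of the value in the list and makes no claim about which end is most active.

```python
from bisect import bisect_right


def category_from_cutoffs(value, cutoffs, order_type="d"):
    """Return a 0-based category index, where 0 is the most active category.

    N cutoffs define N+1 intervals; the first and last are half-infinite.
    """
    ordered = sorted(cutoffs)
    # Interval index in ascending value order; value == cutoff goes upward.
    interval = bisect_right(ordered, value)
    if order_type == "d":  # the bigger the more active
        return len(ordered) - interval
    if order_type == "a":  # the smaller the more active
        return interval
    raise ValueError("order_type must be 'a' or 'd'")


def category_from_range(value, range_values):
    """Return the position of value in the listed range (exact string match)."""
    return range_values.index(str(value))  # ValueError if the value is not listed


if __name__ == "__main__":
    print(category_from_cutoffs(0.5, [0.5]))   # on the cutoff: upper interval
    print(category_from_cutoffs(4, [1, 40]))   # middle of three categories
    print(category_from_range("5", "0.05 4 5 50".split()))
```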
The command line parameter --order-type determines whether high
activity corresponds to large or small numerical activity values. By default,
FragmentStatistics treats bigger values as more active. This is appropriate when the
activity data measures the activity itself; specify the opposite order (a) if
the activity is given as the minimal concentration needed to achieve some chemical effect.
The command line parameter --attachment-point determines whether attachment points
are represented by specific marker atoms. These atoms can be either any-atoms or Al and
Ar atoms depending on whether their substituents are aliphatic or aromatic in the input
molecule. The scoring function depends on this parameter setting: atoms denoting attachment points are skipped
when counting heavy atoms. It is important to set this parameter to the same value
that was used in the fragment command that produced the fragments. By default, no marker atoms
are considered.
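The effect of the attachment point setting on the heavy atom count can be illustrated with a small Python sketch. It works on a plain list of atom symbols rather than on a parsed structure, and the function name `heavy_atom_count` is hypothetical; it is meant only to show why a mismatched setting changes the score.

```python
def heavy_atom_count(symbols, attachment_point="N"):
    """Count heavy atoms, skipping attachment point markers.

    symbols          -- atom symbols of one fragment, e.g. ["C", "C", "O", "*"]
    attachment_point -- "N" (no markers), "S" (star markers), "A" (Al/Ar markers)
    """
    markers = {"N": set(), "S": {"*"}, "A": {"Al", "Ar"}}
    if attachment_point not in markers:
        raise ValueError("attachment_point must be 'N', 'S' or 'A'")
    skipped = markers[attachment_point] | {"H"}  # hydrogens are not heavy atoms
    return sum(1 for symbol in symbols if symbol not in skipped)


if __name__ == "__main__":
    fragment = ["C", "C", "O", "*"]
    print(heavy_atom_count(fragment, "S"))  # marker skipped
    print(heavy_atom_count(fragment, "N"))  # marker counted as an atom
```

If the fragments were generated with star markers but the statistics are run with the default N, each marker is counted as a heavy atom and the scores are inflated accordingly.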
If the command line parameter --all-fragments is specified then the output
contains all fragments, otherwise only actives appearing in the highest activity category
are included in the output. If there are no activity categories then all fragments are
regarded as active.
If the command line parameter --min-atom-count is specified then
fragments with fewer heavy atoms than this limit are excluded from the statistics.
If a default activity value is specified in the command line parameter
--default-activity then this activity value is set for fragments with no activity
value. If this parameter is omitted then fragments with no activity value are skipped.
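The filtering rules for --min-atom-count and --default-activity can be summarized in a short Python sketch. The record layout (SMILES, heavy atom count, activity or None) and the function name `prepare_records` are hypothetical and do not reflect the actual cxsmiles field handling.

```python
def prepare_records(records, min_atom_count=0, default_activity=None):
    """Apply the minimum atom count and default activity rules.

    records -- iterable of (smiles, heavy_atom_count, activity_or_None)
    """
    kept = []
    for smiles, atom_count, activity in records:
        if atom_count < min_atom_count:
            continue  # too small: excluded from the statistics
        if activity is None:
            if default_activity is None:
                continue  # no activity and no default: fragment is skipped
            activity = default_activity
        kept.append((smiles, atom_count, activity))
    return kept


if __name__ == "__main__":
    data = [("CCO", 3, "50"), ("C", 1, "4"), ("c1ccccc1", 6, None)]
    print(prepare_records(data, min_atom_count=3))
    print(prepare_records(data, min_atom_count=3, default_activity="0.05"))
```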
If the command line parameter --display-header is specified then the output
table header is included in the output. This is useful when the output is read as a data
table, but in this case the output is not a cxsmiles molecule file and cannot be mview-ed.
Input
The input is the cxsmiles output of Fragmenter with specific fields.
If no input file name or input string is specified in the command line then input is taken from the standard input.
Output
FragmentStatistics writes output molecules in cxsmiles format
with specific fields and an optional header line.
If the --output parameter is omitted, results are written to the standard output.
Examples
In the examples below, we first fragment the input molecules in
mols.sdf, where activity data is given in the
ACTIVITY SDF field.
We use Fragmenter.xml as Fragmenter configuration. Note that we generate all fragments: there is no limit on the number of fragment sets or fragments per fragment set.
- Fragment the structures from the mols.sdf file
and write the molecule fragments, with activity values read from the
ACTIVITY SDF field, to fragments.cxsmiles:
fragment -c Fragmenter.xml -s ACTIVITY mols.sdf -o fragments.cxsmiles
Some of the generated fragments are shown below:
fragments.cxsmiles 
- Sort fragments with duplicate filtering by running FragmentStatistics
with no activity categories:
fragstat fragments.cxsmiles -o sorted1.cxsmiles
The first four fragments are shown below:
sorted1.cxsmiles 
Observe that big fragments take precedence.
- Modify the order of fragments by adjusting the
ac^x*c1 scoring function. Set the exponent in the -x parameter:
fragstat fragments.cxsmiles -x 0.4 -o sorted2.cxsmiles
The first four fragments are shown below:
sorted2.cxsmiles 
Observe that smaller fragments with large occurrence are taken first.
- Sort fragments by occurrence only (scoring function:
c1) by setting this exponent to 0:
fragstat fragments.cxsmiles -x 0 -o sorted3.cxsmiles
The first four fragments are shown below:
sorted3.cxsmiles 
Observe that fragments with large occurrence are taken first irrespective of their heavy atom counts.
- Now skip small fragments from the previous output by setting the minimum heavy atom
count in the
-m parameter to 3. In this way we sort fragments by occurrence while skipping fragments with 1-2 heavy atoms:
fragstat fragments.cxsmiles -x 0 -m 3 -o sorted4.cxsmiles
The first four fragments are shown below:
sorted4.cxsmiles 
- Make statistics with only molecule 4 (bornaprolol)
being in the inactive category:
fragstat fragments.cxsmiles -c 1 -o stat1.cxsmiles
The first four fragments are shown below:
stat1.cxsmiles 
- Make statistics with
3 categories, with only molecule 3 (bopindolol) in the active category, only molecule 4 (bornaprolol) in the inactive category, and the others in between:
fragstat fragments.cxsmiles -c "1 40" -o stat2.cxsmiles
The first four fragments are shown below:
stat2.cxsmiles 
Note that multiple cutoff values should be enclosed in quotes.
- Make statistics with
4 categories, each molecule being in a separate category, by specifying a discrete range:
fragstat fragments.cxsmiles -r "0.05 4 5 50" -o stat3.cxsmiles
The first four fragments are shown below:
stat3.cxsmiles 
Note that multiple range values should be enclosed in quotes. With a discrete range, category values are matched by exact string matching, so activity values can also be letters or other non-numerical strings.
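The effect of the -x exponent and the -m limit used in the examples above can be illustrated with a small Python sketch. The three fragments and their counts are made-up data, not the actual content of fragments.cxsmiles, and `rank_fragments` is a hypothetical helper that applies the ac^x*c1 score without activity categories.

```python
def rank_fragments(fragments, x=1.0, min_atom_count=0):
    """Rank fragments by ac^x * c1, highest score first.

    fragments -- list of (smiles, heavy_atom_count, occurrence_count)
    """
    kept = [f for f in fragments if f[1] >= min_atom_count]
    ranked = sorted(kept, key=lambda f: f[1] ** x * f[2], reverse=True)
    return [smiles for smiles, _, _ in ranked]


if __name__ == "__main__":
    # Hypothetical fragments: (SMILES, heavy atoms, occurrences)
    frags = [("c1ccccc1", 6, 2), ("CCO", 3, 3), ("C", 1, 5)]
    print(rank_fragments(frags, x=1))                      # big fragments first
    print(rank_fragments(frags, x=0.4))                    # frequent small fragments move up
    print(rank_fragments(frags, x=0))                      # occurrence only
    print(rank_fragments(frags, x=0, min_atom_count=3))    # small fragments skipped
```

Lowering the exponent reduces the influence of fragment size on the score, and at 0 the ranking depends on the occurrence count alone.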

