Fragmenter Examples


Introduction

These examples demonstrate the use of the Fragmenter and the FragmentStatistics programs.

First we show how to run the fragment command and explain some of its command line options as well as its configuration.

FragmentStatistics is a supplementary tool that takes the Fragmenter cxsmiles output as input and performs duplicate filtering of fragments, as well as an optional categorization by chemical activity data. Fragments are sorted by a scoring function, which is a weighted combination of the atom count and the occurrence rates in each category. The fragstat command and its command line options are demonstrated on 50 sample molecules with activity data.

 

Prerequisites

These examples run the fragment UNIX shell script under UNIX / Linux or the fragment.bat batch file under Windows.

To run these examples:

  1. The Java Virtual Machine version 1.5 or higher and JChem have to be installed on your system.

  2. The PATH environment variable has to be set as described in the Preparing and Running JChem's Batch Files and Shell Scripts manual.

  3. A command shell (under UNIX / Linux: your favorite shell, under Windows: a Cygwin shell or a Command Prompt) has to be run in the fragmenter example directory.
    In UNIX / Linux:
    cd jchem/examples/fragmenter
    
    In Windows:
    cd jchem\examples\fragmenter
    

 

Fragmentation Examples

The examples below fragment the following input molecule (stored in input.mol):

the input molecule

The Fragmenter has one generic rule: Fragmenter never cleaves a ring-bond. In this way, each cleavage bond increases the number of fragments by 1, and the number of fragments in a complete fragmentation (called a fragment set) is one more than the number of cleavage bonds corresponding to the fragmentation.
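The counting argument above can be checked on a toy graph. The following sketch is not Fragmenter code: it models a molecule as an undirected graph with made-up atom indices (a three-membered ring with a two-bond chain attached) and counts the connected pieces left after removing bonds. All names in it are hypothetical.

```python
def count_components(n_atoms, bonds):
    """Count connected components with a simple union-find."""
    parent = list(range(n_atoms))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(a) for a in range(n_atoms)})


# Hypothetical toy molecule: ring atoms 0-1-2, chain 2-3-4.
RING_BONDS = [(0, 1), (1, 2), (2, 0)]
CHAIN_BONDS = [(2, 3), (3, 4)]


def fragments_after_cleaving(cleaved):
    """Number of fragments left after removing the given bonds."""
    kept = [b for b in RING_BONDS + CHAIN_BONDS if b not in cleaved]
    return count_components(5, kept)
```

In this model each cleaved chain bond adds exactly one fragment, while cleaving a single ring bond leaves the atom set connected. This is why the "cleavage bonds + 1" count relies on ring bonds never being cut.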

The first set of examples demonstrates molecule fragmentation by cleavage rules defined in the Fragmenter.xml XML configuration file. The cleavage reactions are shown below:

  1. amide cleavage rule:

    amide

  2. ester cleavage rule:

    ester

  3. amine cleavage rule:

    amine

  4. urea cleavage rule:

    urea

  5. ether cleavage rule:

    ether

  6. olefin cleavage rule:

    olefin

  7. quatN cleavage rule:

    quatN

  8. cleavage rule between aromatic carbons:

    aromc-aromc

  9. sulphonamide cleavage rule:

    sulphonamide

The configuration file Fragmenter.xml also contains a standardization section which is used to standardize the input molecule. The current standardization only aromatizes the input molecule. Refer to the Standardizer Manual and its configuration section for details.

  1. Fragment the input molecule by:
    fragment -c Fragmenter.xml input.mol -f sdf:-a -o fragments1.sdf
    

    Note that we set sdf:-a as the output format in the -f parameter because our fragments are aromatized by the standardization, while the SDF format is supposed to store the dearomatized form.

    The resulting fragments are stored in fragments1.sdf.

    By default, Fragmenter writes its output in cxsmiles format, which can be processed by Fragment Statistics. Fragmentation cleavage data is stored in the atom labels. In these examples we use SDF format, where fragmentation cleavage data is also stored in SDF tags. You can see these data items in MView by setting Table / Show Fields (only some of the fragments are shown):

    default fragmentation result

    You may not be very satisfied with this result: some fragments are very small, and others are not very interesting. There are a couple of ways to improve this result, as you will see in the examples that follow.

  2. Modify the fragmentation parameters: you can either do this directly in the Fragmentation subsection of the Params section of the Fragmenter.xml configuration file, or else override its options by command line options. This example applies the latter. We reduce the number of fragments in a complete fragmentation of the molecule (called a fragment set) from 4 to 3 (option -x) and reduce the number of fragment sets to be generated from 8 to 2 (option -y):
    fragment -c Fragmenter.xml input.mol -f sdf:-a -o fragments2.sdf -x 3 -y 2
    

    The resulting fragments are stored in fragments2.sdf.

    Some fragments from the result set (still not optimal) are shown below:

    fragmentation result with max 2 fragmentations, max 3 fragments in each

    You can also include intermediate fragment sets: those that can be subdivided by adding more cleavage bonds. In our case this means that we include the 2-fragment fragmentations that can be subdivided into 3 fragments by cleaving one more bond, as well as the starting molecule itself. Now increase the number of fragment sets to be generated to a huge value so that all possibilities can be generated (you can do this by setting -y to a huge value or else by removing the MaxSetCount attribute from the Fragmenter.xml configuration):

    fragment -c Fragmenter.xml input.mol -f sdf:-a -o fragments3.sdf -x 3 -y 200 -e
    

    The resulting fragments are stored in fragments3.sdf.

    Some sample fragments from the huge result set:

    extensive fragmentation result with max 3 fragments in a fragmentation

    You may still not be satisfied with the results: the next example shows a bunch of customizable rules that can be added in a single option.

  3. The RECAP Algorithm imposes some revision rules on cleavage bonds. The idea is that, although a bond may satisfy a cleavage reaction rule defined in the configuration, cleaving it may still be undesirable in the specific situation because:

    • it may cleave an interesting ligand that makes the fragment special
    • it may result in well-known, too small or uninteresting fragments
    • it may result in a fragment having too many open bonds

    For a precise description of the RECAP rules, see The RECAP Algorithm section in the Fragmenter Manual. For a description of the configuration options of the RECAP rules, see The RECAP Parameters section in the Fragmenter Manual. Our corresponding configuration section is the <Recap> section of the FragmenterRecap.xml XML configuration.

    Run Fragmenter with the RECAP rules by adding a Reviser section to your configuration - the new configuration XML is FragmenterRecap.xml:

    fragment -c FragmenterRecap.xml input.mol -f sdf:-a -o fragments4.sdf
    

    The resulting fragments are stored in fragments4.sdf and are shown below:

    RECAP fragmentation result

  4. You can fine-tune your fragmentation by playing with a couple of parameter settings for both the RECAP algorithm and the general fragmentation. For example, you may try the following: with the same RECAP parameters, allow only 2 fragments in a fragment set (-x 2) and include all fragmentations with a practically unlimited number of fragment sets (-y 200):
    fragment -c FragmenterRecap.xml input.mol -f sdf:-a -o fragments5.sdf -x 2 -y 200
    

    The resulting fragments are stored in fragments5.sdf.

    Now you can see that the RECAP rules with our configuration allow only 3 cleavage bonds:

    RECAP fragmentation result: the 3 RECAP cleavages

    For comparison, see the same without the RECAP rules:

    fragment -c Fragmenter.xml input.mol -f sdf:-a -o fragments6.sdf -x 2 -y 200
    

    The resulting fragments are stored in fragments6.sdf.

    Some sample fragment pairs from the result:

    Cleavages without the RECAP rules

    Observe that when we applied the RECAP rules, we had only 3 possible cleavage bonds with 3 resulting fragment sets containing 3*2=6 fragments altogether, while without these rules we have 10 possible cleavage bonds with 10 resulting fragment sets containing 10*2=20 fragments altogether, with 2 fragment repetitions. The resulting 18 different fragments in the latter case contain some molecules you might not want to see in a fragmentation. This indicates the strength of the RECAP rules.
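The fragment counts quoted in the examples above can be reproduced with a simplified model. The sketch below is not how Fragmenter enumerates fragment sets; it assumes that every subset of the candidate cleavage bonds is a valid cleavage set, ignores the -y limit, and uses hypothetical names throughout. A set of k bonds gives k + 1 fragments, so at most max_fragments - 1 bonds may be cut (the -x limit), and the extensive flag mimics -e by also keeping sets that could still be extended.

```python
from itertools import combinations


def fragment_sets(cleavage_bonds, max_fragments, extensive=False):
    """Enumerate cleavage-bond subsets under the simplified model.

    Without `extensive`, only sets that cannot take one more bond within
    the fragment limit are kept (the unextendable sets).
    """
    max_bonds = min(len(cleavage_bonds), max_fragments - 1)
    sizes = range(0, max_bonds + 1) if extensive else [max_bonds]
    return [set(c) for k in sizes for c in combinations(cleavage_bonds, k)]


def total_fragments(sets):
    """Each set of k cleavage bonds yields k + 1 fragments."""
    return sum(len(s) + 1 for s in sets)


with_recap = fragment_sets(["b1", "b2", "b3"], max_fragments=2)
without_recap = fragment_sets(list(range(10)), max_fragments=2)
print(len(with_recap), total_fragments(with_recap))        # 3 6
print(len(without_recap), total_fragments(without_recap))  # 10 20
```

With -x 2 the model gives 3 sets and 6 fragments for 3 cleavage bonds, and 10 sets and 20 fragments for 10 cleavage bonds, matching the numbers above. The 2 repeated fragments that reduce 20 to 18 depend on the actual structures and are not captured here.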

In the second set of examples we show a configuration with a simple cleavage rule: cut non-ring single bonds starting from a ring atom. Intuitively speaking, our fragment set will consist of ring systems and connecting chains.

To achieve this, we will use the RingChain.xml and the RingChainRecap.xml configuration files containing the following rules:

ring-chain cleavage: [*R:1]-[*R0:2]>>[*R:1].[*R0:2]

ring-ring cleavage: [*R:1]-[*R:2]>>[*R:1].[*R:2]

We use these two rules because we want to exclude chain-chain cleavages. As we already mentioned, ring-ring cleavages refer to non-ring bonds only, that is, the two end atoms should belong to different rings because Fragmenter never cuts a ring bond.

In the examples below we use the default cxsmiles output but display the fragments in dearomatized form. We generate one fragment set (see the MaxSetCount parameter in the configuration files) with an unlimited number of cleavages (default setting).
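The effect of the two rules can be summarized as a small decision function. This is only an illustrative sketch with hypothetical names; it takes ring membership as given boolean flags rather than computing it from a structure, and it does not call Fragmenter or evaluate the reaction rules themselves.

```python
def classify_bond(bond_in_ring, atom1_in_ring, atom2_in_ring, single=True):
    """Classify a bond the way the two rules above treat it."""
    if bond_in_ring:
        return "ring bond"      # never cut by Fragmenter
    if not single:
        return "not matched"    # both rules require a single bond
    if atom1_in_ring and atom2_in_ring:
        return "ring-ring"      # non-ring bond linking two rings
    if atom1_in_ring or atom2_in_ring:
        return "ring-chain"
    return "chain-chain"        # deliberately excluded


def is_cleaved(bond_in_ring, atom1_in_ring, atom2_in_ring, single=True):
    """True if one of the two rules would cut the bond."""
    kind = classify_bond(bond_in_ring, atom1_in_ring, atom2_in_ring, single)
    return kind in ("ring-ring", "ring-chain")
```

For example, is_cleaved(False, True, False) is true for a substituent bond on a ring, while is_cleaved(False, False, False) is false for a bond inside a chain.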

  1. Fragment the input molecule by:
    fragment -c RingChain.xml input.mol -o fragments7.cxsmiles
    

    This gives the following fragments:

    fragments7.cxsmiles
  2. Run the above fragmentation with any-atom attachment point markers:
    fragment -c RingChain.xml -p S input.mol -o fragments8.cxsmiles
    

    This gives the following fragments:

    fragments8.cxsmiles
  3. Now set the RECAP parameter MinAtomCount to 2 in order to stick single atoms to rings (note that we also set CutRingCHetero to true in RECAP):
    fragment -c RingChainRecap.xml input.mol -o fragments9.cxsmiles
    

    This gives the following fragments:

    fragments9.cxsmiles
  4. Run the above fragmentation with Al-Ar attachment point markers. This adds Al atoms for aliphatic and Ar atoms for aromatic attachments:
    fragment -c RingChainRecap.xml input.mol -p A -o fragments10.cxsmiles
    

    This gives the following fragments:

    fragments10.cxsmiles

    Observe that now we have two CO fragments, one with aliphatic and one with aromatic attachment.

The use and meaning of command-line options in the above commands:

Option  Description                                   Default
-c      configuration file                            -
-x      max number of fragments in a fragment set     unlimited
-y      max number of fragment sets in a molecule     unlimited
-e      include fragment sets corresponding to        accept only unextendable cleavage bond
        extendable cleavage bond sets                 sets for creating a fragment set
-f      specifies the output file format              cxsmiles
-o      specifies the output file path                standard output (console)
-p      specifies the attachment point marker atoms   no marker atoms
 

Now it is your turn:

  • change the RECAP parameters in the <RECAP> section of the FragmenterRecap.xml configuration XML (refer to The RECAP Parameters section in the Fragmenter Manual);
  • change the input molecule input.mol;
  • change the general fragmentation parameters in the <Fragmentation> subsection (also try to delete the MaxFragmentCount and / or the MaxSetCount attributes to see the default (unlimited) behavior).

 

Fragment Statistics Examples

FragmentStatistics can be used for duplicate filtering and sorting fragments created by Fragmenter. FragmentStatistics can also categorize and sort fragments by chemical activity, based on activity data given in a specific SDF field of the input molecules.

We use a 50 molecule sample input stored in beta2_adrenoceptor_antagonists.sdf. Activity values are given in the ACTIVITY SDF field:

beta2_adrenoceptor_antagonists.sdf
  1. First create fragments with activity data in cxsmiles format. For the purpose of fragment statistics, it may be best to start with a broad set of fragments with no reviser algorithm (e.g. RECAP), in extensive mode, with no limitation on the number of fragment sets or on the number of fragments in a fragment set. The scoring function will determine the activity value of each fragment.

    We apply the FragmenterAll.xml Fragmenter configuration to create all fragments:

    fragment -c FragmenterAll.xml beta2_adrenoceptor_antagonists.sdf -s ACTIVITY -o fragments.cxsmiles
    

    Note that we have to create fragments in the default cxsmiles format if we want to make fragment statistics. The SDF field containing the activity data is specified in the -s parameter. This is optional and only needed for chemical activity based fragment sorting.

    Fragments are stored in fragments.cxsmiles. Some sample fragments out of the 897 generated fragments are shown below:

    fragments.cxsmiles

    Note that field_1 contains the activity data of the corresponding input molecule.

  2. Now make fragment statistics with duplicate filtering and sorting:
    fragstat fragments.cxsmiles -o sorted.cxsmiles
    

    We have 494 fragments sorted by the default scoring function: the product of the atom count and the fragment occurrence:

    sorted.cxsmiles

    Data fields:

    • field_0: atom count
    • field_1: fragment occurrence
    • field_2: score (atom count * fragment occurrence)

  3. Next we include activity data in the statistics, with cutoff 1. This means that molecules with activity value at least 1 are considered active, while all others are inactive. By default, only fragments appearing in the active set are listed in the output (you can include all fragments by specifying the -a parameter).
    fragstat fragments.cxsmiles -c 1 -o stat.cxsmiles
    

    We have 348 active fragments sorted by the default scoring function: the product of the atom count and the difference between fragment occurrences in the active and the inactive sets:

    stat.cxsmiles

    Data fields:

    • field_0: atom count
    • field_1: fragment occurrence in the active set (activity >= 1)
    • field_2: fragment occurrence in the inactive set (activity < 1)
    • field_3: score (atom count * (active occurrence - inactive occurrence))

    Note that a table header with field captions is included if the -d parameter is specified. In this case, however, the output is no longer in cxsmiles format and cannot be viewed directly in MView.
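The two default scoring functions described in the steps above can be written out in a few lines. The sketch below is not the fragstat implementation. It assumes a hypothetical record layout of (fragment string, atom count, activity), treats two fragments as duplicates when their strings are equal, and sorts by descending score; none of these details are confirmed by the manual.

```python
from collections import defaultdict


def fragment_scores(records, cutoff=None):
    """Merge duplicate fragments and score them.

    records: iterable of (fragment, atom_count, activity) tuples.
    Without a cutoff the score is atom_count * occurrence; with a cutoff
    it is atom_count * (active occurrence - inactive occurrence).
    """
    stats = defaultdict(lambda: {"atoms": 0, "active": 0, "inactive": 0})
    for fragment, atoms, activity in records:
        entry = stats[fragment]
        entry["atoms"] = atoms
        if cutoff is None or activity >= cutoff:
            entry["active"] += 1
        else:
            entry["inactive"] += 1

    rows = []
    for fragment, e in stats.items():
        if cutoff is not None and e["active"] == 0:
            continue  # list only fragments that appear in the active set
        rows.append((fragment, e["atoms"] * (e["active"] - e["inactive"])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data, not taken from beta2_adrenoceptor_antagonists.sdf.
records = [
    ("c1ccccc1", 6, 2.0),
    ("c1ccccc1", 6, 0.5),
    ("CO", 2, 3.0),
    ("CCN", 3, 0.2),
]
print(fragment_scores(records))            # occurrence-based score
print(fragment_scores(records, cutoff=1))  # activity-based score
```

With the sample data, the benzene fragment scores 6 * 2 = 12 without a cutoff, but 6 * (1 - 1) = 0 with cutoff 1, because it occurs once in each set. The CCN fragment drops out of the cutoff listing because it never occurs in an active molecule.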

The use and meaning of command-line options in the above commands:

Option  Description                      Default
-c      cutoff values                    -
-a      output all fragments             output only actives
-d      include table header             cxsmiles format, no header
-o      specifies the output file path   standard output (console)
 

Now it is your turn:

  • change the cutoff values - you can use multiple cutoffs to specify more activity intervals, e.g. try -c "1 6" or -c "1 4.5". Note that multiple cutoff values should be enclosed in quotes.
  • change scoring parameters - you can get a short help on scoring by typing fragstat -s.
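To see how several cutoffs could split the activity range into intervals, consider the following sketch. It is an assumption, extrapolated from the single-cutoff case where a value of at least 1 counts as active, that each cutoff is an inclusive lower bound of the next interval; the function name is hypothetical and fragstat may handle boundaries differently.

```python
from bisect import bisect_right


def activity_interval(activity, cutoffs):
    """Index of the interval containing `activity`.

    0 means below the first cutoff; each cutoff reached moves the value
    one interval up (assumed inclusive lower bounds).
    """
    return bisect_right(sorted(cutoffs), activity)


for value in (0.5, 1, 5.9, 6):
    print(value, activity_interval(value, [1, 6]))
```

With the cutoffs 1 and 6 from -c "1 6", this model yields three intervals: below 1, from 1 up to but not including 6, and 6 or above.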