Fragment - The RECAP method

Version 5.9.4

 

Cleavage rules


Fragmenter fragments molecules based on predefined cleavage rules. The cleavage rules are given in the form of reaction molecules in the configuration XML.

By default, all non-ring bonds matching the cleavage bonds in the rules are cleaved. However, it is possible to provide a revision algorithm that forbids certain cuts depending on predefined criteria (e.g. the resulting fragment size, the structural environment of the bond, the number of cleaved bonds in the resulting fragments, etc.). Currently one such algorithm is implemented: the RECAP method.

The RECAP algorithm applies the following cleavage revision rules:

  1. Never cut a hydrogen-connecting bond.
  2. Never cut a bond connecting a ring-carbon and a hetero atom (optional).
  3. Never cut ring bonds. (Fragmenter always keeps this rule; it is listed here for completeness.)
  4. Refuse a cut if any of the resulting fragments is on the specified Notlist.
  5. Refuse a cut if the number of open bonds in any of the resulting fragments exceeds the specified limit.
  6. Refuse a cut if the number of atoms in any of the resulting fragments is less than the predefined minimal atom count.

The cut-bond reactions, the forbidden fragment list (notlist), the maximum number of open bonds per fragment and the minimum number of atoms per fragment are specified in the configuration XML.
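Revision rules 4 to 6 amount to an acceptance test on the fragments that a candidate cut would produce. The following Python sketch illustrates that test only. The `Fragment` record, its fields and the `cut_allowed` function are hypothetical names introduced here, and looking up Notlist entries by a structure string is a simplifying assumption, not a description of how Fragmenter matches structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment record, not a Fragmenter data structure."""
    structure: str    # structure string, used only for the Notlist lookup
    atom_count: int   # number of atoms in the fragment
    open_bonds: int   # number of cleaved (open) bonds on the fragment


def cut_allowed(fragments, notlist, max_cut_count, min_atom_count):
    """Return True if no resulting fragment violates rules 4, 5 or 6."""
    for frag in fragments:
        if frag.structure in notlist:          # rule 4: forbidden fragment
            return False
        if frag.open_bonds > max_cut_count:    # rule 5: too many open bonds
            return False
        if frag.atom_count < min_atom_count:   # rule 6: fragment too small
            return False
    return True


# Illustrative limits and Notlist entries (assumed values).
notlist = {"CCCC", "CC(C)C"}
ok = cut_allowed(
    [Fragment("c1ccccc1", 6, 1), Fragment("CCOC", 4, 1)],
    notlist, max_cut_count=4, min_atom_count=4,
)
too_small = cut_allowed(
    [Fragment("c1ccccc1", 6, 1), Fragment("O", 1, 1)],
    notlist, max_cut_count=4, min_atom_count=4,
)
print(ok, too_small)
```

The second call is refused because the single-atom fragment falls below the minimal atom count, which is the situation the MinAtomCount parameter is meant to prevent.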

The following cleavage data is stored in SDF tags (molecule properties) for each fragment, if specified in the configuration:

  • unique fragment ID (Uid): the reaction indices with atom maps for each atom, separated by semicolons, in canonical atom order - that is, two fragment IDs coincide if and only if the fragments represent the same molecular structure with corresponding cleavage data
  • cleavage reaction IDs with atom maps for each atom, separated by semicolons (CutIds)
  • the number of cleaved bonds for each atom, separated by semicolons (CutCounts)
  • the total number of cleaved bonds (the sum of the cleaved bonds per atom) (CutSum)
  • the multiplicity of the fragment (Count): the number of fragments with the same molecular structure and same unique fragment ID (Uid) found in the fragmentation of the input molecule
  • the fragment set indices of these identical fragments, separated by commas (FragmentSets)
  • the index or ID of the input molecule in the input file
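Because CutSum is defined as the sum of the per-atom values in CutCounts, the two tags can be cross-checked when reading fragment files. The sketch below assumes that the CutCounts value is a plain semicolon-separated list of integers, as the description above suggests; the function names and the sample tag values are hypothetical.

```python
def parse_cut_counts(tag_value):
    """Split a semicolon-separated CutCounts value into per-atom integers."""
    return [int(field) for field in tag_value.split(";") if field.strip()]


def cut_sum_matches(cut_counts_value, cut_sum_value):
    """Check that CutSum equals the sum of the per-atom cleaved bond counts."""
    return sum(parse_cut_counts(cut_counts_value)) == int(cut_sum_value)


# Hypothetical tag values for a five-atom fragment.
counts = parse_cut_counts("0;1;0;0;2")
print(counts, cut_sum_matches("0;1;0;0;2", "3"))
```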

Example

Apply the RECAP cleavage revision algorithm for the ether and amine cleavage reactions:

Note that the application of these rules usually does not result in single oxygen or nitrogen atom fragments: setting the MinAtomCount RECAP parameter to a value greater than 1 prevents Fragmenter from creating single-atom fragments.

Take the input molecule:

input.mol

Then the following fragments will be generated if bond cleavage between ring-carbons and hetero atoms (see revision rule 2. above) is forbidden (FragmenterRecap1.xml):

fragment -c FragmenterRecap1.xml input.mol -y 1 -f sdf:-a

while one more cleavage is performed if bond cleavage between ring-carbons and hetero atoms (see revision rule 2. above) is allowed (FragmenterRecap2.xml):

fragment -c FragmenterRecap2.xml input.mol -y 1 -f sdf:-a

Note that we set sdf:-a as output format in the -f parameter because our fragments are aromatized due to standardization, but the SDF format is supposed to store the dearomatized form.

A set of working examples is also available.

Usage

    fragment -c <config file> [<options>] [<input files/strings>] 

Prepare the usage of the fragment script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Alternatively, the Fragmenter class can be directly invoked:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
        chemaxon.fragmenter.Fragmenter ^
        -c <config file> [<options>] ^
        [<input files/strings>]

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
        chemaxon.fragmenter.Fragmenter \
        -c <config file> [<options>] \
        [<input files/strings>]

Options

General Options: 
  -h, --help                          this help message
  -c, --config <filepath>             configuration XML file
  -x, --fragment-count <count>        the maximum number of fragments
                                      per fragment set (default: unlimited)
  -y, --set-count <count>             the maximum number of fragment sets
                                      per molecule (default: unlimited)
  -e, --extensive                     include extendable cut sets
  -i, --id <SDF tag>                  SDFile tag that stores the molecule ID
                                      (default: the molecule index)
  -s, --statistics <SDF tag>          SDFile tag that stores data used later
                                      for making statistics (default: none)
  -g, --ignore-error                  continue with next molecule on error

Output Options:
  -f, --format <format>               output file format (default: cxsmiles)
  -o, --output <filepath>             output file path (default: stdout)
  -k, --skip-unfragmented             skip unfragmented molecules in output
  -d, --data <N|L|I|LI>               fragment cut-data:
                                      N:  no data stored
                                      L:  in atom labels
                                      I:  in fragment ID
                                      LI: in both atom labels and fragment ID
                                      (default: LI)
  -p, --attachment-point <N|S|A>      fragment attachment points:
                                      N:  no marker atoms
                                      S:  denoted by any-atom (star) markers
                                      A:  denoted by Al and Ar atom markers
                                          (Al: aliphatic, Ar: aromatic)
                                      (default: N)

The command line parameter --config is mandatory. This specifies the path and filename of a configuration file without which the program cannot operate. A detailed description of the format of this configuration file is given below.

The command line parameter --fragment-count specifies the maximum number of fragments to be generated in one fragment set. This parameter overrides the MaxFragmentCount attribute specified in the configuration XML.

The command line parameter --set-count specifies the maximum number of fragment sets to be generated per molecule. This parameter overrides the MaxSetCount attribute specified in the configuration XML.

If the command line parameter --extensive is specified then fragment sets originating from cut sets that can be extended by adding more cleavage bonds are also added to the result fragments. This parameter overrides the Extensive attribute specified in the configuration XML.

If the command line parameter --skip-unfragmented is specified then unfragmentable molecules will not be written to the output.
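When Fragmenter is driven from a script, the options described above can be assembled into an argument list. The following Python sketch only builds the command and does not run it; `build_fragment_command` and its parameter names are hypothetical, while the option names are the ones listed in this manual.

```python
def build_fragment_command(config, inputs, fragment_count=None, set_count=None,
                           extensive=False, skip_unfragmented=False,
                           output_format=None, output_path=None):
    """Assemble an argument list for the fragment script (does not run it)."""
    cmd = ["fragment", "--config", config]
    if fragment_count is not None:
        # Overrides MaxFragmentCount from the configuration XML.
        cmd += ["--fragment-count", str(fragment_count)]
    if set_count is not None:
        # Overrides MaxSetCount from the configuration XML.
        cmd += ["--set-count", str(set_count)]
    if extensive:
        cmd.append("--extensive")
    if skip_unfragmented:
        cmd.append("--skip-unfragmented")
    if output_format is not None:
        cmd += ["--format", output_format]
    if output_path is not None:
        cmd += ["--output", output_path]
    return cmd + list(inputs)


cmd = build_fragment_command("Fragmenter.xml", ["mols.sdf"],
                             fragment_count=4, set_count=5,
                             output_format="sdf", output_path="o.sdf")
print(" ".join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True),
# assuming the fragment script is on the PATH.
```

Passing the arguments as a list avoids shell quoting problems when SMILES strings are given as input.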

By default, cleavage data is stored and visualized in the atom labels and associated fragment IDs (these are used during fragment duplicate check in FragmentStatistics). Both can be changed by specifying the command line parameter --data:

  • N: fragmentation data is not stored
  • L: data written in atom labels
  • I: fragment ID is written
  • LI: data is written to both atom labels and fragment IDs (default)

Note that Fragmenter can store cleavage data in the following ways:

  1. in the SDF tag specified in the CutIds attribute of the SDFTags element in the configuration XML
  2. in atom labels
  3. in fragment ID - in the UID SDF tag or in cxsmiles field

The first method is only available for SDF / MRV output. Atom labels are also useful for visualizing cleavage data at each atom; however, they may be distracting if the cleavage data strings are long or if atom labels are used for a different purpose.

The command line parameter --attachment-point determines whether attachment points are represented by specific marker atoms. These atoms can be either any-atoms or Al and Ar atoms depending on whether their substituents are aliphatic or aromatic in the input molecule. If marker atoms are added, then they are connected with the matching target bond and cleavage data is written at both end-atoms of the connection bond. No marker atoms are added by default.

The command line parameter --id specifies the SDF tag storing the molecule ID to be written to the output SDF as reference to the source molecule that the fragment has been generated from.

The command line parameter --statistics specifies an SDFile tag name that stores some data of the input molecule: this can be a real number from a continuous range or a value from a discrete range, depending on the type of the data. This data is copied into the fragments and will be used by FragmentStatistics when counting fragments falling into a certain data class (which is a user-defined interval in a continuous range and a single value in a discrete range). Without this data field, FragmentStatistics simply counts identical fragments. Note that FragmentStatistics requires fragments written in cxsmiles format; therefore, do not change the default output format of Fragmenter when FragmentStatistics is to be run. For details on making statistical data on fragments generated by Fragmenter, refer to the FragmentStatistics Manual.
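The idea of counting fragments per data class can be illustrated with a short Python sketch. It is not the FragmentStatistics implementation: the function names, the fragment IDs and the boundary values are hypothetical, and treating each interval as half-open (lower bound included, upper bound excluded) is an assumption made here.

```python
from bisect import bisect_right
from collections import Counter


def data_class(value, boundaries):
    """Index of the interval containing value (assumed half-open: [low, high))."""
    return bisect_right(boundaries, value)


def count_by_class(fragments, boundaries):
    """Count (fragment ID, data class) pairs.

    fragments: iterable of (fragment_id, data_value) pairs, where data_value
    is the number copied from the input molecule into the fragment.
    """
    return Counter((fid, data_class(value, boundaries)) for fid, value in fragments)


# Hypothetical fragments with data values; boundaries 0.2 and 0.5 give
# three classes: below 0.2, from 0.2 up to 0.5, and 0.5 or above.
fragments = [("fragA", 0.1), ("fragA", 0.3), ("fragB", 0.3),
             ("fragA", 0.35), ("fragB", 0.7)]
counts = count_by_class(fragments, [0.2, 0.5])
print(counts[("fragA", 1)], counts[("fragB", 2)])
```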

If the command line parameter --ignore-error is specified, then import/export errors will not stop the processing: the error is written to the console and the molecule is skipped. By default, the program exits in case of molecule import/export errors.

 

Input

Most molecular file formats are accepted (MDL molfile, Compressed molfile, SDfile, Compressed SDfile, SMILES, etc.).

If no input file name or input string is specified in the command line then input is taken from the standard input.

 

Output

By default, Fragmenter writes output molecules in cxsmiles format with the following fields:

  1. SMILES string
  2. atom labels storing fragment cleavage data
  3. unique ID (used for fragment duplicate check within one input molecule)
  4. input molecule data read from SDFile tag (only if the command line parameter --statistics is specified)
  5. fragment count (number of identical fragments within one input molecule)

Other output formats can be specified in the --format parameter. Note that FragmentStatistics requires fragments written in cxsmiles format.

The --output parameter specifies the output file path. If omitted, results are written to the standard output.

 

Configuration

The cleavage reactions are determined by the configuration file (specified following the --config mandatory command line parameter).

An optional standardization section can be provided to perform pre-standardization on reaction reactants, products and input molecules. See the Standardizer manual for information on standardization.

The configuration XML may also specify reviser algorithm parameters in a separate section. If this section is omitted then no cleavage revision is made, that is, all bonds matching the cleavage bonds of the cleavage reactions are cleaved. Currently only the RECAP algorithm is implemented, therefore there are only two options: no cleavage revision (the section is omitted) or RECAP revision.

The cleavage reactions are given in <Action> subsections under the <Fragmenter> section. Each reaction has an ID attribute and a Structure attribute as well as an optional Type attribute which specifies whether the Structure attribute is a file path (Type="path") or a molecule string (Type="string"). More actions can have the same ID attribute; in this way alternate reaction definitions may be specified for one cleavage rule (see the ether definitions below). If the Type attribute is omitted then the structure type is automatically decided based on its format which gives the correct result in most cases.

For a description of reaction mapping, see the Reaction mapping section of the Reactor Manual.

Unlike in usual reaction definitions, here atom maps do not have to be unique: identical atom maps denote symmetric atoms (see the ether and amine reactions in the introduction example). The ID attribute together with the matching reaction atom map is written in the fragment SDF tag to identify the cleavage bond endpoint.
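Repeated atom maps can be spotted in a reaction string with a simple text scan. The Python sketch below is an illustration only: it uses a regular expression rather than a real SMARTS parser, the function names are hypothetical, and the two reaction strings are the olefin and urea definitions from the configuration example in this manual.

```python
import re
from collections import Counter

# Matches the map number in patterns such as "[C:1]" or "[#7...:2]".
ATOM_MAP = re.compile(r":(\d+)\]")


def reactant_atom_maps(reaction):
    """Return the atom map numbers found on the reactant side of a reaction string."""
    reactant_side = reaction.split(">>", 1)[0]
    return [int(m) for m in ATOM_MAP.findall(reactant_side)]


def repeated_maps(reaction):
    """Map numbers used more than once on the reactant side (symmetric atoms)."""
    counts = Counter(reactant_atom_maps(reaction))
    return sorted(m for m, n in counts.items() if n > 1)


olefin = "[C:1]=[C:1]>>[C:1].[C:1]"
urea = "N[C:1]([N:2])=O>>N[C:1]=O.[N:2]"
print(repeated_maps(olefin), repeated_maps(urea))
```

In the olefin definition both carbon atoms carry map 1, so the two ends of the cleaved bond are treated as equivalent; the urea definition uses distinct maps for its two endpoints.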

The SDFTags section specifies which cleavage data should be stored in fragment SDF tags and specifies these SDF tag names (if the attribute is omitted then the corresponding data will not be stored).

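The Fragmentation section specifies the following fragmentation parameters:

  • MaxFragmentCount: the maximum number of fragments per fragment set
  • MaxSetCount: the maximum number of fragment sets per molecule
  • Extensive: "true" if fragment sets originating from extendable cut sets are also included, "false" otherwise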

An optional cleavage bond reviser algorithm implementation may be applied with parameters listed under the <Reviser> section. The implementation Java class is specified in the Class attribute. Reviser-specific parameters are specified in subsections of the reviser section. Currently only the RECAP reviser algorithm is available.

RECAP parameters that are specified in subsections are:

  1. The NotList: the list of forbidden fragments. Molecules are specified by the Structure attribute and the optional Type attribute, similarly to reaction definitions.
  2. Limits:
    • MaxCutCount: the maximum number of cleavage bonds per fragment
    • MinAtomCount: the minimum number of atoms per fragment
  3. Options:
    • CutRingCHetero: "true" if cleavage between ring carbons and hetero atoms is allowed, "false" otherwise (default: "false")

Example

<FragmenterConfiguration Version ="0.1" schemaLocation="fragment_schema.xsd">

<Standardizer>
    <Actions>
	<Reaction ID="plusminus" Structure="[*+:1][*-:2]>>[*:1]=[*:2]"/>
	<Action ID="aromatize" Act="aromatize"/>
    </Actions>
</Standardizer>

<Fragmenter>
    <Actions>
        <Action ID="amide" Structure="[O:3]=[C!$(C([#7])(=O)[!#1!#6]):2]-[#7!$([#7][!#1!#6]):1]>>[O:3]=[C:2].[#7:1]"/>
	<Action ID="ester" Structure="[#6!$([#6](O)~[!#1!#6])][O:2][C:1]=O>>[C:1]=O.[#6][O:2]"/>
	<Action ID="amine" Structure="[#6:2]-[N!$(N[#6]=[!#6])!$(N~[!#1!#6])!X4:1]>>[N:1].[#6:2]"/>
	<Action ID="urea" Structure="N[C:1]([N:2])=O>>N[C:1]=O.[N:2]"/>
	<Action ID="ether" Structure="[#6]-[O!$(O[#6]~[!#1!#6]):1]-[#6:2]>>[#6:2].[O:1]-[#6]"/>
	<Action ID="olefin" Structure="[C:1]=[C:1]>>[C:1].[C:1]"/>
	<Action ID="quatN" Structure="[#6:1]-[N$(N([#6])([#6])([#6])[#6])!$(NC=[!#6]):2]>>[#6:1].[N:2]"/>
	<Action ID="aromN-carbon" Structure="[n:1]-[#6!$([#6]=[!#6]):2]>>[n:1].[#6:2]"/>
	<Action ID="lactamN-carbon" Structure="[C:3](=[O:4])@-[N:1]!@-[#6!$([#6]=[!#6]):2]>>[C:3](=[O:4])[N:1].[#6:2]"/>
	<Action ID="aromcarbon-aromcarbon" Structure="[c:1]-[c:1]>>[c:1].[c:1]"/>
	<Action ID="sulphonamide" Structure="[#7:1][S:2](=O)=O>>[#7:1].[S:2](=O)=O"/>
    </Actions>
    <Params>
	<SDFTags CutIds="REACTIONS" CutCounts="COUNTS" CutSum="SUM" Count="COUNT" FragmentSets="FRAGMENTSETS"/>
	<Fragmentation MaxFragmentCount="3" MaxSetCount="30" Extensive="false"/>
    </Params>
</Fragmenter>

<Reviser>
    <Recap Class="chemaxon.fragmenter.Recap">
	<Notlist>
    	    <Mol ID="butyl" Structure="CCCC"/>
    	    <Mol ID="ibutyl" Structure="CC(C)C"/>
	</Notlist>
	<Params> 
	    <Limits MaxCutCount="4" MinAtomCount="4"/>
	    <Options CutRingCHetero="false"/>
	</Params>
    </Recap>
</Reviser>

</FragmenterConfiguration>
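The RECAP settings of such a configuration can be inspected with a few lines of Python. This is a sketch using only the standard library; `read_recap_settings` is a hypothetical helper, and the embedded XML is a trimmed copy of the reviser section with the reaction definitions omitted.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the reviser section; the reaction definitions are omitted.
CONFIG = """
<FragmenterConfiguration Version="0.1">
<Reviser>
    <Recap Class="chemaxon.fragmenter.Recap">
        <Notlist>
            <Mol ID="butyl" Structure="CCCC"/>
            <Mol ID="ibutyl" Structure="CC(C)C"/>
        </Notlist>
        <Params>
            <Limits MaxCutCount="4" MinAtomCount="4"/>
            <Options CutRingCHetero="false"/>
        </Params>
    </Recap>
</Reviser>
</FragmenterConfiguration>
"""


def read_recap_settings(xml_text):
    """Collect the RECAP Notlist, limits and options into a plain dict."""
    recap = ET.fromstring(xml_text).find("Reviser/Recap")
    limits = recap.find("Params/Limits")
    options = recap.find("Params/Options")
    # The manual gives "false" as the default for CutRingCHetero.
    cut_ring = options is not None and options.get("CutRingCHetero", "false") == "true"
    return {
        "notlist": {mol.get("ID"): mol.get("Structure") for mol in recap.iter("Mol")},
        "max_cut_count": int(limits.get("MaxCutCount")),
        "min_atom_count": int(limits.get("MinAtomCount")),
        "cut_ring_c_hetero": cut_ring,
    }


settings = read_recap_settings(CONFIG)
print(settings)
```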
 

Examples

  1. Fragments structures from the mols.sdf file and writes the molecule fragments to the standard output in cxsmiles format:
    fragment -c Fragmenter.xml mols.sdf
    
  2. The same with SMILES string input:
    fragment -c Fragmenter.xml "CC(CCN(C)COCCC1=CC=CC=C1C2=CC=CC=C2)COCN" "CCCCN(C)C(C(=O)C1CCCC(Cl)C1)C(C)C(Cl)Cl"
    
  3. Applies extensive fragmentation:
    fragment -c Fragmenter.xml -e mols.sdf
    
  4. Creates at most 5 fragment sets with at most 4 fragments in each fragment set:
    fragment -c Fragmenter.xml -x 4 -y 5 mols.sdf
    
  5. Performs fragmentation and writes fragments to o.sdf, then displays the result in MarvinView:
    fragment -c Fragmenter.xml mols.sdf -f sdf -o o.sdf
    mview o.sdf
    
  6. The same but directly pipes output to MarvinView:
    fragment -c Fragmenter.xml mols.sdf -f sdf | mview -
    

    Note that such piping does not work in Windows.

  7. Prepares simple fragment statistics (counts fragments and sorts the result by occurrences):
    fragment -c Fragmenter.xml mols.sdf -o fragments.cxsmiles
    fragstat fragments.cxsmiles
    

    Or in a single command:

    fragment -c Fragmenter.xml mols.sdf | fragstat 
    

    Note that such piping does not work in Windows.

  8. Prepares fragment statistics based on molecule data stored in the DATA SDFile tag of the input molecules (counts fragments by data classes and sorts the result by occurrences):
    fragment -c Fragmenter.xml -s DATA mols.sdf -o fragments.cxsmiles
    fragstat -c "0.2 0.5" fragments.cxsmiles
    

    Or in a single command:

    fragment -c Fragmenter.xml -s DATA mols.sdf | fragstat -c "0.2 0.5"
    

    Note that such piping does not work in Windows.

 

Notes

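  1. Appropriate fragmentation parameter settings (for example, limiting the number of fragments per fragment set and the number of fragment sets per molecule) can be used to avoid combinatorial explosion.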

  2. Fragment repetition is detected by first comparing the unique ID of the two fragments, then if the two IDs are the same, structure search is performed to test exact molecular structure matching. Note that this duplicate check is performed for each input molecule separately, therefore duplicated fragments may occur if they correspond to different input molecules. For a complete duplicate check with fragment sorting based on occurrences, create Fragment Statistics from the fragments created by Fragmenter.
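The two-stage duplicate check described in note 2 can be sketched as follows. This is an illustration under stated assumptions, not Fragmenter's code: `count_unique_fragments` is a hypothetical function, and plain string equality stands in for the exact structure search.

```python
from collections import defaultdict


def count_unique_fragments(fragments, same_structure):
    """Count distinct fragments of one input molecule.

    fragments: iterable of (uid, structure) pairs.
    same_structure: callable performing the exact structure comparison;
    it is only called for fragments whose unique IDs already coincide.
    Returns a list of [uid, structure, count] entries.
    """
    by_uid = defaultdict(list)   # uid -> entries seen with that uid
    for uid, structure in fragments:
        for entry in by_uid[uid]:
            if same_structure(entry[1], structure):
                entry[2] += 1     # duplicate: raise the multiplicity
                break
        else:
            # No entry with this uid matched: record a new fragment.
            by_uid[uid].append([uid, structure, 1])
    return [entry for entries in by_uid.values() for entry in entries]


# Toy data: string equality stands in for the structure search.
frags = [("u1", "CCO"), ("u1", "CCO"), ("u1", "COC"), ("u2", "CCN")]
result = count_unique_fragments(frags, lambda a, b: a == b)
print(result)
```

Comparing the inexpensive unique IDs first means that the costly structure comparison runs only within groups of fragments that already share an ID.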

 

References

  1. Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. M. RECAP - Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998, 38, 511-522
