Fragment - The RECAP method
Version 5.9.4
Cleavage rules
Fragmenter fragments molecules based on predefined cleavage rules. The cleavage rules are given in the form of reaction molecules in the configuration XML.
By default, all non-ring bonds matching the cleavage bonds in the rules are cleaved. However, it is possible to provide a revision algorithm that forbids certain cuts depending on predefined criteria (e.g. the resulting fragment size, the structural environment of the bond, the number of cleaved bonds in the resulting fragments, etc.). Currently one such algorithm is implemented: the RECAP method.
The RECAP algorithm applies the following cleavage revision rules:
- Never cut a hydrogen-connecting bond.
- Never cut a bond connecting a ring-carbon and a hetero atom (optional).
- Never cut ring bonds. (Fragmenter always keeps this rule, we add it here for completeness.)
- Refuse a cut if any of the resulting fragments is on the specified Notlist.
- Refuse a cut if the number of open bonds in any of the resulting fragments exceeds the specified limit.
- Refuse a cut if the number of atoms in any of the resulting fragments is less than the predefined minimal atom count.
The cut-bond reactions, the forbidden fragment list (notlist), the maximum number of open bonds per fragment and the minimum number of atoms per fragment are specified in the configuration XML.
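The three fragment-level revision rules above (notlist, open bond limit, minimum atom count) can be pictured with the following minimal Python sketch. It is not the JChem implementation: the Fragment class, the accept_cut function and the plain string comparison against the notlist are hypothetical simplifications introduced only for illustration (a real reviser would compare molecular structures, not strings).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    # Hypothetical stand-in for a fragment; not a JChem class.
    structure: str   # structure string used for the notlist lookup
    atom_count: int  # number of atoms in the fragment
    open_bonds: int  # number of cleaved (open) bonds on the fragment


def accept_cut(fragments, notlist, max_open_bonds, min_atom_count):
    """Return True if no resulting fragment violates a fragment-level rule."""
    for frag in fragments:
        if frag.structure in notlist:          # fragment is forbidden
            return False
        if frag.open_bonds > max_open_bonds:   # too many open bonds
            return False
        if frag.atom_count < min_atom_count:   # fragment is too small
            return False
    return True


# Assumed example values, mirroring the limits used in the sample configuration.
notlist = {"CCCC", "CC(C)C"}
ok = accept_cut([Fragment("c1ccccc1", 6, 1), Fragment("CCOC", 4, 1)], notlist, 4, 4)
on_notlist = accept_cut([Fragment("c1ccccc1", 6, 1), Fragment("CCCC", 4, 1)], notlist, 4, 4)
too_small = accept_cut([Fragment("c1ccccc1", 6, 1), Fragment("N", 1, 1)], notlist, 4, 4)
print(ok, on_notlist, too_small)
```

A cut is refused as soon as a single resulting fragment fails any of the checks, which is why the last call rejects a cut that would leave a single-atom fragment.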
The following cleavage data is stored in SDF tags (molecule properties) for each fragment, if specified in the configuration:
- unique fragment ID (Uid): the reaction indices with atom maps for each atom, separated by semicolons, in canonical atom order - that is, two fragment IDs coincide if and only if the fragments represent the same molecular structure with corresponding cleavage data
- cleavage reaction IDs with atom maps for each atom, separated by semicolons (CutIds)
- the number of cleaved bonds for each atom, separated by semicolons (CutCounts)
- the total number of cleaved bonds (the sum of the cleaved bonds per atom) (CutSum)
- the multiplicity of the fragment (Count): the number of fragments with the same molecular structure and same unique fragment ID (Uid) found in the fragmentation of the input molecule
- the fragment set indices of these identical fragments, separated by commas (FragmentSets)
- the index or ID of the input molecule in the input file
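The relation between the per-atom cut counts (CutCounts) and their total (CutSum) can be checked with a few lines of Python. This is only a sketch: the sample string is made up, and the function name cut_sum is hypothetical; the only thing taken from the description above is that the per-atom values are separated by semicolons.

```python
def cut_sum(cut_counts: str) -> int:
    """Sum a semicolon-separated list of per-atom cleaved bond counts."""
    return sum(int(field) for field in cut_counts.split(";") if field.strip())


# Made-up CutCounts value for a five-atom fragment with two cut atoms.
example = "0;1;0;0;2"
total = cut_sum(example)
print(total)
```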
Example
Apply the RECAP cleavage revision algorithm for the ether and amine cleavage reactions:
[Figure: the ether and amine cleavage reactions]
Note that applying these rules will usually not result in single oxygen or nitrogen atom fragments: setting the MinAtomCount RECAP parameter to a value greater than 1 prevents Fragmenter from creating single-atom fragments.
Take the input molecule:
[Figure: the input molecule]
Then the following fragments will be generated if bond cleavage between ring-carbons and hetero atoms (see revision rule 2. above) is forbidden (FragmenterRecap1.xml):
fragment -c FragmenterRecap1.xml input.mol -y 1 -f sdf:-a
[Figure: the fragments generated with FragmenterRecap1.xml]
while one more cleavage is performed if bond cleavage between ring-carbons and hetero atoms (see revision rule 2. above) is allowed (FragmenterRecap2.xml):
fragment -c FragmenterRecap2.xml input.mol -y 1 -f sdf:-a
[Figure: the fragments generated with FragmenterRecap2.xml]
Note that we set sdf:-a as the output format in the -f parameter
because our fragments are aromatized due to standardization,
but the SDF format is supposed to store the dearomatized form.
A set of working examples is also available.
Usage
fragment -c <config file> [<options>] [<input files/strings>]
Prepare the usage of the fragment script or batch file
as described in Preparing the Usage of JChem
Batch Files and Shell Scripts.
Alternatively, the Fragmenter class can be directly invoked:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp "c:\jchem\lib\jchem.jar;%CLASSPATH%" ^
chemaxon.fragmenter.Fragmenter ^
-c <config file> [<options>] ^
[<input files/strings>]
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp "/usr/local/jchem/lib/jchem.jar:$CLASSPATH" \
chemaxon.fragmenter.Fragmenter \
-c <config file> [<options>] \
[<input files/strings>]
Options
General Options:
  -h, --help                        this help message
  -c, --config <filepath>           configuration XML file
  -x, --fragment-count <count>      the maximum number of fragments
                                    per fragment set (default: unlimited)
  -y, --set-count <count>           the maximum number of fragment sets
                                    per molecule (default: unlimited)
  -e, --extensive                   include extendable cut sets
  -i, --id <SDF tag>                SDFile tag that stores the molecule ID
                                    (default: the molecule index)
  -s, --statistics <SDF tag>        SDFile tag that stores data used later
                                    for making statistics (default: none)
  -g, --ignore-error                continue with next molecule on error
Output Options:
  -f, --format <format>             output file format (default: cxsmiles)
  -o, --output <filepath>           output file path (default: stdout)
  -k, --skip-unfragmented           skip unfragmented molecules in output
  -d, --data <N|L|I|LI>             fragment cut-data:
                                      N: no data stored
                                      L: in atom labels
                                      I: in fragment ID
                                      LI: in both atom labels and fragment ID
                                    (default: LI)
  -p, --attachment-point <N|S|A>    fragment attachment points:
                                      N: no marker atoms
                                      S: denoted by any-atom (star) markers
                                      A: denoted by Al and Ar atom markers
                                         (Al: aliphatic, Ar: aromatic)
                                    (default: N)
The command line parameter --config is mandatory. This
specifies the path and filename of a configuration file without which the
program cannot operate. A detailed description of the format of this
configuration file is given below.
The command line parameter --fragment-count
specifies the maximum number of fragments to be generated in one fragment set.
This parameter overrides the MaxFragmentCount
attribute specified in the configuration XML.
The command line parameter --set-count
specifies the maximum number of fragment sets to be generated per molecule.
This parameter overrides the MaxSetCount
attribute specified in the configuration XML.
If the command line parameter --extensive is specified then
fragment sets originating from cut sets that can be extended by adding more cleavage bonds
are also added to the result fragments.
This parameter overrides the Extensive attribute specified
in the configuration XML.
If the command line parameter --skip-unfragmented is specified then
unfragmentable molecules will not be written to the output.
By default, cleavage data is stored and visualized both in the atom labels and in the associated
fragment IDs (the latter are used during the fragment duplicate check in
FragmentStatistics).
This can be changed with the command line parameter --data:
- N: fragmentation data is not stored
- L: data is written in atom labels
- I: fragment ID is written
- LI: data is written to both atom labels and fragment IDs (default)
Altogether, cleavage data can be stored in three places:
- in the SDF tag specified in the CutIds attribute of the SDFTags element in the configuration XML
- in atom labels
- in the fragment ID: in the UID SDF tag or in the cxsmiles field
The first method is only available for SDF / MRV output. Atom labels are also useful for visualizing cleavage data at each atom; however, they may be distracting if the cleavage data strings are long or if atom labels are used for a different purpose.
The command line parameter --attachment-point determines whether attachment points
are represented by specific marker atoms. These atoms can be either any-atoms or Al and
Ar atoms depending on whether their substituents are aliphatic or aromatic in the input
molecule. If marker atoms are added, then they are connected with the matching target bond and cleavage
data is written at both end-atoms of the connection bond. No marker atoms are added by default.
The command line parameter --id specifies the SDF tag storing
the molecule ID to be written to the output SDF as reference to the source molecule
that the fragment has been generated from.
The command line parameter --statistics
specifies an SDFile tag name that stores some data of the input molecule: this can be a
real number from a continuous range or a value from a discrete range, depending on
the type of the data. This data is copied into the fragments and will be used by
FragmentStatistics when counting fragments
falling into a certain data class (which is a user-defined interval in a
continuous range and a single value in a discrete range). Without this data field,
FragmentStatistics simply counts identical fragments.
Note that FragmentStatistics requires fragments
written in cxsmiles format, therefore do not change the default output format
of Fragmenter when FragmentStatistics is to be run.
For details on making statistical data on fragments generated by Fragmenter,
refer to the FragmentStatistics Manual.
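The idea of counting fragments per data class can be sketched in Python as follows. This is not the FragmentStatistics implementation: the function data_class, the sample records and in particular the treatment of values that fall exactly on a class boundary are assumptions made for illustration. The boundaries 0.2 and 0.5 split a continuous range into three classes.

```python
from bisect import bisect_right
from collections import Counter


def data_class(value, boundaries):
    """Return the index of the interval that contains value.

    Assumption: a value equal to a boundary belongs to the upper interval.
    """
    return bisect_right(boundaries, value)


boundaries = [0.2, 0.5]  # three classes: below 0.2, 0.2 to 0.5, 0.5 and above

# Made-up (fragment, data value) records, one per fragment occurrence.
records = [
    ("c1ccccc1", 0.10),
    ("c1ccccc1", 0.30),
    ("c1ccccc1", 0.35),
    ("CCO", 0.70),
]

# Count occurrences of each fragment within each data class.
counts = Counter((frag, data_class(value, boundaries)) for frag, value in records)
for (frag, cls), n in sorted(counts.items()):
    print(frag, cls, n)
```

Without a data field the class index would simply be dropped, which reduces the count to the number of identical fragments.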
If the command line parameter --ignore-error is specified, then import/export errors
will not stop the processing but the error is written to the console and the molecule is skipped.
By default, the program exits in case of molecule import/export errors.
Input
Most molecular file formats are accepted (MDL molfile, Compressed molfile, SDfile, Compressed SDfile, SMILES, etc.).
If no input file name or input string is specified in the command line then input is taken from the standard input.
Output
By default, Fragmenter writes output molecules in cxsmiles format with the following fields:
- SMILES string
- atom labels storing fragment cleavage data
- unique ID (used for fragment duplicate check within one input molecule)
- input molecule data read from SDFile tag (only if the command line parameter --statistics is specified)
- fragment count (number of identical fragments within one input molecule)
Other output formats can be specified in the --format parameter.
Note that FragmentStatistics requires fragments
written in cxsmiles format.
The --output parameter specifies the output file path.
If omitted, results are written to the standard output.
Configuration
The cleavage reactions are determined by the configuration file
(specified with the mandatory --config command line parameter).
An optional standardization section can be provided to perform pre-standardization on reaction reactants, products and input molecules. See the Standardizer manual for information on standardization.
The configuration XML may also specify reviser algorithm parameters in a separate section. If this section is omitted then no cleavage revision is made, that is, all bonds matching the cleavage reaction cleavage bonds are cleaved. Currently only the RECAP algorithm is implemented, therefore there are only two options:
- either no reviser is specified and all cleavage bonds matching the reactions are accepted,
- or else RECAP parameters are specified in which case the RECAP cleavage revision rules are applied.
The cleavage reactions are given in <Action>
subsections under the <Fragmenter> section.
Each reaction has an ID attribute and a
Structure attribute as well as an
optional Type attribute which specifies whether the
Structure attribute is a file path (Type="path")
or a molecule string (Type="string"). Several actions can have the same
ID attribute; in this way alternate reaction definitions
may be specified for one cleavage rule (see the ether definitions
below).
If the Type attribute is omitted then the structure type is
automatically decided based on its format which gives the correct result
in most cases.
For a description of reaction mapping, see the Reaction mapping section of the Reactor Manual.
Unlike in usual reaction definitions, atom maps do not have to be unique here: identical atom maps denote symmetric atoms (see the ether and amine reactions in the introduction example). The ID attribute together with the matching reaction
atom map is written in the fragment SDF tag to identify the cleavage bond endpoint.
The SDFTags section specifies which
cleavage data should be stored in fragment SDF tags and specifies
these SDF tag names (if the attribute is omitted then the corresponding data will not
be stored).
The Fragmentation section
specifies the following fragmentation parameters:
- the maximum number of fragments to be generated per fragment set
- the maximum number of fragment sets to be generated per molecule
- the extensive fragmentation option
An optional cleavage bond reviser algorithm implementation may be applied with
parameters listed under the <Reviser> section.
The implementation Java class is specified in the Class
attribute. Reviser-specific parameters are specified in subsections of the reviser
section. Currently only the RECAP reviser algorithm is available.
RECAP parameters that are specified in subsections are:
- Notlist: the list of forbidden fragments. Molecules are specified in the Structure attribute and the optional Type attribute, similarly to reaction definitions.
- Limits:
  - MaxCutCount: the maximum number of open bonds per fragment
  - MinAtomCount: the minimum number of atoms per fragment
- Options:
  - CutRingCHetero: "true" if cleavage between ring carbons and hetero atoms is allowed, "false" otherwise (default: "false")
Example
<FragmenterConfiguration Version ="0.1" schemaLocation="fragment_schema.xsd">
    <Standardizer>
        <Actions>
            <Reaction ID="plusminus" Structure="[*+:1][*-:2]>>[*:1]=[*:2]"/>
            <Action ID="aromatize" Act="aromatize"/>
        </Actions>
    </Standardizer>
    <Fragmenter>
        <Actions>
            <Action ID="amide" Structure="[O:3]=[C!$(C([#7])(=O)[!#1!#6]):2]-[#7!$([#7][!#1!#6]):1]>>[O:3]=[C:2].[#7:1]"/>
            <Action ID="ester" Structure="[#6!$([#6](O)~[!#1!#6])][O:2][C:1]=O>>[C:1]=O.[#6][O:2]"/>
            <Action ID="amine" Structure="[#6:2]-[N!$(N[#6]=[!#6])!$(N~[!#1!#6])!X4:1]>>[N:1].[#6:2]"/>
            <Action ID="urea" Structure="N[C:1]([N:2])=O>>N[C:1]=O.[N:2]"/>
            <Action ID="ether" Structure="[#6]-[O!$(O[#6]~[!#1!#6]):1]-[#6:2]>>[#6:2].[O:1]-[#6]"/>
            <Action ID="olefin" Structure="[C:1]=[C:1]>>[C:1].[C:1]"/>
            <Action ID="quatN" Structure="[#6:1]-[N$(N([#6])([#6])([#6])[#6])!$(NC=[!#6]):2]>>[#6:1].[N:2]"/>
            <Action ID="aromN-carbon" Structure="[n:1]-[#6!$([#6]=[!#6]):2]>>[n:1].[#6:2]"/>
            <Action ID="lactamN-carbon" Structure="[C:3](=[O:4])@-[N:1]!@-[#6!$([#6]=[!#6]):2]>>[C:3](=[O:4])[N:1].[#6:2]"/>
            <Action ID="aromcarbon-aromcarbon" Structure="[c:1]-[c:1]>>[c:1].[c:1]"/>
            <Action ID="sulphonamide" Structure="[#7:1][S:2](=O)=O>>[#7:1].[S:2](=O)=O"/>
        </Actions>
        <Params>
            <SDFTags CutIds="REACTIONS" CutCounts="COUNTS" CutSum="SUM" Count="COUNT" FragmentSets="FRAGMENTSETS"/>
            <Fragmentation MaxFragmentCount="3" MaxSetCount="30" Extensive="false"/>
        </Params>
    </Fragmenter>
    <Reviser>
        <Recap Class="chemaxon.fragmenter.Recap">
            <Notlist>
                <Mol ID="butyl" Structure="CCCC"/>
                <Mol ID="ibutyl" Structure="CC(C)C"/>
            </Notlist>
            <Params>
                <Limits MaxCutCount="4" MinAtomCount="4"/>
                <Options CutRingCHetero="false"/>
            </Params>
        </Recap>
    </Reviser>
</FragmenterConfiguration>
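To double-check a configuration before running Fragmenter, the file can be inspected with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened copy of the example above (two actions only); it is not part of JChem, and the variable names are arbitrary. It lists the cleavage reaction IDs and reads the RECAP limits.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example configuration, embedded to keep the sketch self-contained.
CONFIG = """<FragmenterConfiguration Version="0.1">
  <Fragmenter>
    <Actions>
      <Action ID="urea" Structure="N[C:1]([N:2])=O>>N[C:1]=O.[N:2]"/>
      <Action ID="olefin" Structure="[C:1]=[C:1]>>[C:1].[C:1]"/>
    </Actions>
    <Params>
      <Fragmentation MaxFragmentCount="3" MaxSetCount="30" Extensive="false"/>
    </Params>
  </Fragmenter>
  <Reviser>
    <Recap Class="chemaxon.fragmenter.Recap">
      <Params>
        <Limits MaxCutCount="4" MinAtomCount="4"/>
        <Options CutRingCHetero="false"/>
      </Params>
    </Recap>
  </Reviser>
</FragmenterConfiguration>"""

root = ET.fromstring(CONFIG)

# Cleavage reaction IDs, in document order.
action_ids = [action.get("ID") for action in root.findall("./Fragmenter/Actions/Action")]

# RECAP limits; the element is missing if no reviser section is configured.
limits_element = root.find("./Reviser/Recap/Params/Limits")
limits = dict(limits_element.attrib) if limits_element is not None else {}

print(action_ids)
print(limits)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(CONFIG).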
Examples
- Fragments structures from the mols.sdf file and writes the molecule fragments to the standard output in cxsmiles format:
fragment -c Fragmenter.xml mols.sdf
- The same with SMILES string input:
fragment -c Fragmenter.xml "CC(CCN(C)COCCC1=CC=CC=C1C2=CC=CC=C2)COCN" "CCCCN(C)C(C(=O)C1CCCC(Cl)C1)C(C)C(Cl)Cl"
- Applies extensive fragmentation:
fragment -c Fragmenter.xml -e mols.sdf
- Creates at most 4 fragment sets with at most 5 fragments in each fragment set:
fragment -c Fragmenter.xml -x 5 -y 4 mols.sdf
- Performs fragmentation and writes fragments to o.sdf, then displays the result in MarvinView:
fragment -c Fragmenter.xml mols.sdf -f sdf -o o.sdf
mview o.sdf
- The same but directly pipes output to MarvinView:
fragment -c Fragmenter.xml mols.sdf -f sdf | mview -
Note that such piping does not work in Windows.
- Prepares simple fragment statistics (counts fragments and sorts the result by occurrences):
fragment -c Fragmenter.xml mols.sdf -o fragments.cxsmiles
fragstat fragments.cxsmiles
Or in a single command:
fragment -c Fragmenter.xml mols.sdf | fragstat
Note that such piping does not work in Windows.
- Prepares fragment statistics based on molecule data stored in the DATA SDFile tag of the input molecules (counts fragments by data classes and sorts the result by occurrences):
fragment -c Fragmenter.xml -s DATA mols.sdf -o fragments.cxsmiles
fragstat -c "0.2 0.5" fragments.cxsmiles
Or in a single command:
fragment -c Fragmenter.xml -s DATA mols.sdf | fragstat -c "0.2 0.5"
Note that such piping does not work in Windows.
Notes
Appropriate fragmentation parameter settings can be used to avoid combinatorial explosion:
- the maximum number of fragments should be set to a sufficiently small value (e.g. 3 or 4). This limit can be set either in the configuration file in the MaxFragmentCount fragmentation attribute or else in the command line parameter --fragment-count
- the maximum number of fragment sets should be set to a bigger value (e.g. 30 or 40). This limit can be set either in the configuration file in the MaxSetCount fragmentation attribute or else in the command line parameter --set-count
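A rough count shows why the fragment limit matters. The Python sketch below rests on a simplifying assumption that is not taken from the Fragmenter implementation: every non-empty subset of the cleavable bonds is treated as a possible cut set. Since only non-ring bonds are cut, cleaving k bonds yields k + 1 fragments, so a limit of m fragments per set allows at most m - 1 cuts. The function name cut_set_count is hypothetical.

```python
from math import comb


def cut_set_count(cleavable_bonds, max_fragments=None):
    """Count non-empty sets of cleaved bonds under the stated assumption.

    Cutting k non-ring bonds yields k + 1 fragments, so max_fragments
    allows at most max_fragments - 1 cuts per set.
    """
    if max_fragments is None:
        max_cuts = cleavable_bonds
    else:
        max_cuts = min(cleavable_bonds, max_fragments - 1)
    return sum(comb(cleavable_bonds, k) for k in range(1, max_cuts + 1))


unlimited = cut_set_count(20)                  # all non-empty subsets of 20 bonds
limited = cut_set_count(20, max_fragments=3)   # at most 2 cuts per set
print(unlimited, limited)
```

Under this assumption, a molecule with 20 cleavable bonds has over a million candidate cut sets without a limit, but only 210 when each set may contain at most 3 fragments.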
Fragment repetition is detected by first comparing the unique ID of the two fragments, then if the two IDs are the same, structure search is performed to test exact molecular structure matching. Note that this duplicate check is performed for each input molecule separately, therefore duplicated fragments may occur if they correspond to different input molecules. For a complete duplicate check with fragment sorting based on occurrences, create Fragment Statistics from the fragments created by Fragmenter.
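The two-stage duplicate check described above (cheap ID comparison first, structure comparison only for equal IDs) can be sketched in Python as follows. This is an illustration, not the Fragmenter code: the function count_fragments, the sample ID strings and the use of string equality in place of a real structure search are all assumptions.

```python
def count_fragments(fragments, same_structure):
    """Group (uid, structure) pairs and count duplicates.

    Stage 1: fragments are bucketed by unique ID.
    Stage 2: within a bucket, same_structure decides whether two
    fragments really are the same molecule.
    """
    by_uid = {}
    for uid, structure in fragments:
        bucket = by_uid.setdefault(uid, [])
        for entry in bucket:
            if same_structure(entry[0], structure):
                entry[1] += 1  # known fragment: increase its multiplicity
                break
        else:
            bucket.append([structure, 1])  # first occurrence of this structure
    return {uid: [tuple(entry) for entry in bucket] for uid, bucket in by_uid.items()}


# Made-up fragments of one input molecule; string equality stands in for structure search.
fragments = [
    ("ether:1", "CCO"),
    ("ether:1", "CCO"),
    ("ether:1", "CCC"),
    ("amine:2", "CN"),
]
result = count_fragments(fragments, lambda a, b: a == b)
print(result)
```

Running this per input molecule reproduces the behaviour noted above: identical fragments coming from different input molecules are not merged at this stage.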
References
Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. M. RECAP - Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998, 38, 511-522