Synthesizer Examples

Introduction

These examples demonstrate the use of the Synthesizer program. The purpose is to show how to run the synthesize command and explain some of its command line options as well as its configuration.

Synthesizer uses Reactor to perform a synthesis step. For a description of reaction definitions and mapping, see the Reaction definitions section of the Synthesizer manual and Reaction mapping section of the Reactor manual.

 

Prerequisites

These examples run the synthesize UNIX shell script under UNIX / Linux or the synthesize.bat batch file under Windows.

To run these examples:

  1. The Java Virtual Machine version 1.4 or higher and JChem have to be installed on your system.

  2. The PATH (all systems) and the JCHEMHOME (under Windows) environment variables have to be set as described in the Preparing and Running JChem's Batch Files and Shell Scripts manual.

  3. The synthesis example in database mode requires a database connection. Its connection settings should be saved in the .jchem file under the .chemaxon (UNIX / Linux) or chemaxon (Windows) directory in your user home directory. To set up and save these settings, use the database connection dialog that appears when you start JChemManager (jcman). Your settings are saved when you exit JChemManager.

  4. A command shell (under UNIX / Linux: your favorite shell, under Windows: a Cygwin shell or a Command Prompt) has to be run in the synthesizer example directory.
    In UNIX / Linux:
    cd jchem/examples/synthesizer
    
    In Windows:
    cd jchem\examples\synthesizer
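
The prerequisites above can be checked from a UNIX / Linux shell before running the examples. The following is only a minimal sketch: it assumes that java, synthesize and jcman are all meant to be reachable through PATH, and the helper names have_cmd and check_prereqs are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a command is reachable through PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: print one status line per prerequisite.
# It only reports and never aborts, so it is safe to run anywhere.
check_prereqs() {
    for cmd in java synthesize jcman; do
        if have_cmd "$cmd"; then
            echo "ok: $cmd found on PATH"
        else
            echo "missing: $cmd not found on PATH"
        fi
    done
    # The connection settings are needed for the database-mode examples only.
    if [ -f "$HOME/.chemaxon/.jchem" ]; then
        echo "ok: database settings file found"
    else
        echo "missing: $HOME/.chemaxon/.jchem (needed for database mode only)"
    fi
}

check_prereqs
```

A "missing" line for the settings file is harmless if you only plan to run the memory-mode and file-mode examples.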
    
 

Linear algorithm example

We will show a simplified version of a combinatorial synthesis (see [1] and [2]). This example synthesis is an application of the linear algorithm. The same synthesis will be performed in memory mode, file mode and database mode.

To run the examples, step into the linear subdirectory:

cd linear

The synthesis configuration is stored in Synthesizer.xml, which defines the synthesis graph; each edge of the graph corresponds to one synthesis step. The synthesis consists of the following 3 synthesis steps:

  1. Step 1

    Alkynes react with the scaffold taken as an alkyl-halide according to the following reaction:

    alkyne + alkyl-halide

  2. Step 2

    Products created in the previous step react with amines according to the following reaction:

    gamma-lactone + amine

  3. Step 3

    Carboxylic-acids react with the products created in the previous step according to the following reaction:

    carboxylic-acid + alcohol

The products created in the last step are put into the RESULT set of the synthesis. For a description of reaction definitions see the Reaction definitions section of the Reactor Manual. Reaction conditions are shown in MView after setting Table / Show Fields.

In addition to the reaction rules specified in the Reaction definitions, our synthesis steps also have a synthesis condition which determines whether the created products are relevant to the synthesis. These conditions are specified in the <Rule> subsections of our Synthesizer.xml configuration XML. We have the same condition for all synthesis steps:

Ncount(product(0)) + Ocount(product(0)) + Scount(product(0)) <= 10 && mass(product(0)) <= 700

which means that we accept a product only if the total number of nitrogen, oxygen and sulfur atoms is at most 10 and the molecule mass is at most 700.
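
To make the acceptance logic concrete, the rule can be restated in plain shell. This is not how Synthesizer evaluates the <Rule> expression; it is only a sketch of the same arithmetic, with a hypothetical function name (passes_rule) and made-up atom counts and masses.

```shell
# passes_rule N_COUNT O_COUNT S_COUNT MASS
# Exit status 0 if a product with these values would satisfy the rule, 1 otherwise.
passes_rule() {
    awk -v n="$1" -v o="$2" -v s="$3" -v mass="$4" \
        'BEGIN { exit !((n + o + s <= 10) && (mass <= 700)) }'
}

# Hypothetical values, chosen only for illustration.
if passes_rule 3 4 1 650.5; then echo "accepted"; else echo "rejected"; fi   # 8 N/O/S atoms, mass 650.5
if passes_rule 6 4 1 650.5; then echo "accepted"; else echo "rejected"; fi   # 11 N/O/S atoms
if passes_rule 2 2 0 712.3; then echo "accepted"; else echo "rejected"; fi   # mass above 700
```

Both parts of the condition must hold, so the second and third products are rejected even though each fails only one of the two limits.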

For comparison, we will also run the synthesis examples below without this rule. The configuration XML Synthesizer_norule.xml will be used in this case. The only difference from the original Synthesizer.xml configuration is that there are no <Rule> subsections.

All synthesis steps have the same parameters set in the <Step> sections: the Mode attribute is set to "comb", while the Type attribute is set to "all". This means that all of the above reactions are performed in combinatorial mode with all reaction centers processed.

We run Synthesizer with the Unique option that can be set as a global synthesis parameter in the Params section of the configuration XML. This means that product repetitions are filtered by the Reactor in each step.

Our scaffold molecule is stored in scaffold.smiles and shown below:

scaffold

An alternative scaffold used in the original synthesis described in [1] and [2] is stored in scaffold.mol (see Note 1).

Our additional input molecules are stored in alkynes.smiles, amines.smiles, carboxylic-acids.smiles and shown below:

alkynes

amines

carboxylic-acids

 

Memory mode

In memory mode Synthesizer uses memory-based molecule sets, which means that all molecules are stored in memory. Synthesis is fastest with this storage, but it has a serious limitation: only a few thousand molecules can be held in memory at a time (5,000-6,000 by default; the limit can be raised by setting the -Xmx option of the Java VM). This is not sufficient in most cases.

Run Synthesizer in memory mode by the m synthesis command:

synthesize m -c Synthesizer.xml -s scaffold scaffold.smiles -s alkyne alkynes.smiles -s amine amines.smiles -s carboxylic-acid carboxylic-acids.smiles -t SET -f sdf -o result_mem.sdf

The above command runs Synthesizer in memory mode, uses the configuration stored in Synthesizer.xml, and fills the input sets scaffold, alkyne, amine and carboxylic-acid with the molecules stored in scaffold.smiles, alkynes.smiles, amines.smiles and carboxylic-acids.smiles, respectively. Finally, it writes all synthesized molecules (together with the inputs and intermediate products) to result_mem.sdf, with the molecule set ID stored in the SET SDF tag.

Note that the SDF tag field_0 stores the origin code sequence used by the Synthesis Browser to color the atoms according to their originating molecule sets. Currently this coloring is available only in database mode.
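
Because the molecule set ID is written to the SET tag, the result file can be tallied per set with standard tools. The sketch below assumes the usual SDF layout, in which the tag header line (for example ">  <SET>") is followed by the value on the next line; count_by_set is a hypothetical helper name, not part of Synthesizer.

```shell
# count_by_set FILE
# Print "<set ID> <number of records>" for every value of the SET data item.
count_by_set() {
    awk '
        /^>.*<SET>/ { getline id; count[id]++ }   # the value is on the line after the tag header
        END { for (id in count) print id, count[id] }
    ' "$1" | sort
}

# Run only if the result file of the memory-mode example is present.
if [ -f result_mem.sdf ]; then
    count_by_set result_mem.sdf
fi
```

The count printed for the RESULT set is the number of final products; the other lines cover the input sets and the intermediate products.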

Some sample molecules from the synthesis result file result_mem.sdf:

molecules generated in memory mode

For comparison, run Synthesizer without the synthesis rules by using the configuration XML Synthesizer_norule.xml:

synthesize m -c Synthesizer_norule.xml -s scaffold scaffold.smiles -s alkyne alkynes.smiles -s amine amines.smiles -s carboxylic-acid carboxylic-acids.smiles -t SET -f sdf -o result_mem_norule.sdf

Some sample molecules from the synthesis result file result_mem_norule.sdf:

molecules generated in memory mode with no synthesis rule

You can see the complete list of memory-mode options by typing:

synthesize m -h

The use and meaning of command line options used in this example:

Some memory-mode synthesis options

Option   Description                                          Default
-c       configuration file                                   -
-s       molecule set ID with input file                      -
-t       SDF tag storing the molecule SET ID                  the SET ID is not stored
-f       specifies the output format (e.g. 'sdf', 'mol')      'smiles'
-o       specifies the output file path                       standard output (console)
 

File mode

In file mode Synthesizer uses file-based molecule sets, which means that each molecule set is stored in a separate file. The file format is specified with the -f command line option (default: SMILES). The molecule set files are placed in a newly created subdirectory of the current directory, named after the synthesis name given in the mandatory -n command line option. Each file is named after its molecule set ID.

Compared to memory mode, file mode is slower but does not have the serious limitation on the number of molecules. Molecule set files are kept after the synthesis has finished.

Run Synthesizer in file mode with the f synthesis command. The command below uses the synthesis name syn (-n syn) and SMILES molecule sets (-f smiles, which is the default molecule set format anyway), and requests no output other than the molecule sets themselves (-m):

synthesize f -n syn -c Synthesizer.xml -s scaffold scaffold.smiles -s alkyne alkynes.smiles -s amine amines.smiles -s carboxylic-acid carboxylic-acids.smiles -f smiles -m

The synthesis sets can be found in the syn subdirectory.

If a syn subdirectory already exists (e.g. because you have previously run the synthesis in file mode), a new subdirectory name is generated by appending a random number to syn.

Taken together, the generated molecules are the same as those generated in memory mode.
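
A quick way to inspect the file-mode output is to count the molecules in each set file. The sketch below assumes that a SMILES set file holds one molecule per non-empty line and that the files sit directly in the synthesis directory; count_set_files is a hypothetical helper name.

```shell
# count_set_files DIR
# Print "<set file name> <number of non-empty lines>" for each regular file in DIR.
count_set_files() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        echo "$(basename "$f") $(grep -c . "$f")"
    done
}

# Run only if the synthesis directory of the file-mode example is present.
if [ -d syn ]; then
    count_set_files syn
fi
```

If the directory name was extended with a random number because syn already existed, pass that directory name instead.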

For comparison, run Synthesizer without the synthesis rules by using the configuration XML Synthesizer_norule.xml:

synthesize f -n syn_norule -c Synthesizer_norule.xml -s scaffold scaffold.smiles -s alkyne alkynes.smiles -s amine amines.smiles -s carboxylic-acid carboxylic-acids.smiles -f smiles -m

The synthesis sets can be found in the syn_norule subdirectory.

You can see the complete list of file-mode options by typing:

synthesize f -h

The use and meaning of command line options used in this example:

Some file-mode synthesis options

Option   Description                                                            Default
-c       configuration file                                                     -
-s       molecule set ID with input file                                        -
-n       the synthesis directory name                                           -
-f       specifies the synthesis set file format (e.g. 'sdf', 'mol')            'smiles'
-m       output only the molecule set files                                     output all molecules in an output file/stream
                                                                                apart from the generated molecule sets
 

Database mode

In database mode Synthesizer stores molecule sets in a database. Note that you need a properly configured database connection to run Synthesizer in database mode. The main advantage of database mode is that you can browse the molecules in the Synthesis Browser, view the structures colored according to origin codes, or view the corresponding synthesis path. Molecules are stored in a regular JChem structure table, while synthesis data (origin code, synthesis set ID, synthesis path) are stored in separate custom tables.

In database mode we use separate commands for creating a synthesis (command c), importing molecules into synthesis sets (command i) and running the synthesis (command r). The Synthesizer Manual contains the complete list of available commands. You can see the complete list of command-specific options with:

synthesize <command> -h

For example, type

synthesize i -h

to display import options.

  1. First create the synthesis by command c: specify the synthesis name in the -n option and the synthesis configuration file in the -c option:
    synthesize c -n syn -c Synthesizer.xml
    

    The synthesis name identifies the synthesis in the Synthesis Browser and also serves as the basis for the synthesis table names.

    If you already have a synthesis with name syn then either choose a different synthesis name, or else delete the synthesis by command d before creating it as shown above:

    synthesize d -n syn
    synthesize c -n syn -c Synthesizer.xml
    

  2. Then import molecules into the molecule sets by command i:
    synthesize i -n syn -s scaffold scaffold.smiles
    synthesize i -n syn -s alkyne alkynes.smiles
    synthesize i -n syn -s amine amines.smiles
    synthesize i -n syn -s carboxylic-acid carboxylic-acids.smiles
    

  3. Finally, run the synthesis by command r:
    synthesize r -n syn
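
The three steps above can be collected into one small wrapper. The sketch below uses only the commands and options shown in this example; the function name run_db_synthesis and the DRY_RUN switch are made up for illustration, and the error handling assumes that synthesize returns a non-zero exit status on failure.

```shell
# run_db_synthesis NAME CONFIG
# Create the synthesis, import the four input sets, then run the synthesis.
# With DRY_RUN=1 the commands are printed instead of executed.
run_db_synthesis() {
    name=$1
    config=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then run=echo; else run=; fi

    $run synthesize c -n "$name" -c "$config" || return 1
    # Set IDs and input files exactly as used in the steps above.
    for pair in scaffold:scaffold.smiles alkyne:alkynes.smiles \
                amine:amines.smiles carboxylic-acid:carboxylic-acids.smiles; do
        $run synthesize i -n "$name" -s "${pair%%:*}" "${pair#*:}" || return 1
    done
    $run synthesize r -n "$name"
}

# Print the command sequence without touching the database.
DRY_RUN=1 run_db_synthesis syn Synthesizer.xml
```

Calling the function without DRY_RUN=1 executes the same six commands in order and stops at the first one that fails.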
    

Duplicate structure filtering can be switched on with the -q option for molecule import (command i) and for the synthesis process (command r). In this case the JChem structure table will contain unique structures, but you will still see duplicates in the Synthesis Browser, because these duplicates may have been created through different synthesis paths. (In our case, duplicates with the same synthesis path are filtered by the Unique synthesis option set in the <Params> section of our configuration XML Synthesizer.xml.) Note that duplicate structure filtering is time-consuming and makes the synthesis process much slower.

You can export molecules to a file with command e:

synthesize e -n syn -s RESULT -f sdf -o result_db.sdf

The RESULT set is exported to result_db.sdf with origin codes stored in the field_0 SDF tag. The RESULT molecules are shown below:

the RESULT molecules

Now that you have synthesized molecules in the database, you can use the Synthesis Browser to see them.

For comparison, you can run the synthesis without the synthesis rules by using the configuration XML Synthesizer_norule.xml:

synthesize c -n syn_norule -c Synthesizer_norule.xml
synthesize i -n syn_norule -s scaffold scaffold.smiles
synthesize i -n syn_norule -s alkyne alkynes.smiles
synthesize i -n syn_norule -s amine amines.smiles
synthesize i -n syn_norule -s carboxylic-acid carboxylic-acids.smiles
synthesize r -n syn_norule

Export the RESULT set to result_db_norule.sdf by:

synthesize e -n syn_norule -s RESULT -f sdf -o result_db_norule.sdf

Some sample molecules from the RESULT set that were previously excluded by the synthesis rules are shown below:

Sample molecules from the RESULT set when no synthesis rules are applied
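
To see how much the synthesis rules reduce the RESULT set, the two export files can be compared by record count. This sketch relies on the standard SDF record terminator line ("$$$$"); count_sdf_records is a hypothetical helper name.

```shell
# count_sdf_records FILE
# Count SDF records by counting the "$$$$" record terminator lines.
count_sdf_records() {
    grep -c '^\$\$\$\$' "$1" || true    # grep exits with 1 when the count is 0
}

# Run only if both export files from the commands above are present.
if [ -f result_db.sdf ] && [ -f result_db_norule.sdf ]; then
    with_rule=$(count_sdf_records result_db.sdf)
    without_rule=$(count_sdf_records result_db_norule.sdf)
    echo "RESULT with rules:    $with_rule"
    echo "RESULT without rules: $without_rule"
    echo "difference:           $((without_rule - with_rule))"
fi
```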

 

Exhaustive algorithm example

This example synthesis is an application of the exhaustive algorithm. The example shows a virtual emulation of the aerobic bacterial biodegradation of phenol. Two alternative pathways of the oxygenolytic ring cleavage reactions of catechol are catalyzed by specific dioxygenases. Both pathways may be present in one bacterial species. Refer to [3] for a detailed description of this mechanism.

The same synthesis will be performed in memory mode, file mode and database mode.

To run the examples, step into the exhaustive subdirectory:

cd exhaustive

The synthesis configuration is stored in Synthesizer.xml. Note that the synthesis graph consists of only one set: we want to generate the products along all reaction sequences, so intermediate products must be taken as reactants in all reactions in the same way as input molecules.
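
The idea of a single set whose products are fed back as reactants can be pictured as a fixed-point iteration. The toy sketch below is not Synthesizer's implementation and performs no chemistry: the react function is a made-up lookup table, and the product labels are placeholders for the catechol intermediate and the two alternative ring-cleavage pathways mentioned above.

```shell
# react MOLECULE
# Toy stand-in for the reactions: print the products of every rule that applies.
react() {
    case $1 in
        phenol)   echo catechol ;;
        catechol) echo ortho-cleavage-product
                  echo meta-cleavage-product ;;
    esac
}

# exhaustive START
# Keep reacting every member of the single set until no new member appears.
exhaustive() {
    set_members=$1
    while :; do
        next=$set_members
        for m in $set_members; do
            for p in $(react "$m"); do
                case " $next " in
                    *" $p "*) ;;                  # already in the set
                    *) next="$next $p" ;;         # new product joins the same set
                esac
            done
        done
        [ "$next" = "$set_members" ] && break     # fixed point reached
        set_members=$next
    done
    echo "$set_members"
}

exhaustive phenol
```

The loop stops in the pass that adds nothing new, which is why intermediates such as the catechol placeholder end up in the same set as the input molecule.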

Our input molecule (phenol) is shown below:

input molecule (phenol)

The synthesis consists of the following synthesis steps:

 

Memory mode

In memory mode Synthesizer uses memory-based molecule sets, which means that all molecules are stored in memory. Synthesis is fastest with this storage, but it has a serious limitation: only a few thousand molecules can be held in memory at a time (5,000-6,000 by default; the limit can be raised by setting the -Xmx option of the Java VM). This is not sufficient in most cases.

Run Synthesizer in memory mode by the m synthesis command:

synthesize m -s S1 phenol.smiles -c Synthesizer.xml -o metabolites.smiles

The generated metabolites are shown below:

metabolites

 

File mode

In file mode Synthesizer uses file-based molecule sets, which means that each molecule set is stored in a separate file. The file format is specified with the -f command line option (default: SMILES). The molecule set files are placed in a newly created subdirectory of the current directory, named after the synthesis name given in the mandatory -n command line option. Each file is named after its molecule set ID.

Compared to memory mode, file mode is slower but does not have the serious limitation on the number of molecules. Molecule set files are kept after the synthesis has finished.

Run Synthesizer in file mode by the f synthesis command:

synthesize f -n biodegradation -s S1 phenol.smiles -c Synthesizer.xml -o metabolites.smiles

The result is the same set of metabolites as in memory mode, but the molecules are also stored, together with their origin codes, in the molecule set file S1 in the biodegradation subdirectory. If a biodegradation subdirectory already exists (e.g. because you have previously run the synthesis in file mode), a new subdirectory name is generated by appending a random number to biodegradation.

 

Database mode

In database mode Synthesizer stores molecule sets in a database. Note that you need a properly configured database connection to run Synthesizer in database mode. The main advantage of database mode is that you can browse the molecules in the Synthesis Browser, view the structures colored according to origin codes, or view the corresponding synthesis path. Molecules are stored in a regular JChem structure table, while synthesis data (origin code, synthesis set ID, synthesis path) are stored in separate custom tables.

In database mode we use separate commands for creating a synthesis (command c), importing molecules into synthesis sets (command i) and running the synthesis (command r). The Synthesizer Manual contains the complete list of available commands. You can see the complete list of command-specific options with:

synthesize <command> -h

For example, type

synthesize i -h

to display import options.

  1. First create the synthesis by command c: specify the synthesis name in the -n option and the synthesis configuration file in the -c option:
    synthesize c -n biodegradation -c Synthesizer.xml
    

    The synthesis name identifies the synthesis in the Synthesis Browser and also serves as the basis for the synthesis table names.

    If you already have a synthesis with name biodegradation then either choose a different synthesis name, or else delete the synthesis by command d before creating it as shown above:

    synthesize d -n biodegradation
    synthesize c -n biodegradation -c Synthesizer.xml
    

  2. Then import the input molecule into set S1 by command i:
    synthesize i -n biodegradation -s S1 phenol.smiles
    

  3. Finally, run the synthesis by command r:
    synthesize r -n biodegradation
    

  4. You can export molecules to file by command e:
    synthesize e -n biodegradation -s S1 -o metabolites.smiles
    

Now that you have synthesized molecules in the database, you can use the Synthesis Browser to see them.