GenerFP

Version 5.9.4

Notice: This program is obsolete. Please use GenerateMD instead.

GenerFP is a Java program for generating binary fingerprints.


What are chemical hashed fingerprints?

A chemical hashed fingerprint is a bit string generated from the structure of a molecule. The fingerprints of similar molecules differ in fewer bits than the fingerprints of dissimilar molecules. GenerFP uses a hashing algorithm to generate fingerprints from molecules; these fingerprints can be used for
  • quantifying the similarity of molecules
  • predicting the diversity of a set of compounds
  • searching for substructures in molecules.
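
As an illustration of the first use, the sketch below counts the bits in which two fingerprints differ and also computes the Tanimoto coefficient, a commonly used similarity measure. The class FingerprintCompare and its methods are hypothetical helpers written for this example; they are not part of GenerFP or JChem, and the input is assumed to be fingerprints in the "ones and zeros" text form without separators.

```java
import java.util.BitSet;

/** Illustrative helper (not part of JChem): compares fingerprints given as strings of ones and zeros. */
final class FingerprintCompare {

    /** Parses a string such as "1010000000"; bit index 0 is the leftmost character. */
    static BitSet parse(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        return set;
    }

    /** Number of positions at which the two fingerprints differ (Hamming distance). */
    static int differingBits(BitSet a, BitSet b) {
        BitSet xor = (BitSet) a.clone();
        xor.xor(b);
        return xor.cardinality();
    }

    /** Tanimoto coefficient: 1 bits common to both, divided by 1 bits set in either. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        // Two empty fingerprints are treated as identical (an assumption of this sketch).
        return either.isEmpty() ? 1.0 : (double) common.cardinality() / either.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = parse("1111011011");
        BitSet b = parse("1111010011");
        System.out.println(differingBits(a, b)); // prints 1
        System.out.println(tanimoto(a, b));      // prints 0.875
    }
}
```

A small number of differing bits (or a Tanimoto value close to 1) suggests similar structures, although hashed fingerprints can collide, so this is an indication rather than a proof.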
 

Requirement

The GenerFP program is written in Java. Running GenerFP requires a Java Virtual Machine, version 1.1.6 or later.

 

How to use GenerFP?

Usage

    generfp [<options>] <inputfile >outputfile

Prepare the usage of the generfp script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.

Or call the GenerFP class directly:

Win32 / Java 2 (assuming that JChem is installed in c:\jchem):

    java -cp c:\jchem\lib\jchem.jar chemaxon.sss.screen.GenerFP [<options>] <inputfile >outputfile

Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):

    java -cp /usr/local/jchem/lib/jchem.jar chemaxon.sss.screen.GenerFP [<options>] <inputfile >outputfile

Options

  -h               display this help and exit
  -fl <length>     fingerprint length in bytes (default: 64)
  -pl <length>     maximum number of bonds in patterns (default: 5)
  -bc <bitcount>   number of bits to switch on for each pattern
                    in the structure (default: 2)
  -f<format>       format of the output
     -fb              binary
     -f1              ones and zeros  (001011011...) (default)
     -fd              bytes in decimal format
     -fh              bytes in hexadecimal format
     -fi              integers in decimal format

  -stat            generate statistics

  -s <separator>   separator in the output in case of text output
      Separators:
            n no separator
            c comma (default)
            t tab
            s space

Input file formats

 

How does it work?

For each atom of the molecule, the following algorithm is performed:
  1. Find all paths that start at the given atom and can be reached through 0, 1, 2, ..., <length> bonds in the structure.
    The -pl <length> option sets the maximum path length.

  2. Generate bit sets for the paths.
    For each of the previously generated paths, a bit array of fixed length is generated with a fixed number of bits turned on.
    The -fl <length> option sets the fixed length of the bit array.
    The -bc <bit count> option sets the number of bits to turn on for each path.

  3. Add the bit sets to the fingerprint using a logical OR operation.
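
The steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not GenerFP's actual implementation: the class PathFingerprintSketch, the molecule representation (atom symbols plus neighbor index lists), and in particular the hash (a java.util.Random seeded with the path string's hashCode) are all hypothetical, and bond types are ignored. The bit count must not exceed the fingerprint length in bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of a path-based hashed fingerprint; not the GenerFP source. */
final class PathFingerprintSketch {

    /** Step 1: collects every path of 0..maxBonds bonds starting at atom 'start', e.g. "C-C-O". */
    static void collectPaths(String[] atoms, int[][] neighbors, int start,
                             int maxBonds, List<String> out) {
        walk(atoms, neighbors, start, maxBonds, new boolean[atoms.length], atoms[start], out);
    }

    private static void walk(String[] atoms, int[][] neighbors, int atom, int bondsLeft,
                             boolean[] visited, String path, List<String> out) {
        out.add(path);
        if (bondsLeft == 0) {
            return;
        }
        visited[atom] = true; // a path never visits the same atom twice
        for (int next : neighbors[atom]) {
            if (!visited[next]) {
                walk(atoms, neighbors, next, bondsLeft - 1, visited,
                     path + "-" + atoms[next], out);
            }
        }
        visited[atom] = false;
    }

    /** Step 2: maps a path to a bit array with exactly bitCount bits on (assumed hash). */
    static BitSet pathBits(String path, int lengthInBits, int bitCount) {
        BitSet bits = new BitSet(lengthInBits);
        Random random = new Random(path.hashCode()); // same path always gives the same bits
        while (bits.cardinality() < bitCount) {
            bits.set(random.nextInt(lengthInBits));
        }
        return bits;
    }

    /** Step 3: ORs the bit arrays of all paths from all atoms into one fingerprint. */
    static BitSet fingerprint(String[] atoms, int[][] neighbors, int maxBonds,
                              int lengthInBits, int bitCount) {
        BitSet fp = new BitSet(lengthInBits);
        for (int atom = 0; atom < atoms.length; atom++) {
            List<String> paths = new ArrayList<>();
            collectPaths(atoms, neighbors, atom, maxBonds, paths);
            for (String path : paths) {
                fp.or(pathBits(path, lengthInBits, bitCount));
            }
        }
        return fp;
    }

    public static void main(String[] args) {
        // Ethanol with explicit hydrogens: C0, C1, O2, then H3..H8.
        String[] atoms = {"C", "C", "O", "H", "H", "H", "H", "H", "H"};
        int[][] neighbors = {{1, 3, 4, 5}, {0, 2, 6, 7}, {1, 8}, {0}, {0}, {0}, {1}, {1}, {2}};
        // 64 bytes = 512 bits, 5 bonds, 2 bits per path: the documented defaults.
        System.out.println(fingerprint(atoms, neighbors, 5, 64 * 8, 2));
    }
}
```

In this sketch the parameters maxBonds, lengthInBits and bitCount play the roles of the -pl, -fl (converted from bytes to bits) and -bc options.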

Generating a fingerprint for CH3-CH2-OH

  • Find the paths of up to 3 bonds.

    For the 1st C atom:
    0 bonds: C
    1 bond:  C-H, C-C
    2 bonds: C-C-H, C-C-O
    3 bonds: C-C-O-H

    And so on for all the other atoms.

  • Generate a bit array of length 10 with 2 bits turned on for each path:

    path      bit array
    C         1010000000
    C-H       0001010000
    C-C-H     0001000010
    C-C-O     0100010000
    C-C-O-H   0000001001

    And so on for all the other paths.

  • Add these bit arrays to the fingerprint:

    0000000000   initial fingerprint
    1010000000   1st path
    0001010000   2nd path
    0001000010   3rd path
    0100010000   4th path
    0000001001   5th path
    1111011011   partial result

    Continue with all the other bit arrays.

    The partial result is 1111011011.
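
The OR step of this example can be checked with a few lines of Java. The class OrExample is a hypothetical helper for this illustration only; it treats each bit array as a binary number, which is adequate for short arrays like these (up to 31 bits) but not for full-length fingerprints.

```java
/** Illustrative check of the OR step in the ethanol example (not part of GenerFP). */
final class OrExample {

    /** ORs bit arrays written as strings of ones and zeros; all must have the same length. */
    static String combine(String... bitArrays) {
        int width = bitArrays[0].length();
        int fingerprint = 0; // initial fingerprint: all zeros
        for (String bits : bitArrays) {
            fingerprint |= Integer.parseInt(bits, 2);
        }
        // Left-pad with zeros so the result has the same width as the inputs.
        StringBuilder result = new StringBuilder(Integer.toBinaryString(fingerprint));
        while (result.length() < width) {
            result.insert(0, '0');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(combine(
                "1010000000",   // C
                "0001010000",   // C-H
                "0001000010",   // C-C-H
                "0100010000",   // C-C-O
                "0000001001")); // C-C-O-H
        // prints 1111011011
    }
}
```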

 

Examples

The following examples work under Unix.
  • generfp.sh -h

    Shows the help screen.

  • generfp.sh <input.mol >out.txt

    Generates the fingerprint for the input molfile with the default settings:

    Fingerprint length: 64 bytes
    Number of bits turned on for each pattern: 2
    Maximum number of bonds used to generate patterns: 5
    Output format: ones and zeros
    The result, out.txt, is a text file.
    out.txt:
    0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,... (64*8 numbers)

  • generfp.sh -fl 64 -bc 1 <input.sdf >out.txt

    Generates a 64-byte fingerprint for each molecule in the input file, using patterns that contain at most 5 bonds. The hashing procedure turns on 1 bit for each pattern.

    out.txt:
    1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,... (64*8 numbers)

  • generfp.sh -fh -s s <input.mol >out.txt

    Generates the fingerprint for the molfile with the default settings. The result is written to out.txt in hexadecimal format, separated by spaces.

    out.txt:
    8 0 6 0 8 a 0 4 6 8 0 0 0 0 1 0 0 0 ... (64*8 char)

  • generfp.sh -stat <molecules.sdf

    Generates statistics for the input structure file, indicating the

    • average number of 1 bits in the fingerprints
    • maximum number of 1 bits and the ID number of the molecule at the maximum
    • minimum number of 1 bits and the ID number of the molecule at the minimum
    • density function

    output:

    Number of bits(1) set on:
    Average: 7.12%
    Maximum: 51.89% (at molecule 1206)
    Minimum: 0.20% (at molecule 1201)
    Density function:
         0%-10%   79.53%
        10%-20%   16.80%
        20%-30%    2.58%
        30%-40%    0.47%
        40%-50%    0.57%
        50%-60%    0.05%
        60%-70%    0.00%
        70%-80%    0.00%
        80%-90%    0.00%
        90%-100%   0.00%
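
To make the statistics easier to interpret, the sketch below shows one plausible way such figures could be derived from a set of fingerprints. It is an assumption based on the sample output, not GenerFP's documented method: the class FingerprintStats is hypothetical, the percentage is taken as 1 bits divided by fingerprint length, the density function is read as the share of molecules falling into each 10% bin, and a value that lies exactly on a bin boundary is placed in the upper bin.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical reconstruction of the -stat figures; not the GenerFP source. */
final class FingerprintStats {

    /** Percentage of 1 bits in a fingerprint of the given length. */
    static double percentSet(BitSet fingerprint, int lengthInBits) {
        return 100.0 * fingerprint.cardinality() / lengthInBits;
    }

    /** Mean percentage of 1 bits over all fingerprints. */
    static double average(List<BitSet> fingerprints, int lengthInBits) {
        double sum = 0.0;
        for (BitSet fp : fingerprints) {
            sum += percentSet(fp, lengthInBits);
        }
        return sum / fingerprints.size();
    }

    /** Share (in percent) of fingerprints whose 1-bit percentage falls into each 10% bin. */
    static double[] density(List<BitSet> fingerprints, int lengthInBits) {
        double[] bins = new double[10];
        for (BitSet fp : fingerprints) {
            int bin = Math.min(9, (int) (percentSet(fp, lengthInBits) / 10.0));
            bins[bin] += 100.0 / fingerprints.size();
        }
        return bins;
    }

    public static void main(String[] args) {
        int lengthInBits = 64 * 8; // default fingerprint length of 64 bytes
        BitSet sparse = new BitSet(lengthInBits);
        sparse.set(0);             // 1 of 512 bits set
        BitSet dense = new BitSet(lengthInBits);
        dense.set(0, 266);         // 266 of 512 bits set
        List<BitSet> fingerprints = new ArrayList<>();
        fingerprints.add(sparse);
        fingerprints.add(dense);
        System.out.println(average(fingerprints, lengthInBits));
        double[] bins = density(fingerprints, lengthInBits);
        for (int i = 0; i < bins.length; i++) {
            System.out.println((i * 10) + "%-" + ((i + 1) * 10) + "%  " + bins[i]);
        }
    }
}
```

A distribution concentrated in the lowest bins, as in the sample output, indicates sparse fingerprints; if most molecules fell into the upper bins, a longer fingerprint (-fl) or a lower bit count (-bc) might be worth considering.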
    
