GenerFP
Version 5.9.4
Notice: This program is obsolete. Please use GenerateMD instead.
GenerFP is a Java program for generating binary fingerprints.
What are chemical hashed fingerprints?
A chemical hashed fingerprint is a bit string generated from the structure of a molecule. The fingerprints of similar molecules differ in fewer bits than the fingerprints of dissimilar molecules. GenerFP uses a hashing algorithm to generate fingerprints from molecules. These fingerprints can be used for
- quantifying the similarity of molecules
- predicting the diversity of a set of compounds
- searching substructures in molecules.
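The "number of different bits" idea can be made concrete with a short sketch. The code below is not part of GenerFP; it only illustrates how two fingerprints written as strings of ones and zeros could be compared. The Tanimoto coefficient is a commonly used similarity measure for bit fingerprints and is added here as an assumption, since this page does not name a specific measure. The second fingerprint is a made-up value used purely for illustration.

```python
def hamming_distance(fp_a: str, fp_b: str) -> int:
    """Count the positions at which two equal-length bit strings differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(fp_a, fp_b))


def tanimoto(fp_a: str, fp_b: str) -> float:
    """Bits set in both fingerprints divided by bits set in either one."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(fp_a, fp_b))
    either = sum(a == "1" or b == "1" for a, b in zip(fp_a, fp_b))
    # Two all-zero fingerprints are treated as identical (a convention chosen here).
    return both / either if either else 1.0


fp_a = "1111011011"  # 10-bit fingerprint of the kind shown later on this page
fp_b = "1111010010"  # hypothetical second fingerprint, for illustration only

print(hamming_distance(fp_a, fp_b))  # 2
print(tanimoto(fp_a, fp_b))          # 0.75
```

A lower Hamming distance (or a Tanimoto value closer to 1) indicates more similar fingerprints.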
Requirement
The GenerFP program is written in Java. Running GenerFP requires Java Virtual Machine 1.1.6 or later.
How to use GenerFP?
Usage
generfp [<options>] <inputfile >outputfile
Prepare the usage of the generfp script or batch file as described in Preparing the Usage of JChem Batch Files and Shell Scripts.
Or call the GenerFP class directly:
Win32 / Java 2 (assuming that JChem is installed in c:\jchem):
java -cp c:\jchem\lib\jchem.jar chemaxon.sss.screen.GenerFP [<options>] <inputfile >outputfile
Unix / Java 2 (assuming that JChem is installed in /usr/local/jchem):
java -cp /usr/local/jchem/lib/jchem.jar chemaxon.sss.screen.GenerFP [<options>] <inputfile >outputfile
Options
-h display this help and exit
-fl <length> fingerprint length in bytes (default: 64)
-pl <length> maximum number of bonds in patterns (default: 5)
-bc <bitcount> number of bits to switch on for each pattern in the structure (default: 2)
-f<format> format of the output
-fb binary
-f1 ones and zeros (001011011...) (default)
-fd bytes in decimal format
-fh bytes in hexadecimal format
-fi integers in decimal format
-stat generate statistics
-s <separator> separator in the output in case of text output
Separators:
n no separator
c comma (default)
t tab
s space
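When GenerFP is called from another program, the options above can be assembled into an argument list. The following is only a sketch under assumptions: the helper names build_generfp_args and run_generfp are hypothetical, and the script is assumed to be reachable on the PATH under the name generfp. Only options listed on this page are used.

```python
import subprocess

FORMATS = {"b", "1", "d", "h", "i"}   # suffixes of the -f<format> option
SEPARATORS = {"n", "c", "t", "s"}     # values accepted by the -s option


def build_generfp_args(fp_length=64, max_bonds=5, bit_count=2,
                       fmt="1", separator="c", executable="generfp"):
    """Assemble a generfp argument list (hypothetical helper, not part of JChem)."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown output format: {fmt}")
    if separator not in SEPARATORS:
        raise ValueError(f"unknown separator: {separator}")
    args = [executable,
            "-fl", str(fp_length),
            "-pl", str(max_bonds),
            "-bc", str(bit_count),
            f"-f{fmt}"]
    if fmt != "b":
        # The separator only applies to text output.
        args += ["-s", separator]
    return args


def run_generfp(input_path, output_path, **options):
    """Feed the input file on stdin and write stdout to the output file."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        subprocess.run(build_generfp_args(**options),
                       stdin=src, stdout=dst, check=True)


print(build_generfp_args(fmt="h", separator="s"))
```

Calling run_generfp("input.mol", "out.txt", fmt="h", separator="s") would correspond to generfp -fl 64 -pl 5 -bc 2 -fh -s s <input.mol >out.txt, provided the script is installed as assumed.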
Input file formats
How does it work?
For each atom of the molecule, the following algorithm is performed:
- Find all paths starting from the given atom that can be reached through 0, 1, 2, ..., <length> bonds in the structure.
- Generate a bit set for each path.
- Add the bit sets to the fingerprint using the logical OR operation.
The -pl <length> option sets the maximum length of a path.
For each of the previously generated paths, a bit array is generated with fixed length and fixed number of bits turned on.
The -fl <length> option sets the fixed length of the bit array.
The -bc <bitcount> option sets the number of bits to turn on for each path.
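The three steps above can be sketched in a few lines of Python. This is an illustration only, not GenerFP's implementation: the molecule representation (a list of atom symbols plus an adjacency table), the function names, and the MD5-based hash are all assumptions made for the sketch, and bond types are ignored. The parameters fp_bytes, max_bonds and bit_count play the roles of -fl, -pl and -bc.

```python
import hashlib

# Ethanol (CH3-CH2-OH) with explicit hydrogens; indices are arbitrary.
ATOMS = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
BONDS = {0: [1, 3, 4, 5], 1: [0, 2, 6, 7], 2: [1, 8],
         3: [0], 4: [0], 5: [0], 6: [1], 7: [1], 8: [2]}


def enumerate_paths(atoms, bonds, start, max_bonds):
    """Return symbol strings of all simple paths from `start` with 0..max_bonds bonds."""
    paths = []

    def walk(path):
        paths.append("-".join(atoms[i] for i in path))
        if len(path) - 1 == max_bonds:
            return
        for neighbor in bonds[path[-1]]:
            if neighbor not in path:
                walk(path + [neighbor])

    walk([start])
    return paths


def path_bits(path, n_bits, bit_count):
    """Map a path to `bit_count` distinct bit positions (stand-in hash, an assumption)."""
    if bit_count > n_bits:
        raise ValueError("bit_count cannot exceed the fingerprint length in bits")
    positions = set()
    counter = 0
    while len(positions) < bit_count:
        digest = hashlib.md5(f"{path}#{counter}".encode()).digest()
        positions.add(int.from_bytes(digest[:4], "big") % n_bits)
        counter += 1
    return positions


def fingerprint(atoms, bonds, fp_bytes=64, max_bonds=5, bit_count=2):
    """OR the bit sets of every path of every atom into one bit string."""
    n_bits = fp_bytes * 8
    bits = ["0"] * n_bits
    for start in range(len(atoms)):
        for path in enumerate_paths(atoms, bonds, start, max_bonds):
            for position in path_bits(path, n_bits, bit_count):
                bits[position] = "1"  # setting a bit is the OR step
    return "".join(bits)


print(sorted(set(enumerate_paths(ATOMS, BONDS, 0, 3))))
print(fingerprint(ATOMS, BONDS, max_bonds=3).count("1"), "bits set")
```

Because different paths can hash to the same positions, the number of bits set is at most the number of distinct paths times bit_count. The bit patterns produced by this sketch will not match GenerFP's output, since the real hash function is not described here.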
Generating fingerprint for CH3-CH2-OH
- Find the paths up to 3 bonds:
For the 1st C atom:
| 0 bonds | C |
| 1 bond | C-H, C-C |
| 2 bonds | C-C-H, C-C-O |
| 3 bonds | C-C-O-H |
And for all the other atoms...
- Generate a bit array with length 10 and 2 bits turned on:
| path | bit array |
| C | 1010000000 |
| C-H | 0001010000 |
| C-C-H | 0001000010 |
| C-C-O | 0100010000 |
| C-C-O-H | 0000001001 |
And for all the other paths...
- Add these bit arrays to the fingerprint:
| 0000000000 | initial fingerprint |
| 1010000000 | 1st path |
| 0001010000 | 2nd path |
| 0001000010 | 3rd path |
| 0100010000 | 4th path |
| 0000001001 | 5th path |
| 1111011011 | partial result |
Continue for all the other bit arrays...
The partial result is: 1111011011.
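The OR step of the worked example above can be checked with a few lines of Python. The bit arrays are copied from the example; the helper name or_bit_strings is hypothetical.

```python
# Bit arrays of the five paths listed in the example above.
PATH_BITS = {
    "C": "1010000000",
    "C-H": "0001010000",
    "C-C-H": "0001000010",
    "C-C-O": "0100010000",
    "C-C-O-H": "0000001001",
}


def or_bit_strings(bit_strings, length=10):
    """Combine bit strings with logical OR, starting from an all-zero fingerprint."""
    fingerprint = 0
    for bits in bit_strings:
        fingerprint |= int(bits, 2)
    return format(fingerprint, f"0{length}b")


partial = or_bit_strings(PATH_BITS.values())
print(partial)  # 1111011011
```

The result is only partial because the remaining paths of the molecule would be OR-ed in the same way.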
Examples
The following examples work under Unix.

generfp.sh -h
Shows the help screen.

generfp.sh <input.mol >out.txt
Generates the fingerprint for the input molfile with the default settings:
Fingerprint length: 64 bytes
Number of bits turned on for each pattern: 2
Maximum number of bonds used to generate patterns: 5
Output format: ones and zeros
The result, out.txt, is a text file.
out.txt:
0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,...
(64*8 numbers)

generfp.sh -fl 64 -bc 1 <input.sdf >out.txt
Generates a 64-byte fingerprint for each molecule in the input file, using patterns that contain at most 5 bonds. 1 bit is turned on for each pattern in the hashing procedure.
out.txt:
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...
(64*8 numbers)

generfp.sh -fh -s s <input.mol >out.txt
Generates the fingerprint for the molfile with the default settings. The result is written to the out.txt file in hexadecimal format, separated by spaces.
out.txt:
8 0 6 0 8 a 0 4 6 8 0 0 0 0 1 0 0 0 ...
(64*8 char)

generfp.sh -stat <molecules.sdf
Generates statistics for the input structure file, indicating the
- Average number of 1 bits (the average number of 1 bits in the fingerprint)
- Maximum number of 1 bits and the id. number of the molecule at the maximum
- Minimum number of 1 bits and the id. number of the molecule at the minimum
- Density function
output:
Number of bits(1) set on:
Average: 7.12%
Maximum: 51.89% (at molecule 1206)
Minimum: 0.20% (at molecule 1201)
Density function:
0%-10% 79.53%
10%-20% 16.80%
20%-30% 2.58%
30%-40% 0.47%
40%-50% 0.57%
50%-60% 0.05%
60%-70% 0.00%
70%-80% 0.00%
80%-90% 0.00%
90%-100% 0.00%
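Statistics of the kind reported by -stat can also be derived from the ones-and-zeros text output. The sketch below is not GenerFP's own code and rests on assumptions: one fingerprint per line, comma separators, molecules numbered from 1 in file order, and a value on a bin boundary counted in the upper bin. The three short sample lines are made up for illustration.

```python
def bit_density(line, separator=","):
    """Percentage of 1 bits in one line of ones-and-zeros output."""
    bits = [token for token in line.strip().split(separator) if token]
    return 100.0 * bits.count("1") / len(bits)


def summarize(lines):
    """Average, maximum, minimum and a 10-bin density function of bit densities."""
    densities = [bit_density(line) for line in lines if line.strip()]
    indices = range(len(densities))
    max_index = max(indices, key=densities.__getitem__)
    min_index = min(indices, key=densities.__getitem__)
    histogram = [0] * 10
    for density in densities:
        # 100% is folded into the last bin (90%-100%).
        histogram[min(int(density // 10), 9)] += 1
    return {
        "average": sum(densities) / len(densities),
        "maximum": (densities[max_index], max_index + 1),  # (value, molecule number)
        "minimum": (densities[min_index], min_index + 1),
        "density": [100.0 * count / len(densities) for count in histogram],
    }


# Hypothetical 10-bit fingerprints, one per line.
sample = [
    "0,1,0,0,0,0,1,0,0,0",
    "1,0,0,0,0,0,0,0,0,0",
    "1,1,1,1,1,1,0,0,0,0",
]
stats = summarize(sample)
print(f"Average: {stats['average']:.2f}%")
print(f"Maximum: {stats['maximum'][0]:.2f}% (at molecule {stats['maximum'][1]})")
print(f"Minimum: {stats['minimum'][0]:.2f}% (at molecule {stats['minimum'][1]})")
```

For real output, the lines of out.txt would be passed to summarize instead of the sample list.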
