Chemical Hashed Fingerprints

Contents

  1. Introduction
  2. Effect of Parameters
  3. Optimizing Parameters For Search Efficiency

1. Introduction

What is a Chemical Hashed Fingerprint?

The chemical hashed fingerprint of a molecule is a bit string (a sequence of "0" and "1" digits) that contains information on the structure. Chemical hashed fingerprints are mostly used in the following areas:

In both applications, a proper configuration of the fingerprint is very important.

ChemAxon provides the GenerateMD program for producing binary fingerprints that can be processed further. This program can also be applied to fine-tune fingerprint parameters for JChem.

The Process of Fingerprint Generation in JChem and GenerateMD

  1. All linear paths (linear patterns) consisting of the bonds and atoms of a structure are detected, up to a given number of bonds.
  2. All cycles (cyclic patterns) are also detected.
  3. Using a proprietary hashing method, a given number of bits in the bit string are set for each pattern. It is possible that the same bit is set by multiple patterns. This phenomenon is called bit collision. A few bit collisions in the fingerprint are tolerable, but too many may result in a loss of information.

The figure below shows this process on an example.

Chemical hashed fingerprint generation.
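
The steps above can be illustrated with a minimal Python sketch. It is not the JChem or GenerateMD implementation: the actual hashing method is proprietary, so SHA-256 stands in for it here, and the function names `linear_paths` and `fingerprint` are hypothetical. For brevity the sketch ignores bond orders and skips cycle detection (step 2); molecules are given as a list of atom symbols plus a list of bonded index pairs.

```python
import hashlib


def linear_paths(atoms, bonds, max_bonds):
    """Enumerate linear paths (tuples of atom indices) with at most max_bonds bonds."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    paths = set()

    def extend(path):
        # Store one direction only, so that A-B-C and C-B-A count as one path.
        paths.add(min(path, path[::-1]))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # a linear path never revisits an atom
                extend(path + (nxt,))

    for start in range(len(atoms)):
        extend((start,))
    return paths


def fingerprint(atoms, bonds, length=512, max_bonds=6, bits_per_pattern=2):
    """Return the fingerprint as an int whose binary digits are the bit string."""
    bits = 0
    for path in linear_paths(atoms, bonds, max_bonds):
        symbols = [atoms[i] for i in path]
        # Direction-independent text label of the pattern, e.g. "C-C-O".
        label = "-".join(min(symbols, symbols[::-1]))
        for k in range(bits_per_pattern):
            # Stand-in hash (assumption): SHA-256 of the label plus a counter.
            digest = hashlib.sha256(f"{label}#{k}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % length)
    return bits


# Heavy-atom graphs of ethanol (C-C-O) and methanol (C-O).
ethanol = fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
methanol = fingerprint(["C", "O"], [(0, 1)])
```

Because every pattern of the C-O fragment also occurs in C-C-O, every bit set in `methanol` is also set in `ethanol`. Two different patterns that happen to hash to the same position would produce a bit collision.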

Terms Used in This Document

Fingerprint length
The number of bits in the bit string.
Maximum pattern length
The maximum length of atoms in the linear paths that are considered during the fragmentation of the molecule. (The length of cyclic patterns is limited to a fixed ring size.)
Bits to be set for patterns
After detecting a pattern, some bits of the bit string are set to "1". The number of bits used to code patterns is constant.
Darkness of the fingerprint
The percentage of "1" digits in the bit string. We consider fingerprints with more ones "darker" than those with fewer ones.
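
As a small illustration of the darkness definition, the sketch below computes it for a fingerprint stored as a Python integer. The function name `darkness` and the integer representation are assumptions made for this example only.

```python
def darkness(bits: int, length: int) -> float:
    """Return the percentage of "1" digits in a fingerprint of `length` bits."""
    return 100.0 * bin(bits).count("1") / length


# An 8-bit fingerprint 00001011 has three bits set out of eight.
example = darkness(0b00001011, 8)
```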

2. Effect of Parameters

Effect of Increasing the Fingerprint Length

It increases

In addition, it decreases fingerprint darkness and therefore the probability of bit collisions, which is beneficial.

An overly long bit string may decrease the efficiency of information storage. We found that a length of 512 bits (64 bytes) worked well for both small and very large databases. In similarity calculations, however, longer fingerprints perform better in terms of selectivity (that is, distinguishing similar but not identical compounds). This is important in similarity-based virtual screening as well as in similarity-based clustering. In such applications 1024 bits usually provide better results.
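
The link between fingerprint length and darkness can be estimated with a simple model. The sketch below assumes that each pattern sets its bits at uniformly random, independent positions, which idealizes any real hashing method. The function name `expected_darkness` and the pattern count of 150 are hypothetical values chosen for illustration.

```python
def expected_darkness(patterns: int, bits_per_pattern: int, length: int) -> float:
    """Expected darkness (percent) if every set bit lands on a random position."""
    attempts = patterns * bits_per_pattern
    # Probability that one given bit is still 0 after all attempts.
    still_zero = (1.0 - 1.0 / length) ** attempts
    return 100.0 * (1.0 - still_zero)


# Same hypothetical molecule (150 patterns, 2 bits each) at two lengths.
at_512 = expected_darkness(150, 2, 512)
at_1024 = expected_darkness(150, 2, 1024)
```

Under this model, doubling the length from 512 to 1024 bits lowers the expected darkness from roughly 44% to roughly 25%, while doubling the storage space per fingerprint.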

Effect of Increasing Maximum Pattern Length

Substructure searching performs well with patterns of length 5-6. In similarity searching, however, longer patterns may be required: 7 is usually a good value, and paths longer than 8 seldom improve results. Also bear in mind that longer paths necessitate longer fingerprints to avoid overly dark fingerprints.

Effect of Increasing the Number of Bits to Be Set for Patterns

The typical value is 1 or 2. Database pre-filtering for substructure searching does not require more than 1 bit per pattern, which allows shorter fingerprints; this is usually beneficial in terms of both storage space and retrieval time. Higher values could separate similar but not identical compounds better, and so lead to fewer calls of atom-by-atom matching in substructure searching, but only at the expense of doubled storage space, slower retrieval, and more time-consuming fingerprint comparison, which is performed far more frequently than atom-by-atom searching.
Again, the situation is somewhat different in similarity searching. Even so, values higher than 2 rarely increase the amount of information represented by the fingerprint significantly (the 3rd, 4th, etc. bits are more strongly correlated with the first two, while the 1st and 2nd are highly independent).

3. Optimizing Parameters For Search Efficiency

To choose optimal parameters for your compounds, we recommend running GenerateMD with the --stat option, or using JChem table or index statistics. (For more information, see the JChem Manager command line usage (s command), the Cartridge index statistics function, and the Statistics tab of the Instant JChem Schema editor.) These tools provide practical information on the database (average/minimum/maximum "darkness", distribution, etc.).

Maximum darkness should not be higher than 80% (other sources and users suggest 2/3, i.e. 67%). Otherwise, the information content of the individual fingerprint decreases, so that in similarity searching, for instance, similar but not identical compounds cannot be distinguished. Even a few overly dark fingerprints also decrease screening efficiency in structure searching: atom-by-atom search is performed unnecessarily often on the records with dark fingerprints, even when those target structures do not contain the given query structure.
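
Why dark fingerprints weaken screening can be shown with a short sketch. It assumes the usual screening rule for hashed fingerprints: a target can contain the query only if every bit set in the query fingerprint is also set in the target fingerprint. The function name `may_contain` and the 8-bit example values are hypothetical, and this is not the JChem search code.

```python
def may_contain(query_fp: int, target_fp: int) -> bool:
    """Screening test: True means the target still needs atom-by-atom matching."""
    return (query_fp & target_fp) == query_fp


query = 0b00100101
sparse_target = 0b01000101   # lacks one query bit, so it is screened out
dark_target = 0b11111111     # 100% dark: passes the screen for any query

screened_out = not may_contain(query, sparse_target)
dark_passes = may_contain(query, dark_target)
```

The darker a target fingerprint is, the more queries it passes regardless of its actual structure, so the more expensive atom-by-atom search has to reject it instead.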

The average darkness depends heavily on the application and the particular data set (e.g. the overall diversity of the set strongly influences fingerprint darkness). In theory the information content is optimal at an average darkness of 50%, though in general, darkness should not exceed 40% to be on the safe side (to avoid frequent collisions).
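
The 50% optimum can be checked with the Shannon entropy of a single bit. The sketch below assumes that the bits are independent and that each is set with a probability equal to the darkness; the function name `bit_entropy` is hypothetical.

```python
import math


def bit_entropy(darkness_percent: float) -> float:
    """Information content (in bits) of one fingerprint bit at a given darkness."""
    p = darkness_percent / 100.0
    if p in (0.0, 1.0):
        return 0.0  # a bit that is always 0 or always 1 carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


at_50 = bit_entropy(50.0)
at_40 = bit_entropy(40.0)
at_80 = bit_entropy(80.0)
```

Under these assumptions a bit carries 1 bit of information at 50% darkness, still roughly 0.97 at 40%, but only roughly 0.72 at 80%, so staying near 40% costs little while leaving a margin against collisions.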

The following statistics output, generated by the JChem table statistics function, shows an optimal fingerprint configuration:

    Statistics for table: APP.SCREENING_COLLECTION
    --------------------
    Row count: 58850
    NULL SMILES count: 0
    Average SMILES length: 40.09
    Average compressed SMILES length: 20.34
    Markush structure count: 0 (0.0%)

    Fingerprint settings:

    Length (bits): 768
    Pattern length: 6
    Bits set per pattern: 2

    Min. CFP darkness: 4.03% cd_id: 26456
    Max. CFP darkness: 69.66% cd_id: 20757
    Avg. CFP darkness: 33.57%

    Chemical Fingerpint distribution:
    --------------------------------
    0% - 10% : 0.6 %
    10% - 20% : 8.54 %
    20% - 30% : 28.8 %
    30% - 40% : 34.92 %
    40% - 50% : 21.42 %
    50% - 60% : 5.27 %
    60% - 70% : 0.42 %
    70% - 80% : 0.0 %
    80% - 90% : 0.0 %
    90% - 100% : 0.0 %
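
The darkness guidelines given above (maximum at most 80%, average at most 40%) can be checked against such a report with a short script. This is a minimal sketch that assumes the report lines look exactly like the sample output; `parse_darkness` and `check_darkness` are hypothetical helper names, not part of JChem.

```python
import re

# Darkness lines copied from the sample statistics output.
SAMPLE = """\
Min. CFP darkness: 4.03% cd_id: 26456
Max. CFP darkness: 69.66% cd_id: 20757
Avg. CFP darkness: 33.57%
"""


def parse_darkness(report: str) -> dict:
    """Extract the min/max/avg darkness percentages from a statistics report."""
    pattern = re.compile(r"(Min|Max|Avg)\. CFP darkness:\s*([\d.]+)%")
    return {m.group(1).lower(): float(m.group(2)) for m in pattern.finditer(report)}


def check_darkness(report: str, max_limit: float = 80.0, avg_limit: float = 40.0) -> list:
    """Return a list of warnings; an empty list means both guidelines are met."""
    values = parse_darkness(report)
    warnings = []
    if values["max"] > max_limit:
        warnings.append(f"maximum darkness {values['max']}% exceeds {max_limit}%")
    if values["avg"] > avg_limit:
        warnings.append(f"average darkness {values['avg']}% exceeds {avg_limit}%")
    return warnings


warnings = check_darkness(SAMPLE)
```

For the sample report both values are within the limits, so `warnings` is empty.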

The following graphs show the dependence of

on two of the parameters of fingerprint generation.

For these examples, 64-byte (512-bit) bit strings were used.

References

Fingerprints: Screening and Similarity, Daylight Theory Manual
 
Copyright © 1999-2008 ChemAxon Ltd.    All rights reserved.