Posted: Thu Jun 09, 2011 3:58 pm
Post subject: 'sphere exclusion' clustering - the speed of performance
Could you please advise me on some clustering performance issues?
I need to cluster a number of compound libraries using ECFP fingerprints. As I understand it, 'sphere exclusion' clustering is the most suitable and fastest method in JKlustor for this purpose. I started the calculation for a set of 300k compounds about twenty-four hours ago and it is still running. Should it take so long? In that case, how long will it take to cluster libraries of 1-2 million compounds? Is there any way to speed up the calculation?
Windows XP x64 Edition
760 @ 2.80 GHz
Sorry for the late answer. Consider using -v to turn on verbose mode, which will print progress messages during the clustering process. In the case of sphere exclusion (and also bemis-murcko), clustering is done at input time.
Increasing the dissimilarity threshold (0.2 in your case) will decrease the cluster count and speed up the clustering process.
I must apologize: there was obviously some system failure when I used sphex with the "-d ecfp:tanimoto" option.
I recently tried it once more and it worked just fine, although much more slowly than the default cfp.
By the way, the default (CFP, as I understand it) produced clustering results that are more meaningful to me for compounds with fused heterocyclic systems. So I am curious what parameters JKlustor uses for both types of fingerprints:
bond depth, bit length, number of bits?
Is it possible to make any adjustments to these parameters for JKlustor in some XML configuration file?
Thank you very much in advance for your suggestions,
Gabor ChemAxon personnel
Joined: 29 May 2005
<ECFPConfiguration Version="0.1">
  <Parameters Length="1024" Diameter="4" Counts="no"/>
  <IdentifierConfiguration>
    <!-- Default atom properties (switched on by Value=1) -->
    <Property Name="AtomicNumber" Value="1"/>
    <Property Name="HeavyNeighborCount" Value="1"/>
    <Property Name="HCount" Value="1"/>
    <Property Name="FormalCharge" Value="1"/>
    <Property Name="IsRingAtom" Value="1"/>
    <!-- Other built-in atom properties (switched off by Value=0) -->
    <Property Name="ConnectionCount" Value="0"/>
    <Property Name="Valence" Value="0"/>
    <Property Name="Mass" Value="0"/>
    <Property Name="MassNumber" Value="0"/>
    <Property Name="HasAromaticBond" Value="0"/>
    <Property Name="IsTerminalAtom" Value="0"/>
    <Property Name="IsStereoAtom" Value="0"/>
  </IdentifierConfiguration>
  <StandardizerConfiguration Version="0.1">
    <Actions>
      <Action ID="aromatize" Act="aromatize"/>
      <RemoveExplicitH ID="RemoveExplicitH" Groups="target"/>
    </Actions>
  </StandardizerConfiguration>
  <ScreeningConfiguration>
    <ParametrizedMetrics>
      <ParametrizedMetric Name="Tanimoto" ActiveFamily="Generic" Metric="Tanimoto" Threshold="0.2"/>
      <ParametrizedMetric Name="Euclidean" ActiveFamily="Generic" Metric="Euclidean" Threshold="10"/>
    </ParametrizedMetrics>
  </ScreeningConfiguration>
</ECFPConfiguration>
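For reporting purposes, the main parameters can be pulled out of an ECFP configuration file of this shape with any XML parser. The following is only a sketch using Python's standard library on a shortened copy of the configuration; it reads the file independently of JKlustor, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the ECFP configuration, for illustration only.
ECFP_XML = """<ECFPConfiguration Version="0.1">
  <Parameters Length="1024" Diameter="4" Counts="no"/>
  <IdentifierConfiguration>
    <Property Name="AtomicNumber" Value="1"/>
    <Property Name="ConnectionCount" Value="0"/>
  </IdentifierConfiguration>
</ECFPConfiguration>"""

root = ET.fromstring(ECFP_XML)

# Attributes of the <Parameters> element: bit length, diameter, counts flag.
params = root.find("Parameters").attrib

# Names of the atom properties that are switched on (Value="1").
enabled = [p.get("Name") for p in root.iter("Property") if p.get("Value") == "1"]

print(params["Length"], params["Diameter"], enabled)
```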
It might be helpful for publishing data as well as for figuring out the range for the sphere radius adjustment.
Internally, JKlustor uses a 0 .. 1 dissimilarity range, where 0 usually means the most similar (identical fingerprints) and 1 the most dissimilar possible value (*). Generally, a useful approach is to start sphere exclusion clustering from a high dissimilarity radius (even 0.9) and then decrease it while checking the cluster sizes (on each subsequent import, use the "-v" option to turn on verbose mode or the "-o wrstat" option to obtain statistics).
The JKlustor web GUI provides a "matrix" view to compare the dissimilarity values of cluster representatives/centroids and individual structures. The binary representation of the fingerprint is also visualized on the individual structures page.
(*) For LibraryMC(E)S, a metric called "commonbits" is implemented, which is calculated by dividing the count of simultaneously set bits by the fingerprint length and subtracting the result from 1.0. This dissimilarity metric will not give 0 for identical structures.
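To illustrate why a larger radius gives fewer clusters (and fewer comparisons per incoming structure), here is a minimal single-pass sketch of sphere exclusion with Tanimoto dissimilarity. It is not JKlustor's implementation: the function names, the representation of a fingerprint as a set of on-bit positions, and the toy data are all assumptions made for the example.

```python
from typing import FrozenSet, List

Fingerprint = FrozenSet[int]  # positions of the bits that are set


def tanimoto_dissimilarity(a: Fingerprint, b: Fingerprint) -> float:
    """1 - |a & b| / |a | b|: 0 for identical fingerprints, 1 for no shared bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def sphere_exclusion(fps: List[Fingerprint], radius: float) -> List[List[int]]:
    """Assign each input to the first centre within `radius`, else start a new cluster."""
    centres: List[int] = []         # index of each cluster's centre
    clusters: List[List[int]] = []  # member indices per cluster
    for i, fp in enumerate(fps):
        for c, centre in enumerate(centres):
            if tanimoto_dissimilarity(fp, fps[centre]) <= radius:
                clusters[c].append(i)
                break
        else:
            # No existing sphere covers this structure: it becomes a new centre.
            centres.append(i)
            clusters.append([i])
    return clusters


# Toy fingerprints (hypothetical data).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 6, 7}),
    frozenset({8, 9, 10, 11}),
]

# Start from a high radius and decrease it, watching the cluster count grow.
for radius in (0.9, 0.5, 0.2):
    print(radius, sphere_exclusion(fps, radius))
```

Each incoming structure is compared only against the existing centres, so the work per structure grows with the number of clusters; a small radius such as 0.2 on a diverse library therefore produces many centres and many comparisons.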
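The "commonbits" calculation described in the footnote can be sketched in a few lines, which also shows why identical fingerprints do not get a dissimilarity of 0. This follows only the verbal description in the footnote; the function name and the set-of-bit-positions representation are assumptions, not JKlustor code.

```python
def commonbits_dissimilarity(a: set, b: set, length: int) -> float:
    """1 - (number of bits set in both fingerprints) / (fingerprint length)."""
    return 1.0 - len(a & b) / length


# A hypothetical 1024-bit fingerprint with four bits set, compared with itself.
fp = {3, 17, 42, 100}
print(commonbits_dissimilarity(fp, fp, 1024))  # 0.99609375, not 0
```

Under this definition the value reaches 0 only when every bit of the fingerprint is set in both structures.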
JKlustor uses the default fingerprint parameters, which are hardwired in the code. These cannot be modified in JKlustor; some kind of parameterization is a planned feature for the near future. In 5.7, the verbose mode in JKlustor will be extended to print the main parameters used (length, etc.).
The referenced files contain fingerprint configuration examples (which can be used in other products); the main fingerprint parameters in those files match the hardwired defaults. The configurations in these example files are not (and cannot be) read by JKlustor.
The actual hardwired configuration used in cfp (this modification in the contents of cfp.xml will be corrected in release 5.7):
<?xml version="1.0" encoding="UTF-8"?>
<ChemicalFingerprintConfiguration Version ="0.3" schemaLocation="cfp.xsd">