1. i have academic license for jchem and generatemd
2. i have a windows xp desktop pc
3. i have a file conitaing 50,000~ molecules represented as smiles
4. i wish to compute as many possible descriptors as i can for qspr/qsar
Hi,
You know how to program in JAVA and to use Eclipse?
Then you can use the book Molecular Descriptors from Todeschini and
programm all the descriptors with the JCHEM API.
If you don't know JAVA its going to be harder. You can
in principle use cxcalc or generateMD to generate all the descriptors using
a XML command sheet. Or you can use Instant-JChem to
generate most of the descriptors, but due to current
restrictions Instant-JChem can only have 256 or 512 columns
so you can not include all the possible descriptors.
Furthermore it is not possible to apply a general XML sheet
(write once-use often) to generate most of the descriptors.
Now you also have to be clear, what kind of descriptors you
want to use 0D,1D,2D,3D,4D molecular descriptors? All of them? (See Engel\Gasteiger Cheminformatics)
* 0D - bond counts, mol weight, atom counts
* 1D - fragment counts, H-Bond acc/don, Crippen, PSA, SMARTS
* 2D - topological descriptors (Balaban, Randic, Wiener, BCUT, kappa, chi)
* 3D - geometrical descriptors (3D WHIM, 3D autocorrelation, 3D-Morse) + surface properties + COMFA
* 4D - 3D coordinates + conformations (JCHEM conformer, CORINA, gold set, Crystaleye)
The good thing about the JCHEM API is, that in principle you can implement most of the stuff very easily. Those
functions are attached at the bottom. The 1D fragment counts can be implemented using a SMARTS matcher function.
Among those fingerprints are the PubChem Fingerprints or the public
OpenBabel SMARTS implementation. You can also use MCS maximum common substructures (LIBMCS) to create such
patterns only for your dataset or any other dataset (like PubChem).
You can easily calculate 2000 descriptors with different
software applications, see moleculardescriptors.eu
For a small test set of 150 molecules you can use VCCLAB from Igor Tetko for testing the effectiveness of some of
the descriptors (you want to implement with the JCHEM API).
Or you can use JOELIB or better the CDK Descriptor Calculator GUI from Rajarshi Guha.
Beware! Most of the descriptors you can calculate
will have no impact. You need to use feature selection to find useful descriptors for regression or classification.
It is also helpful to prevent overfitting by dividing your dataset into a 70% development and 30% test set
and have a independent external validation set at hand.
You can additionally use v-fold cross-validation or bootstrapping for your development test set.
All those methods are known since the 70s of the last century.
Do not use the R^2=0.999999999 linear fit scam.
Use prediction errors or R^2, Q^2 for independent datasets or other measurements (do not fool yourself).
For the classification or regression statistics it absolutely
does not matter which method you use. The best case is to test all methods or build ensemble methods or group contribution methods which may include:
Generalized Linear Models (GLM)
General Discriminant Analysis
Binary logit (logistic) regression
Binary probit regression
Nonlinear models
Multivariate adaptive regression splines (MARS)
Tree models
Standard Classification Trees (CART)
Standard General Chi-square Automatic Interaction Detector (CHAID)
Exhaustive CHAID
Boosting classification trees
Neural Networks
Multilayer Perceptron
neural network (MLP)
Radial Basis Function neural network (RBF)
Machine Learning
Support Vector Machines (SVM)
Naive Bayes classifier
k-Nearest Neighbors (KNN)
Fragment counts using OpenBabel counts and the
JCHEM SMARTS matching function:
Code:
# SMARTS Patterns for Functional Group Classification
#
# written by Christian Laggner
# Copyright 2005 Inte:Ligand Software-Entwicklungs und Consulting GmbH
#
# Released under the Lesser General Public License (LGPL license)
# see http://www.gnu.org/copyleft/lesser.html
# Modified from Version 221105
# Project homepage: http://sourceforge.net/projects/openbabel
hey thanks a lot.
i was hoping that there would be some kind of simple answer.
i know a lot about ml stuff.
but biggest problem is getting the descriptors.
vcc lab allows only 150 molecules to be processed at a time see ms i will have to go back to cdk etc.
thanks for help
B) Going back to CDK does not help you if you
can not program in JAVA. If you can program in JAVA
its like that:
1) Use MolImporter
2) Load and loop through all molecules
3) Initialize the plugin (see table above)
4) Perform calculation
5) Output calculation
For each of the plugins from the large list above
you can repeat that by simply calling them and
adding more functions, for the topological descriptors it
looks like that, and to be honest I am not quite
sure what is simpler than that (if you know JAVA).
The code is not pretty but it works and its quickly to built.
/** Defines a MolImporter object to the structure file. */
private static MolImporter createMolImporter(String filename) {
MolImporter mi = null;
try{
File f = new File(filename);
FileInputStream fis = new FileInputStream(f);
mi = new MolImporter(fis);
} catch(FileNotFoundException ex) {
System.err.println(filename+": not found");
System.exit(1);
} catch(MolFormatException ex) {
System.err.println(filename+": "+ex.getMessage());
System.exit(1);
} catch(Exception ex) {
System.err.println("Error: "+filename+" is not a structure file.");
System.exit(1);
}
return mi;
}
/** counts molecules from a structure file. */
private static long countMolecules(String filename) throws PluginException, MolFormatException, IOException
{
MolImporter mi = createMolImporter(filename);
long globalmolcounter = 0;
while (( mi.read()) != null) {
globalmolcounter++;
}
mi.close();
return globalmolcounter;
}
public static void main(String[] args) throws PluginException, MolFormatException, IOException {
String filename = "d:/temp/c6h6.smi";
System.out.println("Number of molecules in " + filename+ ": "+ countMolecules(filename));
MolImporter mi = createMolImporter(filename);
TopologyAnalyserPlugin topologyplugin = new TopologyAnalyserPlugin();
// for each input molecule run the calculation and display the results
Molecule target = null; long molcounter = 0; long totalerrors = 0;
while ((target = mi.read()) != null) {
// set the input molecule
topologyplugin.setMolecule(target);
try {
// run the calculation
topologyplugin.run();
//conversion double to string - if you want calculations with doubles use tempXXX
//loss of precision possible 12-decimals
java.text.DecimalFormat df12 = new java.text.DecimalFormat("0.000000000000");
// maybe prettier to put them in array or LIST ?
int count = target.getAtomCount();
int aliphaticatomCount = topologyplugin.getAliphaticAtomCount();
int aliphaticbondcount = topologyplugin.getAliphaticBondCount();
int aliphaticringcount = topologyplugin.getAliphaticRingCount();
int aromaticatomcount = topologyplugin.getAromaticAtomCount();
int aromaticbondcount = topologyplugin.getAromaticBondCount();
int aromaticringcount = topologyplugin.getAromaticRingCount();
int asymmetricatomcount = topologyplugin.getAsymmetricAtomCount();
double tempbalabanindex = topologyplugin.getBalabanIndex();
String balabanindex = df12.format(tempbalabanindex);
int bondcount = topologyplugin.getBondCount();
int carboaromaticringcount = topologyplugin.getCarboaromaticRingCount();
int carboringcount = topologyplugin.getCarboRingCount();
int chainatomcount = topologyplugin.getChainAtomCount();
int chainbondcount = topologyplugin.getChainBondCount();
int chiralcentercount = topologyplugin.getChiralCenterCount();
boolean tempconnectedGraph = topologyplugin.isConnectedGraph();
int connectedGraph= tempconnectedGraph?1:0;
int cyclomaticNumber = topologyplugin.getCyclomaticNumber();
int fusedaliphaticringcount = topologyplugin.getFusedAliphaticRingCount();
int fusedaromaticringcount = topologyplugin.getFusedAromaticRingCount();
int fusedringcount = topologyplugin.getFusedRingCount();
double temphararyIndex = topologyplugin.getHararyIndex();
String hararyIndex = df12.format(temphararyIndex);
int heteroaromaticringcount = topologyplugin.getHeteroaromaticRingCount();
int heteroringcount = topologyplugin.getHeteroRingCount();
int hyperWienerIndex = topologyplugin.getHyperWienerIndex();
int largestringsize = topologyplugin.getLargestRingSize();
int plattIndex = topologyplugin.getPlattIndex();
double temprandicIndex = topologyplugin.getRandicIndex();
String randicIndex = df12.format(temprandicIndex);
int ringatomcount = topologyplugin.getRingAtomCount();
int ringbondcount = topologyplugin.getRingBondCount();
int ringcount = topologyplugin.getRingCount();
int rotatablebondcount = topologyplugin.getRotatableBondCount();
int smallestringsize = topologyplugin.getSmallestRingSize();
int szegedIndex = topologyplugin.getSzegedIndex();
int wienerIndex = topologyplugin.getWienerIndex();
int wienerPolarity = topologyplugin.getWienerPolarity();
I added the three files, the example SMILES from all
C6H6 isomers were calculated using the CDK.
Given the fact that you only need 5 lines of code
with the JChem API which actually perform the calculation
I think its quite simple. Its actually a no brainer. Just adding
up routines. The only thing which would be nice to have
a parser which loops through all the XML properties
and then automatically adds each new descriptor
and a calculation line to the JAVA code. But this would require
some serious programming and I am just too lazy for that,
or lets say that goes beyond my programming knowledges.
1. i have academic license for jchem and generatemd
2. i have a windows xp desktop pc
3. i have a file conitaing 50,000~ molecules represented as smiles
4. i wish to compute as many possible descriptors as i can for qspr/qsar
Hi,
as an academic user you are entitled to use all jchem tools without any size or other ind of limitations. Your ~50K compounds can be processed without any problems (i.e. practical memory or time limits will not be reached).
You can use generatemd to calculate complex descriptors like fingerprints, you can even incorporate your own descriptors, in which case, however, you need to write some java code.
As Tobias mentioned, cxcalc can also be quite useful and relevant for your prject. That program can calculate a large number of physico-chemical properties as well as topological and geometrical descriptors and write results in standard text files that are easy to process further. For a detailed list of avaialble properties you may wish to follow this link:
finally i could caculate the chemical fingerprints using
generatemd c aids.sdf -k CF -o descp.txt
however the desc.txt contains 34 integers
what do theses integers represent are these binary fingerprints?
if yes how can i get 1/0 values?
mvargyas
Joined: 21 May 2004
Posts: 801
ChemAxon personnel
great!
What you got in the output file is a binary fingerprint in decimal text representation. Each consecutive 32 bits of the binary fingerprint are respected as an integer value and that value is printed in decimal format as readable text. This is a compact representation, much shorter than a 0,1 text. If you insists on using 0,1 text then add the -2 flag to the command line of generatemd. (See the command line help, generatemd -x ). The user's guide may also be useful: http://www.chemaxon.com/jchem/doc/user/GenerateMD.html.)
I still do not understand your real goal, but in most cases the binary text format is not needed and not so useful. For any kind of calculations the integers are just fine, you can directly compare them by tanimoto etc.
i want to use those fingerprints as descriptors.
also i want to load them into matlab for further calculation of tanimoto etc hence i am using text format so that i can load the delimited file into matlab.
however i would like to know whether is it possible to calculate a similarity matrix i.e. there are 4773 molecules (bursi mutagencity) i want a tanimoto similarity matrix 4773 *4773 similarity values is it possible with screenmd?
Thanks
and you call it with evaluator like this (but beware this does not work
for logP because it is a real number and the array function
is only defined as integer:
to calculate the similarity matrix i tried following code using screenmd.
csa.sdf csa2.sdf are same files containing same molecules.
i used
screenmd csa.sdf csa2.sdf -g -k CF -M Tanimoto -o output.txt
it seems to work.
actually while writing this post the calculation is going on.
mvargyas
Joined: 21 May 2004
Posts: 801
ChemAxon personnel
Indeed, you can calculate the similarity matrix using screenmd, just make sure that the dissimilarity threshold is 1 (for tanimoto, or a very large number when using Euclidean metric).
Regarding the use of the chemical fingerprint as a descriptor: it is possible to use the decimal values for further analysis, e.g. in matlab, there is no need to use the binary 0,1 text format. However, if you would like to perform any kind of dimension reduction then the binary form must be used.
thanks Miklos
i got the similarity matrix.
could you tell me how dissimilarity threshold affect the whole procedure and how can i set it using command line?
also i want similarity values between 0 ~1. 1 indicates most similiar or equivalent molecule.
thanks
This is because you calculate the dissimilarity... The dissimilarity is often preferred over similarity as there are many common metrics (e.g. Euclidean) that aren't similarity but distance metrics and thus aren't upper bounded.
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum You cannot attach files in this forum You can download files in this forum