Machine Learning Applications with JChem

September 2013 Author: ()

We present a method for automating the identification and optimization of predictive models in drug discovery. The technique selects appropriate fingerprints and learning methods to produce models of maximal predictive quality. The octanol-water distribution coefficient (logD) is difficult to predict but highly impactful because it correlates with many ADME properties, and we demonstrate how our automated approach produced a high-quality model using in-house data alone. This is in contrast to “first principles” approaches, where an accurate method for predicting logD has remained elusive because it relies on subtle energy differences between multiple microstates. Our method uses JChem extended-connectivity fingerprints (ECFP) as molecular feature descriptors, together with the open source machine learning software Weka. Model parameters are automatically optimized by the open source Autocorrelator tool, which interfaces with the Sun Grid Engine to make efficient use of our computational grid. Our logD model has maintained a strong correlation (R2 = 0.66) since its internal deployment, with over 800 molecules evaluated prior to synthesis. In this presentation we will provide example code showing how we employed modern machine learning algorithms to generate predictive models using the JChem and Weka Java APIs. Furthermore, the strategy has been generalized and applied to several other ADME properties, such as PGP efflux, P450 inhibition, and solubility.
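As a small illustration of the quality metric quoted above, the following Java sketch computes a squared correlation between measured and predicted values. It assumes that R2 denotes the squared Pearson correlation coefficient; the abstract does not state which definition was used, and the class and method names here are hypothetical, not part of JChem, Weka, or Autocorrelator.

```java
// Minimal sketch (assumption: R2 = squared Pearson correlation between
// measured and predicted logD). Names are illustrative only.
final class CorrelationSketch {

    private CorrelationSketch() {}

    /** Returns the squared Pearson correlation of two equal-length arrays. */
    static double rSquared(double[] measured, double[] predicted) {
        if (measured.length != predicted.length || measured.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = measured.length;

        // Means of both series.
        double meanM = 0.0, meanP = 0.0;
        for (int i = 0; i < n; i++) {
            meanM += measured[i];
            meanP += predicted[i];
        }
        meanM /= n;
        meanP /= n;

        // Sums of co-deviations and squared deviations.
        double cov = 0.0, varM = 0.0, varP = 0.0;
        for (int i = 0; i < n; i++) {
            double dm = measured[i] - meanM;
            double dp = predicted[i] - meanP;
            cov += dm * dp;
            varM += dm * dm;
            varP += dp * dp;
        }
        if (varM == 0.0 || varP == 0.0) {
            throw new IllegalArgumentException("correlation undefined for constant input");
        }
        return (cov * cov) / (varM * varP);
    }
}
```

Calling `CorrelationSketch.rSquared(measured, predicted)` on the experimental and predicted logD values of the prospectively evaluated molecules would give a figure comparable to the one reported, under the stated assumption about the definition of R2.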