Small Molecules in Big Data: Proceed with Caution!

September 2016

The Schürer research group at the University of Miami is one of three sites of the Big Data to Knowledge (BD2K) Data Coordination and Integration Center (DCIC) for the Library of Integrated Network-based Cellular Signatures (LINCS) program. BD2K and LINCS are NIH Common Fund projects that aim to accelerate data science approaches in biomedical research. The LINCS project generates large datasets of cellular response signatures for small-molecule, genetic, or environmental perturbations. Signatures include transcriptional, proteomic, cell phenotype, and biochemical profiles in various cell model systems. The BD2K-LINCS DCIC encompasses operational and discovery components; its major tasks include processing these diverse datasets, integrating them, making them accessible in a variety of ways, and developing data analytics tools and pipelines. One of the tools developed at the DCIC is the LINCS Data Portal (LDP). LDP provides integrated access to all LINCS datasets, metadata, aggregate information, and links to data analysis tools. To populate LDP, we developed several data processing pipelines and registration functionality.

Registration and integration of small-molecule chemical structure information required a strict standardization protocol. This protocol includes automated and manual curation steps to correct or clean up chemical structures, including those of FDA-approved drugs and clinical compounds. In both steps, we have made use of ChemAxon tools, such as Standardizer, Calculator Plugins, Instant JChem, and Reactor, while also leveraging their Pipeline Pilot integration. The resulting standardized chemical structures are stored in JChem. We recently added a chemical structure search to LDP that allows users to find compound activity data by structure and by common substructure. This search is implemented using Marvin for JavaScript.
In summary, we have implemented an end-to-end data processing pipeline that includes chemical structure registration, together with a chemically aware Data Portal powered by ChemAxon.
