Exploration and Visualisation of Chemistry in Patents with Marvin and Instant JChem
We present a grid-based solution for chemical named entity recognition (NER) in full-text patent collections provided in native PDF format. Our architecture identifies and extracts IUPAC and trivial names of chemical compounds and translates them into InChI keys that can subsequently be used to generate a structure for each identified entity with Marvin. Finally, all structures are stamped into the original PDF as ‘pop-up’ chemicals, together with hyperlinks to the corresponding ChemSpider and PubMed pages. A generated bookmark tree inside the PDF gives convenient access to all identified compounds. Additionally, all retrieved chemicals are stored in a ChemAxon Instant JChem database together with a reference to the original patent. Instant JChem enables structure search across the processed patent collection and offers various filtering options. The workflow is based on UIMA (Unstructured Information Management Architecture) and can easily be adapted to incorporate different chemical NER tools. UNICORE is used to access grid resources for efficient parallelization of all processes.
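The pluggable, parallel design described above can be illustrated with a minimal Python sketch. It is not the UIMA/UNICORE implementation: the names `Mention`, `make_dictionary_recognizer`, and `process_collection` are hypothetical, a simple dictionary matcher stands in for a real chemical NER tool, and a local thread pool stands in for grid resources. PDF handling, name-to-InChI translation, structure generation, and database storage are omitted.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Mention:
    """One recognized chemical name with its character offsets in a document."""
    doc_id: str
    start: int
    end: int
    text: str


# Any NER tool can be plugged in as long as it has this shape:
# (doc_id, full_text) -> list of mentions.
Recognizer = Callable[[str, str], List[Mention]]


def make_dictionary_recognizer(names: List[str]) -> Recognizer:
    """Build a toy recognizer that matches a fixed list of names (assumption:
    a real system would use a trained or rule-based chemical NER component)."""
    # Try longer names first so a longer name wins over one of its prefixes.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )

    def recognize(doc_id: str, text: str) -> List[Mention]:
        return [
            Mention(doc_id, m.start(), m.end(), m.group(0))
            for m in pattern.finditer(text)
        ]

    return recognize


def process_collection(
    docs: Dict[str, str], recognizer: Recognizer, workers: int = 4
) -> Dict[str, List[str]]:
    """Run the recognizer over all documents in parallel and return an index
    from lower-cased compound name to the documents that mention it."""
    index: Dict[str, List[str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so the index is deterministic.
        for mentions in pool.map(lambda item: recognizer(*item), docs.items()):
            for mention in mentions:
                doc_ids = index.setdefault(mention.text.lower(), [])
                if mention.doc_id not in doc_ids:
                    doc_ids.append(mention.doc_id)
    return index
```

Because the pipeline depends only on the `Recognizer` signature, swapping in a different NER tool means passing a different callable to `process_collection`. The resulting index plays the role of the compound-to-patent references kept in the database.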