Document to Structure
extract chemical information from documents
Powered by ChemAxon’s Naming technology, Document to Structure (D2S) is a versatile application for extracting chemical information from documents. It recognizes various types of chemical information in many document formats. D2S also applies text OCR and image OSR technologies to extract information from non-searchable PDF documents. Each converted structure is returned together with its location in the document. All these features make D2S an excellent choice for text mining, patent analysis and internal document management.
Retrieve Chemical Information from Documents
Recognition
Based on the Naming technology, D2S recognizes various types of chemical information and converts them to structures, including IUPAC names, common names, drug trade names, SMILES, InChI and CAS registry numbers. D2S can also apply Optical Structure Recognition (OSR) technology to convert structure images to structures. (D2S currently supports CLiDE, OSRA, and Imago. A separate license may be required.) D2S can distinguish a structure image from a non-structure image (e.g. an IC50 plot), which reduces noise in the results.
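As an illustration of what recognizing one of these identifier types involves, the sketch below finds CAS registry numbers in plain text and checks them against the publicly documented CAS check-digit rule. It is not D2S code and does not reflect how D2S is implemented; the names `CAS_PATTERN`, `is_valid_cas` and `find_cas_numbers` are hypothetical.

```python
import re

# A CAS Registry Number has the form NNNNNNN-NN-N: 2 to 7 digits,
# then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(candidate: str) -> bool:
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.fullmatch(candidate)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one;
    # the weighted sum modulo 10 must equal the check digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


def find_cas_numbers(text: str) -> list[tuple[int, str]]:
    """Return (character offset, CAS number) for every valid CAS number in the text."""
    return [(m.start(), m.group(0))
            for m in CAS_PATTERN.finditer(text)
            if is_valid_cas(m.group(0))]


if __name__ == "__main__":
    sample = "Aspirin (50-78-2) and water 7732-18-5, but not 50-78-3."
    # 50-78-3 has a wrong check digit, so it is not reported.
    print(find_cas_numbers(sample))
```

The check digit filters out many strings that merely look like CAS numbers, which is one simple way to keep false positives out of extraction results.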
Versatility
D2S supports a wide range of document formats, including PDF, TXT, HTML, XML, MS Office documents (DOC, DOCX, PPT, PPTX, XLS, XLSX), OpenOffice ODT, etc. Structure objects embedded in Office documents (ChemDraw, SymyxDraw, MarvinSketch, etc.) are extracted as structures. Various image formats, such as TIFF and BMP, are also supported.
Readability
Since version 5.9, D2S also works on non-searchable PDF documents, i.e. PDFs that consist of page images rather than actual text. D2S applies OCR technology to convert an image PDF into text, then locates all chemical information in it. Because of the limitations of OCR technology, the converted text may contain errors. D2S uses a proprietary correction algorithm to identify common OCR mistakes and revise the text to the correct chemical names. Since many chemical patents are published as image-based PDFs, this feature is very useful for patent mining.
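The D2S correction algorithm is proprietary and is not described here. Purely to illustrate the general idea of repairing OCR mistakes in chemical names, the sketch below snaps a misread token to the closest entry in a small vocabulary using Python's standard `difflib` module. The vocabulary `KNOWN_NAMES`, the function `correct_token` and the 0.8 similarity cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical vocabulary; a real system would use a far larger dictionary.
KNOWN_NAMES = ["benzene", "toluene", "phenol", "acetone"]


def correct_token(token: str, vocabulary=KNOWN_NAMES, cutoff: float = 0.8) -> str:
    """Return the closest known name if it is similar enough, else the token unchanged."""
    if token in vocabulary:
        return token
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and drops any candidate scoring below the cutoff.
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


if __name__ == "__main__":
    # Typical OCR confusions: "e" read as "c", "l" read as "1".
    for word in ["benzcne", "to1uene", "patent"]:
        print(word, "->", correct_token(word))
```

A high cutoff keeps ordinary words from being rewritten into chemical names, at the cost of missing heavily damaged tokens.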
Traceability
A document of chemical importance (e.g. a chemical patent) can be many pages long. It would take a scientist a long time to locate a particular chemical structure in it, especially when the structure appears in textual form. With D2S, each structure retrieved from a PDF document is returned with its location in the document and the type of the original chemical information (IUPAC name, image, SMILES, etc.). This is a powerful text mining function that can save scientists hours when reading a chemical patent.
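To make the idea of traceable results concrete, here is a minimal sketch of how a consumer might model extracted hits and look up where a given structure occurs. The record layout is an assumption for illustration only and is not the D2S output format; `StructureHit`, its fields and `pages_by_structure` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StructureHit:
    """One extracted structure plus where it came from (hypothetical layout)."""
    smiles: str        # the converted structure, written here as SMILES
    source_type: str   # kind of original information, e.g. "IUPAC name", "image"
    source_text: str   # the original text, or a label for an image
    page: int          # page on which the information was found


def pages_by_structure(hits: list[StructureHit]) -> dict[str, list[int]]:
    """Map each structure to the sorted list of pages on which it occurs."""
    index: dict[str, set[int]] = defaultdict(set)
    for hit in hits:
        index[hit.smiles].add(hit.page)
    return {smiles: sorted(pages) for smiles, pages in index.items()}


if __name__ == "__main__":
    hits = [
        StructureHit("c1ccccc1", "common name", "benzene", page=3),
        StructureHit("CCO", "IUPAC name", "ethanol", page=3),
        StructureHit("c1ccccc1", "image", "structure image", page=12),
    ]
    print(pages_by_structure(hits))
```

Keeping the source type next to the location lets a reader jump straight to the relevant page and see whether the structure came from a name, a line notation or an image.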
Availability
D2S can be used as a standalone tool. Documents can be opened directly from MarvinView, and the results are displayed there in tabular format. D2S is also integrated into ChemAxon’s database management tool, Instant JChem, and into JChem for Office; a separate D2S license is required. Documents can be opened directly in these tools, and the D2S results are imported as a database table. D2S is also available in workflow tools such as Pipeline Pilot and KNIME, as part of the ChemAxon component collection. Like many other ChemAxon tools, D2S is also available as a command-line tool for batch processing of multiple documents, and as an API for custom-developed systems.
To streamline bulk processing, a new sister product, Document to Database (D2DB), is coming out. D2DB can crawl a list of documents from a file system or a Documentum repository. It generates a database that indexes all structures together with the document information. All this information can be visualized and searched through Instant JChem or a web application. Please contact us for a pre-release version or to request support for other document sources.
Chemicalize.org – Find Chemical Structures on the Web
We have built Document to Structure functionality into a free website, chemicalize.org, which extracts chemical information from webpages and documents. When a webpage URL is submitted, the page opens in the Webpage Viewer with all chemical information converted to 2D structures.
The Document Viewer brings the same capabilities to PDF files: they can be viewed in the user’s local browser with all recognized structures shown in the text. All structures are summarized at the top of the page and can be downloaded as a structure file. Clicking a structure in the summary highlights all occurrences of the same structure information on the entire page. All extracted structures are saved on the web server and can be queried through structure searches. For more information, please visit www.chemicalize.org