Document to Structure Developer's Guide

Version 5.8.2

Introduction

The DocumentExtractor class extracts chemical names from text documents and converts them to chemical structures.

 

Basic API usage

Example usage:

// We have a document to process
java.io.Reader document = ...;

DocumentExtractor x = new DocumentExtractor();
x.processHTML(document); // or processPlainText(document) for input in plain text format

// Iterate through the hits
for (Hit hit : x.getHits()) {
  System.out.println(hit.position + ": " + hit.text + ": " + hit.structure.toFormat("smiles"));
}

The field hit.position contains the position of the first character of the name in the document.

Note that hit.text contains the name exactly as it appears in the source document. A cleaned version of the name, with possible OCR errors and typos corrected, can be retrieved with hit.structure.getName().
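Because hit.position marks where a name starts and hit.text holds the name as it appears in the source, the two values together locate a hit inside the document. The sketch below shows one way to print a hit with some surrounding context. The HitContext class and its around method are hypothetical helpers, not part of the Document to Structure API. The sketch assumes that the position is a zero-based character offset into the same plain text that was passed to the extractor.

```java
// Hypothetical helper for illustration; not part of the Document to Structure API.
final class HitContext {

    /**
     * Returns the hit wrapped in square brackets, with up to `radius`
     * characters of the surrounding document on each side.
     * Assumes `position` is a zero-based offset into `document`.
     */
    static String around(String document, int position, String text, int radius) {
        int end = position + text.length();
        if (position < 0 || end > document.length()) {
            throw new IllegalArgumentException("hit lies outside the document");
        }
        int from = Math.max(0, position - radius);
        int to = Math.min(document.length(), end + radius);
        return document.substring(from, position)
                + "[" + document.substring(position, end) + "]"
                + document.substring(end, to);
    }

    public static void main(String[] args) {
        String document = "The sample was dissolved in ethanol and heated.";
        // These two values stand in for hit.position and hit.text.
        int position = document.indexOf("ethanol");
        String text = "ethanol";
        System.out.println(around(document, position, text, 10));
    }
}
```

In real code, the position and text arguments would come from hit.position and hit.text inside the loop over x.getHits().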

This class can also be run from the command line. It expects the name of a plain text file as the first argument; when no argument is given, it reads the document from standard input. The list of hits is printed to standard output.
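An application that calls the API directly may want the same input convention (a file name as the first argument, otherwise standard input). The sketch below shows one way to obtain a java.io.Reader that follows it. The InputSource class and its open method are hypothetical names, and UTF-8 is an assumed encoding; neither comes from the Document to Structure API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the Document to Structure API.
final class InputSource {

    /**
     * Opens the file named by the first argument, or wraps the fallback
     * stream (typically System.in) when no argument is given.
     * The UTF-8 encoding is an assumption of this sketch.
     */
    static Reader open(String[] args, InputStream fallback) throws IOException {
        if (args.length > 0) {
            return Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8);
        }
        return new BufferedReader(new InputStreamReader(fallback, StandardCharsets.UTF_8));
    }
}
```

The returned reader could then be passed to processPlainText, for example with InputSource.open(args, System.in) as the document.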

 
