Automated Extraction of Chemical Knowledge from Patents with JChem
Here we present our automated system for extracting compounds and gene names from patents. We employ a general method that can operate on patents from all patent-issuing authorities in a number of formats. After all compounds have been extracted from a patent, further analysis identifies and marks the compounds believed to be exemplars of the chemical matter covered by the patent, as opposed to reagents, standards, etc. Finally, to summarize the chemical matter represented in the patent, the algorithm selects a “centroid” compound, defined as the compound that has the largest number of nearest neighbors in the patent. In general, the complete perception takes less than one minute per patent. With multithreaded code that uses multiple processor cores simultaneously, it is therefore possible to perform a comprehensive analysis of hundreds to thousands of patents in a day. The speed of our approach to analyzing patents opens up opportunities for a diverse set of applications.
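The centroid selection described above can be illustrated with a minimal sketch. It is not the system's actual implementation: it assumes that each compound is represented by a fingerprint (a set of feature identifiers), that two compounds are "nearest neighbors" when their Tanimoto similarity meets a cutoff, and that the cutoff is 0.7. The similarity measure, the cutoff, the tie-breaking rule, and the names `tanimoto` and `select_centroid` are all assumptions made for illustration; the abstract specifies none of them.

```python
from typing import Dict, FrozenSet, Optional

# Assumed representation: a fingerprint is a set of integer feature ids.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: shared features divided by total distinct features."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def select_centroid(
    fingerprints: Dict[str, Fingerprint],
    threshold: float = 0.7,  # assumed cutoff, not taken from the source
) -> Optional[str]:
    """Return the id of the compound with the most neighbors at or above threshold.

    Ties are broken by insertion order (an arbitrary choice for this sketch).
    Returns None when no compounds are given.
    """
    if not fingerprints:
        return None
    ids = list(fingerprints)
    neighbor_counts = {cid: 0 for cid in ids}
    # Compare each unordered pair once and credit both compounds.
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            if tanimoto(fingerprints[first], fingerprints[second]) >= threshold:
                neighbor_counts[first] += 1
                neighbor_counts[second] += 1
    return max(ids, key=lambda cid: neighbor_counts[cid])


if __name__ == "__main__":
    # Hypothetical toy data: B is similar to both A and C; D is an outlier.
    example = {
        "A": frozenset({1, 2, 3, 4}),
        "B": frozenset({1, 2, 3, 4, 5}),
        "C": frozenset({1, 2, 3, 5}),
        "D": frozenset({10, 11, 12}),
    }
    print(select_centroid(example))  # B
```

The pairwise comparison is quadratic in the number of compounds, which is acceptable for the compound counts of a single patent; a production system would more likely rely on a cheminformatics toolkit for fingerprints and similarity search.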