's crowdsourced database - now in PubChem

Posted on January 23rd, 2013 at 12:10 pm by András Strácz

Over the past few years, one of the most surprising community effects has been playing out: the building of a huge crowdsourced database of interesting chemical structures. How?

Webpage Viewer and Document Viewer, two major services of the free website, save all visited URLs along with the chemical structures found on each page. I won’t detail how this is done, but it involves things like Document to Structure and being able to understand every chemical format out there. The site now has 15,000 unique visitors a month, which is huge growth compared to spring 2012. These users contribute to the database every day, keeping it up to date and bringing in new interests as well. The database today contains 327,000 structures that were converted from 545,000 names and identifiers coming from 367,000 webpages. To understand the value of this database, we put it to the test and submitted it to PubChem.

The process was easy: sign up for a depositor account, create an SDF with the structures and some data fields (registration ID, substance URL), and, after a few tries, some fighting with their standardization tools, and curation by a nice fella, it’s now in PubChem’s Substance database. Before any evaluation, we became the 27th biggest depositor among 219. That’s huge!
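For readers who have not built a deposition file before, here is a minimal sketch of what "an SDF with the structures and some data fields" can look like. It is not the script we used. The field names `REGISTRATION_ID` and `SUBSTANCE_URL`, the ID value, and the example URL are placeholders chosen for illustration, and the exact tag names a depositor has to use should be taken from PubChem's deposition documentation. The molblock is a hand-written single-carbon example; in practice it would come from a cheminformatics toolkit.

```python
from typing import Mapping

# Illustrative V2000 molblock for a single carbon atom (methane, implicit H).
# A real pipeline would export this from a cheminformatics toolkit.
EXAMPLE_MOLBLOCK = "\n".join([
    "example-structure",   # header line 1: molecule name
    "  hand-written",      # header line 2: program/timestamp line
    "",                    # header line 3: comment
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])


def sdf_record(molblock: str, fields: Mapping[str, str]) -> str:
    """Return one SDF record: molblock, data fields, then the $$$$ delimiter."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")  # data header line
        lines.append(str(value))      # data value line
        lines.append("")              # blank line terminates the data item
    lines.append("$$$$")              # end-of-record delimiter
    return "\n".join(lines) + "\n"


# Hypothetical field names and values, for illustration only.
record = sdf_record(
    EXAMPLE_MOLBLOCK,
    {
        "REGISTRATION_ID": "EX-000001",
        "SUBSTANCE_URL": "https://example.org/structure/EX-000001",
    },
)
print(record)
```

A full deposition file is simply many such records concatenated, one per structure.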

What’s immediately clear is that 20-25% of our database is brand new to PubChem. Filtering by the Rule of 5, we still have 38,000 structures; filtering to same parent and connectivity, 42,000. PubChem has 35 million structures (or more) from a huge array of sources, and this tiny website with its community provided a large set of novel structures? Now that’s huge!
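As a reminder of what a Rule of 5 filter checks, here is a small sketch of Lipinski's criteria: molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. It assumes the descriptors have already been computed elsewhere. The `Descriptors` class, the `max_violations` parameter, and the two sample entries are made up for illustration, and this is not necessarily how the filter on PubChem's side is implemented (some variants tolerate one violation).

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one structure (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donor count
    h_acceptors: int    # hydrogen-bond acceptor count


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four criteria the structure breaks."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 0) -> bool:
    """Strict by default; pass max_violations=1 for the lenient variant."""
    return rule_of_five_violations(d) <= max_violations


# Made-up descriptor values, not taken from the deposited data.
small = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
large = Descriptors(mol_weight=812.0, logp=6.3, h_donors=3, h_acceptors=12)

kept = [d for d in (small, large) if passes_rule_of_five(d)]
print(len(kept))
```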

Follow these links to start playing with the data:

The exact value of this deposition will take weeks and months to realize, but some are already working on it. A hat tip to Chris Southan, who also gave us the idea. If you have interesting findings, do share!

2 Responses to “'s crowdsourced database - now in PubChem”

  1. Those who read András Strácz’ post right through to the end will have seen a brief mention of a blog item by Chris Southan. It is worth reading Chris’ article. He also mentions (reference 82) in Southan, C.; Williams A. J.; Ekins, S. Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates. Drug Discovery Today 2103, 18(1-2), 58-70. Chris and his colleagues have been finding structures for the NCATS58 set. I was interested because I have just transcribed a talk Chris Lipinski gave at the fall 2012 ACS meeting. Chris and Tudor Oprea have been working on both structures and targets in NCATS58. Shameless plug for my meeting report due out in early February. It has a paper by ChemAxon’s Daniel Bonniot de Ruisselet in it too :-)

  2. Oops! Discovery Today 2013, 18(1-2), 58-70 (not 2103)
