Technical Support Forum
document to structure
surojit.sadhu

Joined: 13 Jan 2011
Posts: 18

Posted: Thu Dec 29, 2011 1:41 pm    Post subject: document to structure

The attachment contains IUPAC names, some of them fairly complex, and I want to convert them to structures.

I tried using document to structure through MarvinView, but none of these IUPAC names were converted.

Please let me know what the issue is here.




Attachment: Patent with chemical names.64-69.pdf (248.61 KB)
alexa
ChemAxon personnel
Joined: 17 May 2004
Posts: 2011

Posted: Thu Dec 29, 2011 6:58 pm

Hi, the attached PDF is a scanned image, and there is no support yet for extracting text from scanned images. Did you try, or can you get, the text version of the PDF?

surojit.sadhu

Joined: 13 Jan 2011
Posts: 18

Posted: Fri Dec 30, 2011 7:21 am

Hi Alexa,

Thanks for your reply.

I used the Marvin 5.8 test version, which is described as follows: "Automatic text OCR (optical character recognition) has been added to support document to structure conversion of scanned (non searchable) PDF documents." I could extract a lot of structures from the full PDF file, but these particular pages were not converted.

Let's take the following four names as a case study:

 

(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-1-{3-[(phenylcarbonyl)sulfanyl]propanoyl}-L-proline ditrifluoroacetate salt

N-[(1S)-1-carboxy-3-phenylpropyl]-L-alanyl-(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-L-proline ditrifluoroacetate salt

Methyl N-[(2S)-1-ethoxy-1-oxo-4-phenylbutan-2-yl]-L-alanyl-(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-L-prolinate ditrifluoroacetate salt

N-[(1S)-1-carboxy-3-phenylpropyl]-L-valyl-(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-L-proline dilithium salt.

If I type the first name in MarvinSketch -> Edit -> Import name without the "ditrifluoroacetate salt" part, the structure is retrieved.

All the others give an error, with or without the salt component. I have attached the error log.




Attachment: naming error.txt (4.32 KB)
dbonniot
ChemAxon personnel
Joined: 20 Mar 2006
Posts: 322

Posted: Fri Dec 30, 2011 11:43 pm

Hi Surojit,

Thanks for the testing and the detailed report. I'm glad you found extraction working in many cases.

For the attached pages, it seems the main problem is that the patent has line numbers at the beginning of each line. For names that span two lines, the line number ends up in the middle of the name, which prevents the conversion. We will work on a solution, probably for 5.9.
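To illustrate the line-number problem, here is a minimal Python sketch of a possible text pre-processing step. It is not ChemAxon's implementation or the planned 5.9 fix, only a workaround you could apply to extracted text yourself. It assumes that a line number is 1 to 3 digits at the start of a line followed by whitespace, and that a name wrapped across lines breaks right after a hyphen. The function names are hypothetical.

```python
import re

# Assumption: a patent line number is 1-3 digits at the start of a line,
# followed by whitespace. Locants such as "4-{" or "2,4,5-" are followed
# by "-" or ",", so this pattern leaves them alone.
LINE_NUMBER = re.compile(r"^\s*\d{1,3}\s+")


def strip_line_numbers(text: str) -> str:
    """Remove a leading line number from every line of the extracted text."""
    return "\n".join(LINE_NUMBER.sub("", line).strip() for line in text.splitlines())


def join_wrapped_lines(text: str) -> str:
    """Join lines into one string.

    A line ending in "-" is glued to the next line without a space
    (the name was wrapped after a hyphen); otherwise a space is inserted.
    """
    joined = ""
    for line in text.splitlines():
        if not line:
            continue
        if joined and not joined.endswith("-"):
            joined += " "
        joined += line
    return joined


def clean_extracted_name(text: str) -> str:
    """Strip line numbers, then rejoin the wrapped name."""
    return join_wrapped_lines(strip_line_numbers(text))


sample = (
    "5 N-[(1S)-1-carboxy-3-phenylpropyl]-L-valyl-(4S)-4-{[(3R)-3-amino-4-\n"
    "10 (2,4,5-trifluorophenyl)butanoyl]amino}-L-proline dilithium salt"
)
print(clean_extracted_name(sample))
```

The cleaned string can then be pasted into the name import. Note the limitation of this heuristic: ordinary text lines that start with a short number followed by a space would also lose that number, so it is only suitable for pages where every line carries a line number.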

Of the four names you extracted, I found that 5.7 indeed converts only the first one (and it does so even when the ditrifluoroacetate salt part is included). However, 5.8 does convert all four of them. Can you confirm that?

Best regards,

Daniel

surojit.sadhu

Joined: 13 Jan 2011
Posts: 18

Posted: Sat Dec 31, 2011 9:28 am

Hi Daniel,

 

Thanks for the information.

I have successfully converted the IUPAC names to structure using Marvin 5.8. :)

One more query:

If I have an IUPAC name as an image and I convert that image into a PDF, why doesn't it extract the structure?

dbonniot
ChemAxon personnel
Joined: 20 Mar 2006
Posts: 322

Posted: Sat Dec 31, 2011 9:38 am

surojit.sadhu wrote:

If I have an IUPAC name as an image and I convert that image into a PDF, why doesn't it extract the structure?

Could you attach the pdf?

surojit.sadhu

Joined: 13 Jan 2011
Posts: 18

Posted: Sat Dec 31, 2011 9:55 am

Please find attached the pdf.

If I try to extract the structure from this file, I only get the acetate salt; the rest of the structure is missing.




Attachment: Doc1.pdf (69.02 KB)
dbonniot
ChemAxon personnel
Joined: 20 Mar 2006
Posts: 322

Posted: Sat Dec 31, 2011 12:24 pm

There are OCR errors on this image. I improved the situation in the 5.9 branch.
