Non-English PDF Conversion to XML

proaudience
Groupie
Joined: 25 July 2002
Location: India
Posted: 18 June 2007 at 6:17pm
Hi,
I'm a novice as far as this subject is concerned. I'd like to know some basic information about converting PDF files into XML. Is it possible to convert a PDF file with non-English text into an XML file, and then filter the resulting file for specific words to run one's own analysis or queries?

Looking forward to some useful remarks. Many thanks!

proaudience
Posted: 18 June 2007 at 8:06pm
I have some information in my native language that is available online in PDF format. What I need to know is whether it could be converted to an XML file, and whether the resulting text (which will of course be non-English and in a different font) could then be filtered to pull out entries containing specific non-English words. Is this feasible in practice in my case?

KCWebMonkey
Senior Member
Joined: 21 June 2002
Posted: 19 June 2007 at 3:38am
Are you wanting to do this locally or online? I found a utility that runs locally and converts PDF to HTML or XML: http://pdftohtml.sourceforge.net/
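
For the filtering step asked about above, here is a minimal Python sketch. It assumes the converter's XML output has one page element per page with text elements inside it, which is the layout pdftohtml's XML mode is generally understood to produce. The sample XML, the Hindi words, and the function name find_matches are made up for illustration. It also assumes the text comes out of the PDF as real Unicode; a PDF that relies on a custom-encoded font may produce text that does not match the words you search for.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed output shape: <page> elements
# holding <text> elements, one per run of text on the page.
SAMPLE_XML = """<pdf2xml>
  <page number="1">
    <text top="100" left="50" font="0">नमस्ते दुनिया</text>
    <text top="120" left="50" font="0">यह एक <b>परीक्षण</b> है</text>
  </page>
  <page number="2">
    <text top="100" left="50" font="0">दुनिया बहुत सुंदर है</text>
  </page>
</pdf2xml>
"""


def normalize(text):
    # NFC normalization so that the same word typed with different
    # code point sequences still compares equal.
    return unicodedata.normalize("NFC", text)


def find_matches(xml_text, words):
    """Return (page number, line text, matched words) for each text
    element that contains at least one of the given words."""
    root = ET.fromstring(xml_text)
    wanted = [normalize(w) for w in words]
    matches = []
    for page in root.iter("page"):
        for node in page.iter("text"):
            # itertext() also collects text inside nested <b>/<i> tags.
            line = normalize("".join(node.itertext()))
            hits = [w for w in wanted if w in line]
            if hits:
                matches.append((int(page.get("number")), line, hits))
    return matches


if __name__ == "__main__":
    for page_no, line, hits in find_matches(SAMPLE_XML, ["दुनिया"]):
        print(page_no, line, hits)
```

Note that this is plain substring matching, so a search word will also match inside longer words. For a real file, ET.parse(path).getroot() could replace ET.fromstring.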

proaudience
Posted: 19 June 2007 at 3:07pm
Originally posted by KCWebMonkey:

Are you wanting to do this locally or online? I found a utility that runs locally and converts PDF to HTML or XML: http://pdftohtml.sourceforge.net/

Thanks, KCWebMonkey. Either locally or online will do. The problem is that extracting fonts from PDF files has become difficult since 2000, although people do describe workarounds in online forums. This question was put to me by somebody who needs loads of non-English text converted into XML files and then analyzed for specific words of that language. Since neither of us knows much about such tricky computer matters, I decided to post the question here for opinions.

The link you forwarded is good enough, but I'll first have to learn how to install it properly, since my knowledge stops at the usual exe and zip files that software comes with. LOL



Edited by proaudience - 19 June 2007 at 3:08pm

KCWebMonkey
Posted: 19 June 2007 at 3:26pm

Check the forum linked from that project. I saw a thread there that explains how to use it as a command line utility.
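
As a rough idea of what command line use could look like, here is a small Python sketch that calls the tool. It assumes pdftohtml is installed and on the PATH, that its -xml option switches the output to XML, and that the result is written to the output base name plus ".xml". The file names and the function names build_command and convert are hypothetical.

```python
import shutil
import subprocess


def build_command(pdf_path, out_base):
    # -xml asks pdftohtml for XML output instead of HTML.
    return ["pdftohtml", "-xml", pdf_path, out_base]


def convert(pdf_path, out_base):
    """Run pdftohtml and return the name of the XML file it is
    assumed to write (out_base + '.xml')."""
    if shutil.which("pdftohtml") is None:
        raise FileNotFoundError("pdftohtml was not found on PATH")
    # check=True raises CalledProcessError if the tool reports failure.
    subprocess.run(build_command(pdf_path, out_base), check=True)
    return out_base + ".xml"


if __name__ == "__main__":
    # Hypothetical file names; typing the same command at a prompt
    # would be: pdftohtml -xml report.pdf report
    print(" ".join(build_command("report.pdf", "report")))
```

Typing the printed command directly at a command prompt should do the same thing without Python.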
