Thursday 7 June 2012

Re: [dcphp-dev] counting words in .docx and pdf files

Another option would be to use Tika to extract the text and then count the words. Tika is a Java application, and you can invoke its JAR file directly; see http://tika.apache.org/0.7/gettingstarted.html for more details. Basically, you call it with the URL (or file path) of the document, and it outputs the extracted text.
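A rough sketch of what calling Tika from PHP might look like. It assumes Java is on the PATH and that the tika-app JAR has been downloaded somewhere; the JAR path is a placeholder, the function names are made up for illustration, and the --text option should be checked against the --help output of your Tika version.

```php
<?php
// Sketch only: assumes Java is installed and a tika-app JAR is available.
// The function names here are hypothetical, not part of Tika.

/**
 * Build the shell command that asks Tika for the plain text of a document.
 * $source may be a local file path or a URL.
 */
function tikaCommand($jarPath, $source)
{
    // escapeshellarg() keeps spaces and quotes in the arguments from
    // being interpreted by the shell.
    return 'java -jar ' . escapeshellarg($jarPath)
         . ' --text ' . escapeshellarg($source);
}

/**
 * Run Tika and count the words in the text it prints.
 * Returns null if the command produced no output.
 */
function countWordsWithTika($jarPath, $source)
{
    $text = shell_exec(tikaCommand($jarPath, $source));
    if ($text === null || $text === false) {
        return null;
    }
    return str_word_count($text);
}
```

Usage would be something like countWordsWithTika('/path/to/tika-app.jar', 'report.docx'). Starting a JVM for every document is slow, so this is probably only reasonable for occasional counts.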

On 6/5/12 9:40 AM, Graham Christensen wrote:
A .docx is just a zipped up set of XML documents, so if you unzip it and look at `word/document.xml` you might be able to parse out what you need.
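A minimal sketch of that unzip-and-parse idea, assuming PHP's zip extension (ZipArchive) is installed. The function names are made up for illustration, and stripping tags is a shortcut rather than a real XML parse.

```php
<?php
// Sketch only: assumes the zip extension (ZipArchive) is available.
// The function names here are hypothetical.

/**
 * Reduce the XML from word/document.xml to plain text.
 * A space is added before each paragraph end (</w:p>) so that adjacent
 * paragraphs do not run together once the tags are stripped.
 */
function docxXmlToText($xml)
{
    $xml = str_replace('</w:p>', ' </w:p>', $xml);
    return html_entity_decode(strip_tags($xml), ENT_QUOTES, 'UTF-8');
}

/**
 * Count the words in a .docx file.
 * Returns null if the file cannot be opened or has no word/document.xml.
 */
function countWordsInDocx($path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }
    return str_word_count(docxXmlToText($xml));
}
```

This only looks at the main document body, so headers, footers, and footnotes are not counted, and words separated only by a tab or line-break element may be merged into one.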

-- 
Graham Christensen

On Tuesday, June 5, 2012 at 9:24 AM, John Bloch wrote:

For the PDFs, you might look into using XPDF (http://www.foolabs.com/xpdf/). You'd have to compile xpdf for the environment it will run on, but then you could use system() or a similar function to run it and extract the text from the PDF. At that point you can use the code below to count the words. For a similar example of how to do this, take a look at the "PDF Indexer" extension for Joomla!.
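A sketch of the xpdf route, assuming the pdftotext binary that ships with xpdf is installed and on the PATH. It uses shell_exec() rather than system() because shell_exec() returns the whole output as a string; the function names are made up for illustration.

```php
<?php
// Sketch only: assumes xpdf's pdftotext binary is installed and on the PATH.
// The function names here are hypothetical.

/**
 * Build the pdftotext command for a PDF file.
 * The trailing "-" tells pdftotext to write the text to stdout
 * instead of creating a .txt file next to the PDF.
 */
function pdfToTextCommand($pdfPath)
{
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

/**
 * Extract the text of a PDF and count its words.
 * Returns null if pdftotext produced no output.
 */
function countWordsInPdf($pdfPath)
{
    $text = shell_exec(pdfToTextCommand($pdfPath));
    if ($text === null || $text === false) {
        return null;
    }
    return str_word_count($text);
}
```

Scanned PDFs that contain only images have no text layer, so pdftotext will return little or nothing for them.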

I'm not really sure about docx; I've never had to work with that format. Hopefully this helps get you halfway there, though.

-John

On Jun 5, 2012 8:38 AM, "vit srikanth" <vit.srikanth490@gmail.com> wrote:
counting words in .docx and pdf files?

This is the code for counting the words in .txt and .doc files.
But I want it to work for .docx and PDF files too.



<?php
$f = "document.txt";

// read the whole file into a string
$str = file_get_contents($f);
if ($str === false) {
    die("Could not read " . $f);
}

// count words
$numWords = str_word_count($str);
echo "This file has " . $numWords . " words";
?>


Thank you
Srikanth

--
You received this message because you are subscribed to the Google
Group: "Washington, DC PHP Developers Group" - http://www.dcphp.net
To post, send email to washington-dcphp-group@googlegroups.com
To unsubscribe, send email to washington-dcphp-group+unsubscribe@googlegroups.com
For more options, visit this group at http://groups.google.com/group/washington-dcphp-group?hl=en

-- 
William Hurley
Manager of Programming
Forum One Communications
http://forumone.com/
703.894.4346
