Thursday 7 June 2012

Re: [dcphp-dev] counting words in .docx and pdf files

I've used that Joomla plugin John mentioned and it has a counterpart for DOCs: http://extensions.joomla.org/extensions/search-a-indexing/site-search/2095

Both work fairly well and you should be able to steal functions from them.

As an aside, I've hacked at them to get them to recognize other file formats as well, just for the purpose of search indexing. If you were trying to parse the files into formatted articles, that would be a different story. I don't know how that would play into your need for counting words, though.

Anthony

ADG|CREATIVE

Anthony D Paul
Digital Strategist

7151 columbia gateway drive
suite b
columbia, md 21046
443.285.0008 x127
443.379.9930 (mobile)

The information contained in this transmittal is intended only for the personal and
confidential use of the designated recipient named above. Any attachments accompanying
this transmission contain information from adgcreative which is to be considered
confidential and/or privileged unless otherwise stated. The information is intended to be
for the individual(s) or entity(ies) named on this E-mail. If you are not the intended
recipient, be aware that any disclosure, copying, distribution or use of the contents
of this information is prohibited. If you receive this in error, please notify us by telephone
at (443) 285-0008 or via e-mail immediately and delete the original document. Thank you.

From: Graham Christensen <graham@grahamc.com>
To: John Bloch <johnpbloch@gmail.com>
Cc: vit srikanth <vit.srikanth490@gmail.com>, DCPHP Meetup <washington-dcphp-group@googlegroups.com>
Subject: Re: [dcphp-dev] counting words in .docx and pdf files

A .docx is just a zipped up set of XML documents, so if you unzip it and look at `word/document.xml` you might be able to parse out what you need.
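
A rough sketch of that idea, assuming PHP's zip extension (`ZipArchive`) is available. The function names `countDocxWords` and `countWordsInDocumentXml` are made up for illustration, and stripping tags is only an approximation of the document text: headers, footers, and footnotes live in other XML parts of the archive and are not counted here.

```php
<?php
// Hypothetical helper: count the words in the main body of a .docx file.
function countDocxWords(string $path): int
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }

    // The main body text of a .docx lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($xml === false) {
        throw new RuntimeException("word/document.xml not found in $path");
    }

    return countWordsInDocumentXml($xml);
}

// Hypothetical helper: strip the WordprocessingML markup and count what is left.
function countWordsInDocumentXml(string $xml): int
{
    // Insert a space at paragraph ends, tabs, and line breaks so that words
    // from adjacent paragraphs do not run together once the tags are removed.
    $xml = preg_replace('~</w:p>|<w:(?:tab|br)\b[^>]*/>~', ' ', $xml);

    // Drop the remaining tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');

    return str_word_count($text);
}
```

Usage would be something like `echo countDocxWords("document.docx");`. Note that `str_word_count()` only treats letters, apostrophes, and hyphens as word characters, so numbers are skipped and non-ASCII text may be split differently than a word processor would split it.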

-- 
Graham Christensen

On Tuesday, June 5, 2012 at 9:24 AM, John Bloch wrote:

For the PDFs, you might look into using XPDF (http://www.foolabs.com/xpdf/). You'd have to compile xpdf for the environment it will be running on, but then you could just use system() or any similar function to read the PDF. At that point you can use the code below. For a similar example of how to do this, take a look at the "PDF Indexer" extension for Joomla!.
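
Something along these lines might work as a starting point. It assumes the `pdftotext` tool that ships with Xpdf is installed and on the PATH, and it uses `shell_exec()` rather than `system()` so the output comes back as a string. The function names `pdftotextCommand` and `countPdfWords` are made up for illustration.

```php
<?php
// Hypothetical helper: build the pdftotext command line for one file.
// The trailing "-" asks pdftotext to write the text to stdout
// instead of creating a .txt file next to the PDF.
function pdftotextCommand(string $path): string
{
    return 'pdftotext ' . escapeshellarg($path) . ' -';
}

// Hypothetical helper: extract the text of a PDF and count its words.
function countPdfWords(string $path): int
{
    $text = shell_exec(pdftotextCommand($path));

    // shell_exec() does not return a string when the command fails
    // or produces no output at all.
    if (!is_string($text)) {
        throw new RuntimeException("pdftotext produced no output for $path");
    }

    return str_word_count($text);
}
```

Usage would be something like `echo countPdfWords("document.pdf");`. The extracted text is only as good as what `pdftotext` can recover, so scanned PDFs without a text layer will come back empty.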

I'm not really sure about docx; I've never had to work with that format. Hopefully this helps get you halfway there, though.

-John

On Jun 5, 2012 8:38 AM, "vit srikanth" <vit.srikanth490@gmail.com> wrote:
counting words in .docx and pdf files?

This is the code for counting the words in .txt and .doc files.
But I want to do the same for .docx and PDF files too.



<?php
       $f = "document.txt";

       // read the whole file into a string
       $str = file_get_contents($f);
       if ($str === false) {
              die("Could not read " . $f);
       }

       // count words
       $numWords = str_word_count($str);
       echo "This file has " . $numWords . " words";
?>


Thank you
Srikanth

--
You received this message because you are subscribed to the Google
Group: "Washington, DC PHP Developers Group" - http://www.dcphp.net
To post, send email to washington-dcphp-group@googlegroups.com
To unsubscribe, send email to washington-dcphp-group+unsubscribe@googlegroups.com
For more options, visit this group at http://groups.google.com/group/washington-dcphp-group?hl=en

