Count The Number Of Words In A PDF File


Answer:

Quick Answer:



pdftotext myfile.pdf - | wc -w


Long Answer:



If on Unix, you can use pdftotext:




  • http://linux.about.com/od/commands/l/blcmdl1_pdftote.htm



and then count the words in the generated text file. On Unix, you can use:



wc -w converted-pdf.txt


to get the word count.



Also, see the comment by frabjous: you can do it in one step by piping to stdout instead of writing to a temporary file:



pdftotext myfile.pdf - | wc -w
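
The one-step pipeline can also be wrapped in a small shell function that fails early when the input is missing. This is only a sketch: the name pdf_word_count is hypothetical, and it assumes pdftotext is installed and on the PATH.

```shell
# Hypothetical wrapper around the one-step pipeline; the function name is
# an illustration and not part of pdftotext itself.
pdf_word_count() {
    # Bail out if the converter is not installed.
    if ! command -v pdftotext >/dev/null 2>&1; then
        echo "pdftotext not found on PATH" >&2
        return 127
    fi
    # Bail out if the PDF is missing or unreadable; otherwise the
    # pipeline would still print a count of 0 from wc.
    if [ ! -r "$1" ]; then
        echo "cannot read: $1" >&2
        return 1
    fi
    # "-" as the output name sends the extracted text to stdout.
    pdftotext "$1" - | wc -w
}
```

Calling pdf_word_count myfile.pdf should then print the same number as the pipeline above.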


This task is harder than it looks. If you really want an exact result, copy the text paragraph by paragraph from your PDF viewer into a text file and count it with wc -w. The reason not to rely on pdftotext in that case is that mathematical formulas can end up in the output and be counted as "words". (Alternatively, you could edit the output you get from pdftotext.) Headings are another source of error: "4.3.2 Foo Bar" is counted as three words.
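
The heading problem is easy to reproduce without a PDF. A minimal sketch, with a shell variable standing in for one line of pdftotext output:

```shell
# Stand-in for a line of pdftotext output: a numbered section heading.
heading='4.3.2 Foo Bar'

# wc -w splits on whitespace only, so the section number "4.3.2" is
# counted as a word. tr strips the padding some wc versions add.
heading_words=$(printf '%s\n' "$heading" | wc -w | tr -d '[:space:]')

echo "$heading_words"   # prints 3
```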



A workaround is to count only words that start with a character in [A-Za-z]. So what I usually do is a two-step approach:




  1. Get the list of unique words and check whether it contains too many false positives:



    pdftotext foo.pdf - | tr " " "\n" | sort | uniq | grep "^[A-Za-z]" > words



    I don't use a dictionary here, because misspelled words would then not be counted as words.


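  2. Take this word list and grep for it in the output of pdftotext. Adding -x makes grep match whole lines only, so a short word in the list (such as "a") cannot also match inside a longer token:



    pdftotext foo.pdf - | tr " " "\n" | grep -Fxf words | wc -l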




I know this could be done as a one-liner, but then I could not easily inspect the filter result from the first step. The -F option may help, as suggested in the comment by moi below (thanks).
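
If inspecting the intermediate word list is not needed, the same filter can be sketched as a single pipeline stage. The helper name count_alpha_words is made up for illustration. It reads text on stdin, so it can be tried without a PDF, and it splits on any whitespace rather than only on spaces, which is an assumption about what should count as a separator.

```shell
# Hypothetical helper: counts whitespace-separated tokens on stdin that
# start with an ASCII letter.
count_alpha_words() {
    # Put every token on its own line, squeezing runs of whitespace,
    # then count the lines that begin with A-Z or a-z.
    # LC_ALL=C keeps the bracket range limited to ASCII letters.
    tr -s '[:space:]' '\n' | LC_ALL=C grep -c '^[A-Za-z]'
}

# Sample input: "4.3.2", "(1)" and "42" are skipped.
printf '%s\n' '4.3.2 Foo Bar' '(1) 42 baz' | count_alpha_words   # prints 3
```

With a real document this would be used as pdftotext foo.pdf - | count_alpha_words.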



I just tried out a free program, Translator's Abacus. You can drag and drop various file types (including PDF), and it opens a browser with a printable report of the word count for each document. It worked fine for me. (It is built specifically for word counts and is only 435 KB, so it is not a "big application".) Note that Translator's Abacus doesn't work on PDF 1.5 or later.



Alternatively, you can press Ctrl+A to select all text in Acrobat Reader and then copy and paste it into a program like Microsoft Word (which shows a word count on the status bar at the bottom of the screen).


