Saturday, August 24, 2013

Reading PDF files in R

The basic process is to download xpdf and add it to the Windows path,
thus making it accessible from within R. Two methods follow:

Method One (easiest) - using the awesome ?system command:

(1) Download xpdf (whichever is the latest version):
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
(2) Unzip it

# system(paste("[app]", "[pdf file]"), wait = FALSE)

> system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/tony/Desktop/test/r-intro.pdf"'), wait=FALSE)
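When no output file is named, pdftotext writes a .txt file next to the PDF with the same base name. The sketch below wraps the call above so the extracted text can be read straight back into R. The helper names txt_path_for and pdf_to_text are made up for illustration, and the default exe path is simply the one used above, so adjust it to wherever you unzipped xpdf.

```r
# Hypothetical helper: the path pdftotext is expected to write to,
# assuming its default of <name>.txt beside <name>.pdf.
txt_path_for <- function(pdf) {
  sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
}

# Hypothetical helper: convert one PDF and return its text as lines.
pdf_to_text <- function(pdf, exe = "C:/Program Files/xpdf/pdftotext.exe") {
  # shQuote() protects the spaces in both paths.
  cmd <- paste(shQuote(exe), shQuote(pdf))
  # wait = TRUE so the .txt file exists before we try to read it.
  status <- system(cmd, wait = TRUE)
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt_path_for(pdf), warn = FALSE)
}

# Example call (not run here):
# txt <- pdf_to_text("C:/Documents and Settings/tony/Desktop/test/r-intro.pdf")
```

Unlike the one-liner above, this uses wait = TRUE, because reading the text file only makes sense once the conversion has finished.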

Method Two - if you want to use the tm package like I did last year,
?readPDF requires the following (not documented anywhere that I know
of, but this is what you do):

(1) Download xpdf (whichever is the latest version):
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
(2) Unzip it
(3) Download the Redmond utility for adding files to your Windows path
(free version button is in the top left of the page):
http://redmondlab.googlepages.com/path
(4) Unzip it
(5) Open the 'Redmond Path' application.
(6) Click on the green plus in the top left hand corner '+'.
(7) Navigate to the folder which contains the files: C:/../xpdf-3.02pl4-win32
(8) Add it and click Ok.
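
Before moving on, it may be worth checking from R that the path change actually took. A minimal sketch using base R's Sys.which (the helper name on_path is made up for illustration):

```r
# Hypothetical helper: TRUE when an executable called `tool`
# can be found on the search path that R sees.
on_path <- function(tool) {
  unname(nzchar(Sys.which(tool)))
}

if (on_path("pdftotext")) {
  message("pdftotext found at: ", Sys.which("pdftotext"))
} else {
  message("pdftotext not found; check the path entry and restart R")
}
```

R reads the path when it starts, so a session that was already open when you edited the path will most likely need restarting before the change shows up.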

Then you can do something like:

> library(tm)
> my.path <- "C:/Documents and Settings/tony/Desktop/<put your pdfs in here>"
> Corpus(DirSource(my.path), readerControl = list(reader=readPDF))

There are some limitations to how well the conversions work depending
on the pdf file, but it was so long ago now that I'm afraid I don't
remember the details.
