R - Error trying to read a PDF using readPDF from the tm package


(Windows 7 / R version 3.0.1)

Below are the commands and the resulting error:

    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'c:\users\raffael\appdata\local\temp\rtmps8uql1\pdfinfo167c2bc159f8': No such file or directory

How can I solve this issue?


Edit I

(as suggested by Ben and described here)

I downloaded Xpdf and copied the 32-bit version to c:\program files (x86)\xpdf32 and the 64-bit version to c:\program files\xpdf64.

The environment variables pdfinfo and pdftotext refer to the respective executables, either 32-bit (tested with 32-bit R) or 64-bit (tested with 64-bit R).


Edit II

One confusing observation: starting from a fresh session (tm not loaded), the last command alone already produces the error:

    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'c:\users\raffael\appdata\local\temp\rtmpki5gnl\pdfinfode8283c422f': No such file or directory

I don't understand this at all, because the function variable has not been defined by tm.readPDF yet. Below you'll find the function that pdf refers to "naturally" and the one returned by tm.readPDF:

    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
            "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate,
            meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0674bd8c>

    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri),
            "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate,
            meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0c3d7364>

Apparently there is no difference - so why use readPDF at all?
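One detail visible in the two printouts: the bodies are identical, but the <environment> lines differ. That is the usual sign of a function factory, where the returned function reads PdftotextOptions from the environment in which it was created rather than from its own body. A minimal sketch of the pattern, where make_reader() and opts are hypothetical names used only for illustration and are not part of tm:

```r
# make_reader() is a hypothetical stand-in for a generator such as readPDF().
make_reader <- function(opts = NULL) {
  function(uri) {
    # 'opts' is not defined in this body; it is found in the enclosing
    # environment that was created by the call to make_reader().
    c(opts, uri, "-")
  }
}

plain  <- make_reader()
layout <- make_reader(opts = "-layout")

# Both functions print with the same body, yet they build different
# argument vectors because each one captured its own 'opts'.
plain_args  <- plain("17214.pdf")
layout_args <- layout("17214.pdf")
captured    <- get("opts", envir = environment(layout))
```

So the generator matters only for the options it captures; with no options the returned function behaves the same every time.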


Edit III

The PDF file is located here: c:\users\raffael\documents

    > getwd()
    [1] "c:/users/raffael/documents"

Edit IV

The first instruction in the pdf() call is tm:::pdfinfo(), and the error is caused there, within its first few lines:

    > outfile <- tempfile("pdfinfo")
    > on.exit(unlink(outfile))
    > status <- system2("pdfinfo", shQuote(normalizePath("c:/users/raffael/documents/17214.pdf")),
    +                   stdout = outfile)
    > tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
    +           "Producer", "CreationDate", "ModDate", "Tagged", "Form",
    +           "Pages", "Encrypted", "Page size", "File size", "Optimized",
    +           "PDF version")
    > re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
    +                                                       tags)), collapse = "|"))
    > lines <- readLines(outfile, warn = FALSE)
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'c:\users\raffael\appdata\local\temp\rtmpquryx6\pdfinfo8d419174450':
      No such file or direc

Apparently tempfile() doesn't create a file.

    > outfile <- tempfile("pdfinfo")
    > outfile
    [1] "c:\\users\\raffael\\appdata\\local\\temp\\rtmpquryx6\\pdfinfo8d437bd65d9"

The folder c:\users\raffael\appdata\local\temp\rtmpquryx6 exists and holds some files, but none named pdfinfo8d437bd65d9.
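For what it is worth, tempfile() only returns a unique path and never creates the file. In the lines above the file is expected to appear when system2() redirects the output of pdfinfo into it, so a missing file suggests that pdfinfo itself never ran. system2() looks the command up on the PATH, and as far as I know an environment variable that happens to be named pdfinfo is not consulted. A minimal diagnostic sketch in base R (the variable names are only illustrative):

```r
# tempfile() only generates a unique path; nothing is written to disk yet.
outfile <- tempfile("pdfinfo")
created_by_tempfile <- file.exists(outfile)   # FALSE

# Sys.which() returns "" for every command it cannot find on the PATH.
found <- Sys.which(c("pdfinfo", "pdftotext"))
missing_tools <- names(found)[!nzchar(found)]

if (length(missing_tools) > 0) {
  message("Not found on the PATH: ", paste(missing_tools, collapse = ", "))
}
```

If both tools are reported as missing, adding the Xpdf folder to the PATH (rather than defining separate variables) would be the first thing to try.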

Interesting, on my machine after a fresh start pdf is a function that converts an image to a PDF:

    getAnywhere(pdf)
    A single object matching ‘pdf’ was found
    It was found in the following places
      package:grDevices
      namespace:grDevices
    [etc.]

But back to your problem of reading PDF files in as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes here.

In your case it would be (note the two sets of quotes):

    system(paste('"c:/program files/xpdf64/pdftotext.exe"',
                 '"c:/users/raffael/documents/17214.pdf"'), wait = FALSE)

This could be extended with an *apply function or a loop if you have many PDF files.
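A rough sketch of that extension, assuming the same install location and folder as above; build_cmd() is a hypothetical helper, and shQuote(type = "cmd") is used here to supply the inner set of double quotes:

```r
# Paths taken from the example above; adjust them for your machine.
pdftotext_exe <- "c:/program files/xpdf64/pdftotext.exe"
pdf_dir       <- "c:/users/raffael/documents"

# All PDF files in the folder (empty if the folder does not exist).
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                        full.names = TRUE, ignore.case = TRUE)

# Hypothetical helper: wrap both paths in double quotes so that
# spaces in "program files" survive the trip through the shell.
build_cmd <- function(pdf) {
  paste(shQuote(pdftotext_exe, type = "cmd"), shQuote(pdf, type = "cmd"))
}

# Convert one file after another, but only if the executable exists.
if (file.exists(pdftotext_exe)) {
  invisible(lapply(pdf_files, function(pdf) system(build_cmd(pdf), wait = TRUE)))
}
```

Here wait = TRUE is a deliberate choice so that the conversions run one at a time; with no output name given, pdftotext should write each .txt file next to its PDF.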

