R - Error trying to read a PDF using readPDF from the tm package
(Windows 7 / R version 3.0.1)
The commands below result in the following error:
    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'c:\users\raffael\appdata\local\temp\rtmps8uql1\pdfinfo167c2bc159f8': No such file or directory
How can I solve this issue?
Edit I
(As suggested by Ben and described here.)
I downloaded Xpdf and copied the 32-bit version to c:\program files (x86)\xpdf32 and the 64-bit version to c:\program files\xpdf64.
The environment variables pdfinfo and pdftotext refer to the respective executables, either 32-bit (tested with 32-bit R) or 64-bit (tested with 64-bit R).
Edit II
One very confusing observation: starting from a fresh session (tm not loaded), the last command alone already produces the error:
    > dat <- pdf(elem = list(uri = "17214.pdf"), language="de", id="id1")
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'c:\users\raffael\appdata\local\temp\rtmpki5gnl\pdfinfode8283c422f': No such file or directory
I don't understand this at all, because the function variable has not yet been defined by readPDF from tm. Below you'll find the function that pdf refers to "naturally", followed by the one returned by readPDF:
    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri), "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate, meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0674bd8c>
    > library(tm)
    > pdf <- readPDF(PdftotextOptions = "-layout")
    > pdf
    function (elem, language, id)
    {
        meta <- tm:::pdfinfo(elem$uri)
        content <- system2("pdftotext", c(PdftotextOptions, shQuote(elem$uri), "-"), stdout = TRUE)
        PlainTextDocument(content, meta$Author, meta$CreationDate, meta$Subject, meta$Title, id, meta$Creator, language)
    }
    <environment: 0x0c3d7364>
Apparently there is no difference, so why use readPDF at all?
Edit III
The PDF file is located here: c:\users\raffael\documents
    > getwd()
    [1] "c:/users/raffael/documents"
Edit IV
The first instruction in pdf() is a call to tm:::pdfinfo(), and the error is raised there, within the first few lines:
    > outfile <- tempfile("pdfinfo")
    > on.exit(unlink(outfile))
    > status <- system2("pdfinfo", shQuote(normalizePath("c:/users/raffael/documents/17214.pdf")),
    +     stdout = outfile)
    > tags <- c("Title", "Subject", "Keywords", "Author", "Creator",
    +     "Producer", "CreationDate", "ModDate", "Tagged", "Form",
    +     "Pages", "Encrypted", "Page size", "File size", "Optimized",
    +     "PDF version")
    > re <- sprintf("^(%s)", paste(sprintf("%-16s", sprintf("%s:",
    +     tags)), collapse = "|"))
    > lines <- readLines(outfile, warn = FALSE)
    Error in file(con, "r") : cannot open the connection
    In addition: Warning message:
    In file(con, "r") :
      cannot open file 'c:\users\raffael\appdata\local\temp\rtmpquryx6\pdfinfo8d419174450': No such file or directory
Apparently tempfile() doesn't create a file.
    > outfile <- tempfile("pdfinfo")
    > outfile
    [1] "c:\\users\\raffael\\appdata\\local\\temp\\rtmpquryx6\\pdfinfo8d437bd65d9"
The folder c:\users\raffael\appdata\local\temp\rtmpquryx6 exists and holds some files, but none of them is named pdfinfo8d437bd65d9.
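A note on the observation above: tempfile() only generates a file name that is not yet in use; it never creates the file itself. In the tm:::pdfinfo() code shown above, the file would be written by system2() through stdout = outfile, so a missing file suggests that the pdfinfo command never ran. The following is a minimal sketch for checking both points, under the assumption (not confirmed in this thread) that the executables are simply not found on the PATH:

```r
# tempfile() returns a path that is not yet in use; nothing is created on disk.
outfile <- tempfile("pdfinfo")
file.exists(outfile)   # FALSE until something writes to that path

# Sys.which() returns "" for commands that cannot be found on the PATH.
found <- Sys.which(c("pdfinfo", "pdftotext"))
print(found)

# Report any executable that R cannot locate.
not_found <- names(found)[!nzchar(found)]
if (length(not_found) > 0) {
  message("Not on PATH: ", paste(not_found, collapse = ", "))
}
```

If either entry comes back empty, system2("pdfinfo", ...) has nothing to execute, which would be consistent with the temporary file never appearing.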
Interesting: on my machine, after a fresh start, pdf is a function to convert an image to a PDF:
    getAnywhere(pdf)
    A single object matching ‘pdf’ was found
    It was found in the following places
      package:grDevices
      namespace:grDevices
    [etc.]
But back to your problem of reading PDF files in as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
    system(paste('"c:/program files/xpdf64/pdftotext.exe"', '"c:/users/raffael/documents/17214.pdf"'), wait=FALSE)
This could be extended with an *apply function or a loop if you have many PDF files.
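As a rough sketch of that extension, the snippet below runs pdftotext once for every PDF in a folder. The two paths are taken from the example above and are assumptions about the machine; build_cmd() is a hypothetical helper written for this illustration, not part of tm or Xpdf.

```r
# Assumed locations (from the example above); adjust to your machine.
pdftotext_exe <- "c:/program files/xpdf64/pdftotext.exe"
pdf_dir <- "c:/users/raffael/documents"

# Hypothetical helper: build the command line with both paths wrapped
# in double quotes, so that spaces in either path are handled.
build_cmd <- function(exe, pdf_file) {
  paste(shQuote(exe, type = "cmd"), shQuote(pdf_file, type = "cmd"))
}

# Collect every PDF in the folder (an empty vector if there are none).
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE,
                        ignore.case = TRUE)

# Run pdftotext once per file. Without an output argument, pdftotext
# writes a .txt file next to each PDF. wait = TRUE makes each conversion
# finish before the next one starts.
status <- vapply(pdf_files, function(f) {
  system(build_cmd(pdftotext_exe, f), wait = TRUE)
}, integer(1))
```

A non-zero entry in status points to a file that pdftotext could not convert. The resulting .txt files can then be read with readLines().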