r - How to implement proximity rules in a tm dictionary for counting words?
Objective
I would like to count the number of times the word "love" appears in documents, but only if it isn't preceded by the word "not", e.g. "I love films" would count as one appearance, whilst "I do not love films" would not count as an appearance.
Question
How would one proceed using the tm package?
R code
Below is some self-contained code which I would like to modify to do the above.
require(tm)

# text vector
my.docs <- c(" I love the red hot chilli peppers! Lovely people in the world.",
             "I do not love the red hot chilli peppers but I do not hate them either. I think they are ok.\n",
             "I hate the `red hot chilli peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs,
                         row.names = c("positivetext", "neutraltext", "negativetext"),
                         stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary
my.dictionary.terms <- tolower(c("love", "hate"))
my.dictionary <- Dictionary(my.dictionary.terms)

# construct term document matrix
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary))
inspect(my.tdm)
#       Docs
# Terms  positivetext neutraltext negativetext
#   hate            0           1            1
#   love            2           1            0
Further information
I am trying to reproduce the dictionary rules functionality of the commercial package WordStat. It is able to make use of dictionary rules, i.e.
"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases proximity rules (such near, after, before) achieving precise measurement of concepts"
I also noticed this interesting question: Open-source rule-based pattern matching / information extraction frameworks?
Update 1: Based on @Ben's comment and post I got this (although slightly different at the end, it was fully inspired by his answer, so full credit to him).
require(data.table)
require(RWeka)

# bi-gram tokeniser function
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))

# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames = TRUE)
setkey(dt, rn)

# attempt at extracting, but it includes overlaps, i.e. words are counted twice
dt[like(rn, "love")]
#            rn positivetext neutraltext negativetext
# 1:       love            1           0            0
# 2:       love            2           1            0
# 3: love peopl            1           0            0
# 4:       love            1           1            0
# 5:       love            1           0            0
# 6:   not love            0           1            0
Then I guess I need some row sub-setting and row subtraction, which would lead to something like
dt1 <- dt["love"]
#      rn positivetext neutraltext negativetext
# 1: love            2           1            0

dt2 <- dt[like(rn, "love") & like(rn, "not")]
#          rn positivetext neutraltext negativetext
# 1: not love            0           1            0

# somehow do something like
# dt = dt1 - dt2
# (can't work out how to code that), which would require the output
#      rn positivetext neutraltext negativetext
# 1: love            2           0            0
I don't know how to get that last line using the data.table approach, but it would be akin to WordStat's 'NOT NEAR' dictionary function, e.g. in this case only count the word "love" if it doesn't appear within 1 word either directly before or directly after the word "not".
If we were to use an m-gram tokeniser, that would be like saying only count the word "love" if it doesn't appear within (m-1) words on either side of the word "not".
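For concreteness, the kind of subtraction I have in mind would look roughly like the sketch below (assuming dt1 and dt2 from above, each with a single row, and the lowercase document names used throughout); I am not at all sure this is correct or idiomatic data.table code, which is part of what I am asking:

# rough sketch only: subtract the 'not love' counts from the 'love' counts
cols <- c("positivetext", "neutraltext", "negativetext")
dt3 <- data.table(rn = "love",
                  as.data.frame(dt1[, cols, with = FALSE]) -
                  as.data.frame(dt2[, cols, with = FALSE]))
dt3
#      rn positivetext neutraltext negativetext
# 1: love            2           0            0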
Other approaches are most welcome!
Answer
This is an interesting question about collocation extraction, which doesn't seem to be built into any packages (except this one, though it's not on CRAN or GitHub), despite how popular it is in corpus linguistics. I think this code will answer your question, but there might be a more general solution than this.
Here's your example (thanks for the easy-to-use example):
##############
require(tm)

# text vector
my.docs <- c(" I love the red hot chilli peppers! Lovely people in the world.",
             "I do not `love` the red hot chilli peppers but I do not hate them either. I think they are ok.\n",
             "I hate the `red hot chilli peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs,
                         row.names = c("positivetext", "neutraltext", "negativetext"),
                         stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
# 'not' is a stopword, so let's not remove stopwords
# my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary - not used in this case
# my.dictionary.terms <- tolower(c("love", "hate"))
# my.dictionary <- Dictionary(my.dictionary.terms)
Here's my suggestion: make a document-term matrix of bigrams and subset them.
# tokenizer for n-grams, passed on to the term-document matrix constructor
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
inspect(txtTdmBi)

# find bigrams that have 'love' in them
love_bigrams <- txtTdmBi$dimnames$Terms[grep("love", txtTdmBi$dimnames$Terms)]

# keep only bigrams where 'love' is not the first word,
# to avoid counting 'love' twice, and so we can subset
# based on the preceding word
require(Hmisc)
love_bigrams <- love_bigrams[sapply(love_bigrams, function(i) first.word(i)) != 'love']

# exclude the specific bigram 'not love'
love_bigrams <- love_bigrams[!love_bigrams == 'not love']
And here's the result, a count of 2 for 'love', which excludes the 'not love' bigram.
# inspect the results
inspect(txtTdmBi[love_bigrams])
# A term-document matrix (2 terms, 3 documents)
#
# Non-/sparse entries: 2/4
# Sparsity           : 67%
# Maximal term length: 9
# Weighting          : term frequency (tf)
#
#        Docs
# Terms   positivetext neutraltext negativetext
#   love             1           0            0
#   love             1           0            0

# counts of 'love' (excluding 'not love')
colSums(as.matrix(txtTdmBi[love_bigrams]))
# positivetext  neutraltext negativetext
#            2            0            0
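If you end up needing the more general rule from your question (only count 'love' when 'not' does not appear within (m-1) words on either side), here is a rough sketch of one direction that works on the raw tokenised text instead of tm's term-document matrix. count_not_near() is a hypothetical helper, not from any package, and it does no stemming, so treat it only as a starting point:

# rough sketch of a window-based 'NOT NEAR' count, independent of tm:
# count occurrences of `target` that have no `blocker` within `k` words
# on either side (k = 1 corresponds to the bigram-style rule above);
# note there is no stemming, so 'lovely' would not be counted here
count_not_near <- function(text, target = "love", blocker = "not", k = 1) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+"))
  words <- words[words != ""]
  hits  <- which(words == target)
  ok <- vapply(hits, function(i) {
    window <- words[max(1, i - k):min(length(words), i + k)]
    !(blocker %in% window)
  }, logical(1))
  sum(ok)
}

sapply(my.docs, count_not_near, k = 1, USE.NAMES = FALSE)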