Saturday, March 7, 2015

Computing text conditional entropy with uni- and bi-grams in R and Python

During my first semester of PhD study I implemented a solution for computing the conditional entropy of a text in which each word (including punctuation) is on a separate line. I would like to share this example because it is simple and good for getting the basics (information theory, R and Python programming).

Conditional entropy is defined as follows (source: Wikipedia):

$$
\begin{aligned}
H(Y|X) &\equiv \sum_{x \in X} p(x)\, H(Y|X=x) \\
       &= -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x) \\
       &= -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 p(y|x) \\
       &= -\sum_{x,y} p(x,y) \log_2 p(y|x) \\
       &= -\sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)}
\end{aligned}
$$
Just note that this computation determines the conditional entropy of the word distribution in a text given the previous word.

According to line 4 of the formula above, I have to compute p(x,y), which is the probability that at any position in the text you will find the word x followed immediately by the word y, and p(y|x), which is the probability that if the word x occurs in the text, then the word y will follow.

Or, according to line 5, I can use the probability p(x,y) twice and calculate p(x), which is the probability of a single word appearing in the text.
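Both implementations below estimate these probabilities with simple maximum-likelihood counts, where c(·) is a count over the text and N is the number of tokens (the bi-gram count is divided by N as well, ignoring the off-by-one at the text boundary):

$$
\hat{p}(x) = \frac{c(x)}{N}, \qquad \hat{p}(x,y) = \frac{c(x,y)}{N}, \qquad \hat{p}(y|x) = \frac{\hat{p}(x,y)}{\hat{p}(x)} = \frac{c(x,y)}{c(x)}
$$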


Uni- and bi-gram

Before I show the implementation, let me take a short detour and focus on a necessary part of the computation: the creation of uni- and bi-grams. For a text w_1 w_2 … w_N, the uni-grams are the single tokens w_i and the bi-grams are the pairs of adjacent tokens (w_{i-1}, w_i); a small sketch follows the list below.

So, according to the previous definition, I need to calculate the following probabilities on the uni- and bi-grams:
  • the probability p(x) equals p(w_i),
  • the joint probability p(x,y) is then p(w_{i-1}, w_i),
  • and the conditional probability p(y|x) is p(w_i | w_{i-1}).
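To make this concrete, here is a minimal Python sketch (the sentence is invented for illustration, and the text is assumed to be already tokenized into a list of tokens):

text = ["the", "cat", "sat", "on", "the", "mat", "."]

# uni-grams: the single tokens w_i
unigrams = text
# bi-grams: pairs of adjacent tokens (w_{i-1}, w_i)
bigrams = list(zip(text, text[1:]))

print(bigrams[:3])  # [('the', 'cat'), ('cat', 'sat'), ('sat', 'on')]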

Implementation

I decided to solve it in the R language first and later implemented the same in Python. The reason is simple: R does not perform well when processing large amounts of data.

R language

I wrote the following function, which creates the uni-grams and bi-grams from the source data, calculates the word and pair frequencies, and after merging them computes the final entropy.

condentropy <- function(data){
  len_data <- dim(data)[1]
  
  # uni-gram counts: how many times each word occurs
  unigram <- data.frame(w1 = data[1:len_data, ], cnt = 1)
  unigram_agg <- aggregate(cnt ~ w1, data = unigram, FUN = length)
  
  # bi-gram counts: how many times each adjacent pair (w1, w2) occurs
  bigram <- data.frame(w1 = data[1:(len_data - 1), ], w2 = data[2:len_data, ], cnt = 1)
  bigram_agg <- aggregate(cnt ~ w1 + w2, data = bigram, FUN = length)
  
  # attach to every bi-gram the count of its first word w1,
  # the conditioning word in p(w2 | w1)
  ub <- merge(unigram_agg, bigram_agg, by = "w1", all = TRUE)
  
  # per-pair contribution -p(w1,w2) * log2(p(w1,w2) / p(w1));
  # rows with NA counts are dropped later by aggregate()
  h_rec <- data.frame(w1 = ub$w1, w2 = ub$w2,
                      uni.cnt = ub$cnt.x, bi.cnt = ub$cnt.y,
                      h = -((ub$cnt.y / len_data) * log2((ub$cnt.y / len_data) / (ub$cnt.x / len_data))),
                      id = 1)
  h <- aggregate(h ~ id, data = h_rec, FUN = sum)[2]
  
  return(h)
}

Python

In Python I separated the task into a couple of steps: first I created a function for n-gram creation, then I calculate the frequencies in a separate function, and finally the conditional entropy:

import math

def ngram_list(text, n):
    # all n-token windows over the text (a list of tokens)
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, n):
    # frequency of every n-gram, keyed by the space-joined tokens
    ngram_d = {}
    for item in ngram_list(text, n):
        key = ' '.join(item)
        ngram_d[key] = ngram_d.get(key, 0) + 1
    return ngram_d

def condentropy(data):
    uni_gram = ngram_counts(data, 1)
    bi_gram = ngram_counts(data, 2)
    N = sum(uni_gram.values())
    H = 0.0

    # H = -sum over pairs of p(w1,w2) * log2(p(w1,w2) / p(w1)),
    # where w1 is the conditioning (previous) word of the bi-gram
    for key, cnt in bi_gram.items():
        w1 = key.split(' ')[0]
        H -= cnt / (1.0 * N) * math.log(cnt / (1.0 * uni_gram[w1]), 2)

    return H
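
A quick usage sketch (the file name is invented; the real input has one token, including punctuation, per line):

# hypothetical input file with one token per line
with open("text.txt", encoding="utf-8") as f:
    tokens = [line.rstrip("\n") for line in f]

print(condentropy(tokens))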

So, the implementation is perhaps more straightforward in R, but it is neither as flexible (in Python I have universal methods for n-grams, which I would have to implement separately in R) nor as efficient (merging uni-, bi- and tri-grams on bigger data consumes an enormous amount of memory) as in Python.

And finally, I can imagine that the code could be written much more efficiently with more experience in R or Python, so comments are welcome!
