Conditional entropy is defined as follows (source: Wikipedia):
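$$
\begin{aligned}
H(Y \mid X) &= \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) \\
&= -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y \mid x) \log_2 p(y \mid x) \\
&= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 p(y \mid x) \\
&= -\sum_{x, y} p(x, y) \log_2 p(y \mid x) \\
&= -\sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)}
\end{aligned}
$$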
Just note that this computation determines the conditional entropy of the word distribution in a text given the previous word.
According to line 4 of the formula above, I have to compute p(x,y), which is the probability that at any position in the text you will find the word x followed immediately by the word y, and p(y|x), which is the probability that if the word x occurs in the text then the word y will follow.
Or, according to line 5 of the formula, I can use the probability p(x,y) twice and calculate p(x), which is the probability of a single word's appearance in the text.
Uni- and bi-gram
Before I show the implementation, let me take a short detour and focus on a necessary part of the computation: the creation of uni- and bi-grams. For a text w_1 w_2 ... w_N, the uni-grams are the single words (w_i) and the bi-grams are the pairs of neighbouring words (w_{i-1}, w_i).

So, according to the previous definition, I need to calculate the following probabilities on the uni- and bi-grams (the count estimates follow the list):
- the probability p(x) equals p(w_i),
- the joint probability p(x,y) is p(w_{i-1}, w_i),
- and the conditional probability p(y|x) is p(w_i | w_{i-1}).
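All three can be estimated directly from the uni- and bi-gram counts by maximum likelihood, which is exactly what both implementations below do:

$$
p(w_i) = \frac{c(w_i)}{N}, \qquad
p(w_{i-1}, w_i) = \frac{c(w_{i-1}, w_i)}{N}, \qquad
p(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})},
$$

where $c(\cdot)$ is the number of occurrences of the uni- or bi-gram in the text and $N$ is the total number of words.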
Implementation
I decided to solve it in the R language and later implemented the same in Python. The reason is simple: R does not perform well when processing large amounts of data.

R language
I wrote the following function, which handles the creation of the uni-grams and bi-grams from the source data, calculates the word and pair frequencies, and after merging them computes the final entropy.
```r
condentropy <- function(data) {
  len_data <- dim(data)[1]
  # uni-gram counts: how many times each word occurs in the text
  unigram <- data.frame(w1 = data[1:len_data, ], cnt = 1)
  unigram_agg <- aggregate(cnt ~ w1, data = unigram, FUN = length)
  # bi-gram counts: how many times each pair of neighbouring words occurs
  bigram <- data.frame(w1 = data[1:(len_data - 1), ],
                       w2 = data[2:len_data, ],
                       cnt = 1)
  bigram_agg <- aggregate(cnt ~ w1 + w2, data = bigram, FUN = length)
  # attach to every bi-gram the count of its first word w1,
  # because p(w2|w1) = p(w1,w2) / p(w1)
  ub <- merge(unigram_agg, bigram_agg, by = "w1", all.y = TRUE)
  # per-record term: -p(w1,w2) * log2(p(w1,w2) / p(w1))
  h_rec <- data.frame(w1 = ub$w1, w2 = ub$w2,
                      uni.cnt = ub$cnt.x, bi.cnt = ub$cnt.y,
                      h = -((ub$cnt.y / len_data) *
                              log2((ub$cnt.y / len_data) / (ub$cnt.x / len_data))),
                      id = 1)
  # sum all per-record terms into the final conditional entropy
  h <- aggregate(h ~ id, data = h_rec, FUN = sum)[2]
  return(h)
}
```
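For completeness, a minimal usage sketch; the file name is hypothetical and the input is assumed to be a plain-text file with one token per line:

```r
# hypothetical input file: one token per line
data <- read.table("text.txt", header = FALSE, quote = "",
                   comment.char = "", stringsAsFactors = FALSE)
condentropy(data)
```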
Python
In Python I have separated the task into a couple of steps: first I created a function for n-gram creation, then I calculate the frequencies in a separate function, and finally the conditional entropy:
```python
import math

def ngram_list(text, n):
    # slide a window of length n over the text (a list of tokens)
    ngram = []
    count = 0
    for token in text[:len(text) - n + 1]:
        ngram.append(text[count:count + n])
        count = count + 1
    return ngram

def ngram_counts(text, n):
    # count the occurrences of every n-gram, keyed by the space-joined tokens
    ngram_d = {}
    ngram_l = ngram_list(text, n)
    for item in ngram_l:
        key = ' '.join(item)
        ngram_d[key] = (ngram_d[key] + 1) if key in ngram_d else 1
    return ngram_d

def condentropy(data):
    uni_gram = ngram_counts(data, 1)
    bi_gram = ngram_counts(data, 2)
    N = sum(uni_gram.values())
    H = 0
    for key in bi_gram.keys():
        # p(w1,w2) = c(w1,w2) / N and p(w2|w1) = c(w1,w2) / c(w1),
        # so the bi-gram count is divided by the count of its first word
        H -= (bi_gram[key] / (1.0 * N)
              * math.log(bi_gram[key] / (1.0 * uni_gram[key.split(' ')[0]]), 2))
    return H
```
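A quick sanity check on a toy token list (the sentence is only an illustration):

```python
tokens = "the cat sat on the mat the cat".split()
print(ngram_counts(tokens, 2))  # {'the cat': 2, 'cat sat': 1, ...}
print(condentropy(tokens))      # conditional entropy in bits
```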
So, the implementation is maybe more straightforward in R, but it is neither as flexible (in Python I have universal methods for n-grams, which I would need to implement in R) nor as efficient (merging uni-, bi- and tri-grams on bigger data consumes an enormous amount of memory) as in Python.
And finally, I can imagine that the code could be written much more effectively with more experience in R or Python, so comments are welcome!