Saturday, March 7, 2015

Computing text conditional entropy with uni- and bi-grams in R and Python

During the first semester of my PhD studies I implemented a solution for computing conditional entropy over a text in which each word (including punctuation) is on a separate line. I would like to share this example because it is simple and a good way to pick up the basics (information theory, R and Python programming).

Conditional entropy is defined as (source is Wikipedia):
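
Written out as a chain of equalities, the definition reads as follows. This is a reconstruction of the standard derivation, with X the previous word and Y the word that follows it; the line order is an assumption, chosen so that line 4 is the one using p(y|x) and line 5 the one using p(x).

```latex
\begin{align}
H(Y|X) &\equiv \sum_{x} p(x)\, H(Y|X=x) \\
       &= -\sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x) \\
       &= -\sum_{x} \sum_{y} p(x,y) \log p(y|x) \\
       &= -\sum_{x,y} p(x,y) \log p(y|x) \\
       &= -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)} \\
       &= \sum_{x,y} p(x,y) \log \frac{p(x)}{p(x,y)}
\end{align}
```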


Note that this computation determines the conditional entropy of the word distribution in a text given the previous word.

According to line 4 of the formula above, I have to compute p(x,y), which is the probability that at any position in the text you will find the word x followed immediately by the word y, and p(y|x), which is the probability that if the word x occurs in the text, the word y will follow.

Alternatively, according to line 5 of the formula, I can use the probability p(x,y) twice and calculate p(x), which is the probability of a single word appearing in the text.
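
To make the second variant concrete, here is a minimal Python sketch that estimates p(x,y) and p(x) from bigram and unigram counts and sums -p(x,y) log2(p(x,y)/p(x)). The function name `conditional_entropy` and the toy input are only illustrative, not the original implementation. It also assumes that p(x) is counted over words that have a successor (the last word is not used as a context), so that p(x) is the marginal of p(x,y).

```python
import math
from collections import Counter


def conditional_entropy(words):
    """Estimate H(Y|X) in bits, where X is a word and Y is the word after it."""
    # Bigram counts c(x, y): word x immediately followed by word y.
    bigrams = Counter(zip(words, words[1:]))
    # Counts c(x) of x as the first word of a bigram (assumption: the last
    # word of the text has no successor and is not counted as a context).
    contexts = Counter(words[:-1])
    total = sum(bigrams.values())

    entropy = 0.0
    for (x, _y), count in bigrams.items():
        p_xy = count / total       # joint probability p(x, y)
        p_x = contexts[x] / total  # marginal probability p(x)
        entropy -= p_xy * math.log2(p_xy / p_x)
    return entropy


# Toy input in the one-token-per-line format described above.
text = "a\nb\na\nc\na\nb\n"
words = text.split()
print(conditional_entropy(words))
```

For this toy sequence the result is about 0.551 bits: after "a" the next word is "b" twice and "c" once, while "b" and "c" are always followed by "a" and contribute nothing.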

Tuesday, March 3, 2015

Course: Data Mining Specialization on Coursera

When Specializations started on Coursera last year, I joined the Data Science Specialization and was really keen to do as much as I could. I have completed 3 courses in a row so far, but then I stopped, partly because of my own PhD work and partly because I was a little disappointed by the course quality, which really differs from course to course.

Anyway, with the new year (I think it was announced in November last year) I joined the first course of the Data Mining Specialization, Pattern Discovery in Data Mining. I expect better quality than in the previous Data Science Specialization courses and more that I can learn from these courses. It started on February 9th, so let's take a more detailed look at it, and I will give you my first impressions.

The Data Mining Specialization is prepared and led by people from the University of Illinois at Urbana-Champaign and contains the following courses, which you can take in any order (they are independent):
  1. Pattern Discovery in Data Mining
  2. Text Retrieval and Search Engines
  3. Cluster Analysis in Data Mining
  4. Text Mining and Analytics
  5. Data Visualization
So far, each course has just one session on offer (Feb 9th, Mar 16th, Apr 27th, Jun 8th, Jul 20th), so you can follow the order above month by month. You can take them either for free or with a verification payment of $49 (the same as in the Data Science Specialization courses), which leads to a certificate.

I am a little disappointed that I cannot start with Text Mining, which is the most interesting course for me, and that I need to wait 3 months for it, but I suppose it needs some time to be fully ready.

I have to say that at the beginning of the Pattern Discovery course there were some shortcomings in organization and course preparation, but it keeps getting better. After 2 weeks of study (you always have a few days in reserve each week to finish the quiz) it is an interesting course, but so far it seems to me that I am missing more detail, some tough study going deep into the subject matter. Maybe that is just because the course consists of passing quizzes and has no implementation assignments.

Anyway, I can strongly recommend that you try it for free (you have a couple of weeks before you need to pay for the verified certificate path) and then decide whether you like it or not. In my experience, each teacher and each university runs courses differently: different approach, presentation, tasks, quizzes and so on, so you need to get used to it.