Saturday, 21 November 2015

Bunnies, Dragons and the 'Normal' World: Central Limit Theorem

Statistics is a difficult field, but this video lets you look under the hood of one piece of statistical theory, and it's fun.
So enjoy it as I did, and learn something.

Wednesday, 14 October 2015

Course: Exploratory Data Analysis

What comes to mind when you hear Exploratory Data Analysis?

After a while I decided to continue with one of the first specializations on Coursera, which I started more than a year ago. The fourth course of this Data Science specialization has the same name as the topic I asked about in the opening question: Exploratory Data Analysis.

I was wondering how much effort I would need to put in and how much I could learn. I have spent 2 years on the Data Science study path (yes, this blog's anniversary is 18 December 2015), used R intensively for more than a year and a half, and taken a couple of statistics and R courses. What could surprise me, right?

Nothing and something. I mean both in a positive way, and I will tell you why.

Saturday, 7 March 2015

Computing text conditional entropy with uni- and bi-grams in R and Python

During the first semester of my PhD studies I implemented a solution for computing conditional entropy over a text in which each word (including punctuation) is on a separate line. I would like to share this example because it is simple and good for learning the basics (information theory, R and Python programming).

Conditional entropy is defined as (source is Wikipedia):
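
Written out, the standard chain of equalities for H(Y|X) looks roughly as follows. This is a sketch of the usual textbook form; the row numbers on the right are my assumption about which rows the references to "line 4" and "line 5" point to.

```latex
\begin{align*}
H(Y \mid X) &= \sum_{x} p(x)\, H(Y \mid X = x) && \text{(1)}\\
            &= -\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) && \text{(2)}\\
            &= -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x) && \text{(3)}\\
            &= -\sum_{x, y} p(x, y) \log p(y \mid x) && \text{(4)}\\
            &= -\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)} && \text{(5)}\\
            &= \sum_{x, y} p(x, y) \log \frac{p(x)}{p(x, y)} && \text{(6)}
\end{align*}
```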


Just note that this computation determines the conditional entropy of the word distribution in a text given the previous word.

According to line 4 of the formula above, I have to compute p(x,y), which is the probability that at any position in the text you will find the word x followed immediately by the word y, and p(y|x), which is the probability that if word x occurs in the text then word y will follow.

Or, according to line 5 of the formula, I can use the probability p(x,y) twice and calculate p(x), which is the probability of a single word appearing in the text.
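
The second variant, using only p(x,y) and p(x), can be sketched in Python as below. This is a minimal illustration, not the original implementation: the function name `conditional_entropy`, the choice of base-2 logarithm, and estimating p(x) from the first element of each bigram (so that it is the exact marginal of p(x,y)) are all my assumptions.

```python
import math
from collections import Counter


def conditional_entropy(words):
    """Estimate H(Y|X) in bits, where X is a word and Y is the word right after it.

    `words` is a list of tokens in text order (one token per input line).
    """
    # All adjacent word pairs (x, y) in the text.
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0

    n = len(bigrams)
    bigram_counts = Counter(bigrams)
    # Assumption: p(x) is estimated from words that start a bigram,
    # so the last word of the text is not counted.
    first_counts = Counter(x for x, _ in bigrams)

    entropy = 0.0
    for (x, _y), count in bigram_counts.items():
        p_xy = count / n            # p(x, y)
        p_x = first_counts[x] / n   # p(x)
        entropy -= p_xy * math.log2(p_xy / p_x)
    return entropy
```

For input in the format described above, the tokens could be read with something like `words = [line.strip() for line in open(path)]`, where `path` is a placeholder for your own file.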

Tuesday, 3 March 2015

Course: Data Mining Specialization on Coursera

When Specializations started on Coursera last year, I joined the Data Science Specialization and was really keen to do as much as I could. I have done 3 courses in a row so far; then I stopped, both because of my own PhD work and because I was a little disappointed by the quality of the courses, which really differs from course to course.

Anyway, with the new year (I think it was announced last November), I joined the first course of the Data Mining Specialization, Pattern Discovery in Data Mining, expecting better quality than the previous Data Science Specialization courses and more to learn. It started on 9 February, so let's take a more detailed look and I will give you my first impressions.

The Data Mining Specialization is prepared and led by people from the University of Illinois at Urbana-Champaign and contains the following courses, which you can take in any order (they are independent):
  1. Pattern Discovery in Data Mining
  2. Text Retrieval and Search Engines
  3. Cluster Analysis in Data Mining
  4. Text Mining and Analytics
  5. Data Visualization
So far, each course is offered in just one term (Feb 9th, Mar 16th, Apr 27th, Jun 8th, Jul 20th), so you can follow the order above month by month, either for free or with a verification payment of $49 (the same as for the Data Science Specialization courses), which leads to a certificate.

So I am a little disappointed that I cannot start with Text Mining, which interests me most, and have to wait 3 months for it, but I suppose it needs some time to be fully ready.

I have to say that at the beginning of the Pattern Discovery course there were some shortcomings in organization and course preparation, but it keeps getting better. After 2 weeks of study (you always have some extra days each week to finish the quiz) it is interesting, but so far I am missing more detail, some tough study that goes deep into the subject matter. Maybe that is because you only need to pass the quizzes and there are no implementation assignments.

Anyway, I can strongly recommend that you try it for free (you have a couple of weeks before you need to pay for the verified certificate path) and then decide whether you like it. In my experience each teacher and each university makes courses differently: different approach, presentation, tasks, quizzes and so on, so you need to get used to it.