Wednesday, October 14, 2015

Course: Exploratory Data Analysis

What do you picture when you hear Exploratory Data Analysis?

After a while I decided to continue with one of the first specializations on Coursera, which I started more than a year ago. The fourth course of this Data Science Specialization shares its name with the topic of my opening question: Exploratory Data Analysis.

I was wondering how much effort I would need to put in and how much I could learn, after 2 years on the Data Science study path (yes, the anniversary of this blog is the 18th of December 2015), more than a year and a half of intensive use of the R language, and a couple of statistics and R courses. What could surprise me, right?

Nothing and something. I mean both in a positive way, and I will tell you why.

Saturday, March 7, 2015

Computing text conditional entropy with uni- and bi-grams in R and Python

During the first semester of my PhD study I implemented a solution for computing conditional entropy over a text in which each word (including punctuation) was on a separate line. I would like to share this example because it is simple and good for learning the basics (information theory, R and Python programming).

Conditional entropy is defined as (source is Wikipedia):
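
For reference, the standard chain of equalities can be written in LaTeX as below. This is a reconstruction: the line numbering is assumed to match the Wikipedia derivation cited here, with x standing for the previous word and y for the word that follows it.

```latex
\begin{aligned}
H(Y|X) &\equiv \sum_{x} p(x)\, H(Y|X=x) \\
       &= -\sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x) \\
       &= -\sum_{x} \sum_{y} p(x,y) \log p(y|x) \\
       &= -\sum_{x,y} p(x,y) \log p(y|x) \\
       &= -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}
\end{aligned}
```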


Just note that this computation determines the conditional entropy of the word distribution in a text given the previous word.

According to line 4 of the formula, I have to compute p(x,y), which is the probability that at any position in the text you will find the word x followed immediately by the word y, and p(y|x), which is the probability that word y will follow given that word x occurs in the text.

Or, according to line 5 of the formula, I can use the probability p(x,y) twice and calculate p(x), which is the probability of a single word appearing in the text.
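
The computation described above can be sketched in Python roughly as follows. This is a minimal sketch and not my original implementation: the function name `conditional_entropy` is only illustrative, probabilities are estimated from raw bigram counts, p(x) is estimated from how often x appears as the first word of a bigram, and the logarithm is assumed to be base 2 (so the result is in bits).

```python
from collections import Counter
from math import log2


def conditional_entropy(words):
    """Estimate H(Y|X) in bits, where X is a word and Y is the word after it.

    `words` is a list of tokens in text order (one token per input line).
    """
    # Count every pair (x, y) where y immediately follows x.
    bigrams = Counter(zip(words, words[1:]))
    total = sum(bigrams.values())

    # Count how often each word x appears as the first element of a pair;
    # this is the count behind p(x) in this sketch (an assumption).
    firsts = Counter()
    for (x, _), count in bigrams.items():
        firsts[x] += count

    entropy = 0.0
    for (x, y), count in bigrams.items():
        p_xy = count / total               # p(x,y)
        p_y_given_x = count / firsts[x]    # p(y|x) = p(x,y) / p(x)
        entropy -= p_xy * log2(p_y_given_x)
    return entropy
```

For a file with one word per line, the token list could be obtained with something like `open(path).read().splitlines()`, where `path` is a hypothetical file name.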

Tuesday, March 3, 2015

Course: Data Mining Specialization on Coursera

When Specializations started on Coursera last year, I joined the Data Science Specialization and was really keen to do as much as I could. I have done 3 courses in a row so far; then I stopped, held up by my own PhD study activities and also a little disappointed by the quality of the courses, which really differs from course to course.

Anyway, with the new year (I think it was announced during November last year) I joined the first course of the Data Mining Specialization, Pattern Discovery in Data Mining, expecting better quality than the previous courses of the Data Science Specialization and more that I can learn from these courses. It started on the 9th of February, so let's take a more detailed look at it and I will tell you my first impressions.

The Data Mining Specialization is prepared and led by people from the University of Illinois at Urbana-Champaign and contains the following courses, which you can take in any order (they are independent):
  1. Pattern Discovery in Data Mining
  2. Text Retrieval and Search Engines
  3. Cluster Analysis in Data Mining
  4. Text Mining and Analytics
  5. Data Visualization
So far, each course has just one term offered (Feb 9th, Mar 16th, Apr 27th, Jun 8th, Jul 20th), so you can follow the order above month by month, either for free or with a verification payment of $49 (the same as in the Data Science Specialization courses), which leads to a certificate.

So I am a little disappointed that I cannot start with Text Mining, which is the most interesting for me, and that I need to wait 3 months for it, but I suppose it needs some time to be fully ready.

At the beginning of the Pattern Discovery course, I have to say, there were some shortcomings in organization and course preparation, but it gets better and better. After 2 weeks of study (you always have some days in reserve each week to finish the quiz) it is an interesting course, but so far it seems to me that I am missing more detail, some tough study going deep into the subject matter. Maybe that is simply because we only pass quizzes and there are no implementation assignments.

Anyway, I can strongly recommend that you try it for free (you have a couple of weeks before you need to pay for the verified certificate path) and then decide whether you like it or not. In my experience each teacher and each university makes courses differently: different approach, presentation, tasks, quizzes and so on, so you need to get used to it.

Wednesday, November 26, 2014

Killing me softly

The title says everything. This killing combination came with a project in Sweden, together with the previously started PhD study and my spare-time study during evenings and sleepless nights, but let's take it from the beginning.

At the beginning there was a sustainable project with a daily routine, and spring, the part of the year with the most energy. So I applied and got into PhD study. Euphoria was the mood I was in right when I got the acceptance letter.

Then I started studying without realizing how many things I would need to go through, so I started with two subjects in one semester. So far so good. A lot of study materials, I can't deny it. But really interesting topics (I will blog about them sometime later).

"Winter is coming."

- the motto of House Stark

And then I started to work on the project in Sweden, and things got more interesting. I am not able to study on a daily basis, so I am a little behind schedule, and I am really tired from the winter and the dark (at 7 AM there is no daylight yet and at 2 PM there is no daylight anymore).

So, even though I put great discipline and a lot of effort into it, I still have more things to do. I hope I will handle it at least partially, or with some postponements, and will finish the first semester in some reasonable way.

Saturday, October 4, 2014

Introduction to fitness band experiment

This started as a stupid idea. It was influenced by someone's project in a Coursera course (Data Analysis and Statistical Inference), where I did volunteer evaluation. The other student analyzed his own data from a Fitbit Flex. I almost forgot about it. Then later, in another Coursera course (Getting and Cleaning Data from the Data Science Specialization), we did data processing in R, and the data also contained accelerometer readings from an experiment with a couple of human subjects (see Human Activity Recognition Using Smartphones Data Set).

It then continued when I was discussing my dissertation topic with my supervisor and we arrived at an experiment with fitness bands and sentiment expressed on Twitter.

Monday, September 29, 2014

Big Data and Data Science study - presentation

During our company meeting on Friday, the 26th of September, I gave a presentation on the topic in the title. I think this presentation could be useful not only for the audience that was there but for anyone.

Here is a description of what the presentation is about:
Big Data and Data Science study, subtitled "study materials and online courses", is a presentation of a little over 40 slides about 10 domains of Data Science covered by free and paid online courses (MOOCs), study materials and free books.

Based on almost a year of studying, investigating and collecting materials, tutorials, courses, books, links, etc., I have prepared a distillation of the best in this short presentation.

Of course the list is not complete, because there is always something new, undiscovered and better than before. But it contains the most important information for those who want to start, or who have already begun and don't know exactly where to go next.

Head over to the Speaker Deck site, where you can download the presentation as a PDF, which is quite useful given the working links throughout the presentation.


Sunday, August 31, 2014

Data Science and PhD study

When I was applying for PhD study, I thought of it as an opportunity to use all the techniques, programming languages, methods, processes and technologies in a practical way and to deliver something reasonable that supports my professional growth and development. It could also stand as proof that I am really keen to learn and improve in this field and that I have a passion for Data Science.

I thought about it again when I read the Wired article The Modern Data Nerd Isn’t as Nerdy as You Think, about Data Nerds who are currently in charge of Data Science departments or in similar roles and apply their experience and knowledge, usually without having a PhD. Moreover, they do not have a master's degree, just a bachelor's degree. Or, if they have any kind of degree, it is not from a field close to Data Science.