Monday, March 27, 2017

Back to the basics

Photo Alban Gonzalez @ Flickr

As I have written before (There are no shortcuts), sometimes it is good to return to the basics and take it as a chance to reflect on and improve your knowledge. All programming languages and tools have many options for how they can be extended and used together with libraries, packages, and plugins. But do you know how they work inside, what's under the hood? How to set the parameters properly to get the best results from such tools?

What I want to do today is continue with this idea, but not on a philosophical level as last time; instead, with concrete examples. Here are some hints which I have tried myself.

Machine learning

What could you do with machine learning? Whenever you install Python or R, you get libraries such as NLTK and scikit-learn for Python or e1071 for R. But do you see under the hood? Do you know how they are implemented inside, or are they just a black box for you? Do you want to know? So, what can you do:
  • Start by implementing the simplest machine learning methods yourself. For instance, k-Nearest Neighbors is a really good example and you can do it from scratch in half an hour; the Decision Tree method or even the Random Forest method are worth trying too (see the sketch after this list).
  • Take a look at how it is implemented in existing code; for example, you can read the NLTK Python source and learn how a particular process was designed.
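To give a concrete idea of the first point, here is a minimal sketch of k-Nearest Neighbors from scratch in Python. The function and variable names are my own, not taken from any library:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two numeric feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs
    # pick the k training points closest to the query point
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    # majority vote among the labels of the k nearest neighbors
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# toy usage
train = [((1.0, 1.0), 'a'), ((1.2, 0.8), 'a'), ((5.0, 5.0), 'b'), ((4.8, 5.2), 'b')]
print(knn_predict(train, (1.1, 0.9), k=3))  # -> 'a'
```

The whole method fits in a few lines, which is exactly why it is a good first exercise before opening a library implementation.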

Natural Language Processing (NLP)

The machine learning methods used for NLP are similar to the ones you already know from studying the basics (kNN, Decision Tree, Random Forest), or they are too extensive to implement from scratch. The same goes for n-grams and textual document models, so what to do then?
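One answer, as before, is to start with the small building blocks. Here is a minimal sketch of word n-gram extraction and a simple bag-of-n-grams document model in Python; it is a generic illustration of the idea, not the NLTK implementation:

```python
from collections import Counter

def word_ngrams(text, n=2):
    # split the text into tokens and slide a window of length n over them
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(documents, n=2):
    # represent each document as counts of its n-grams
    return [Counter(word_ngrams(doc, n)) for doc in documents]

docs = ["the quick brown fox", "the quick red fox"]
for vector in bag_of_ngrams(docs, n=2):
    print(vector)
```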
I will return to this topic when I have more experience to share, but if you search, you will definitely find similar resources on other topics of interest such as Big Data, Visualization, Statistics, etc.

Thursday, March 23, 2017

The luxury of study compared to work

On the way home from work, I realized the luxury I have during my studies. I don't need to deliver a final, functional product, I am not tied up with deadlines and a budget, and I am not locked into a technology based on a customer's decision.

I can switch technologies as I need and wish. I can play with them and experiment with them. I can commit the output to a version control repository or not. I can freely publish the data and code, share the ideas on the blog, and finally write it all up as a paper, wrap it in a conference or magazine template, and publish it.

And when I realized this, I wanted to share the approach behind my last two publications, because they are connected, and together they form an end-to-end process description with two publications and several GitHub repositories as the output.

One picture is better than a thousand words, so let's imagine the following situation describing what I have done during the last year:


Sunday, March 19, 2017

Continuous and discrete time series visualized together

The second publication for Gigascience is nearly finished and the deadline for the third publication is approaching really quickly, so I needed to start with the data analysis. Because I had only a few ideas about how to analyze the experiment data, I asked a colleague from the university for advice and sent her the data. That was a few months ago.

Recently I got her ideas and notes back, along with a recommendation for a book about time series analysis. As usual, it's good to start with several visualizations and to move from simple things to complex ones. So, I want to share some notes about it.

Data

As the output from the two parts of the experiment, two pairs of datasets were recorded:
  • Experiment #1 with Fitbit Charge HR
    • 1029 tweets with an average of 20.56 tweets per day,
    • 411 799 records of HR with a frequency of 6 - 7 records per minute
  • Experiment #2 with Peak Basis
    • 1017 tweets with an average of 20.32 tweets per day,
    • 69 909 records of HR with a frequency of 1 record per minute
All the tweets are just records of sentiment, and they are labeled by the experiment participant with the hashtags #p for positive and #n for negative sentiment. As part of the publication, the sentiment will be extracted via machine learning methods and compared to the human evaluation.
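To make the labeling concrete: a record with #p corresponds to +1 and a record with #n to -1 (the numeric representation used later in the visualization). A minimal sketch of turning a tweet text into such a label; the function name is my own:

```python
def sentiment_label(tweet_text):
    # map the participant's hashtags to numeric sentiment labels (+1 / -1)
    tags = {token for token in tweet_text.lower().split() if token.startswith('#')}
    if '#p' in tags:
        return 1
    if '#n' in tags:
        return -1
    return None  # no sentiment hashtag found

print(sentiment_label("feeling great today #p"))  # -> 1
```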

Visualization

I took the data from experiment #1, since there are more HR records, and did the following data wrangling:

  • The tweets, or more precisely the evaluated sentiment, need to be extended so that they do not represent just a point in time but a whole window. For this, we need to define a "breaking point" between two consecutive sentiments; it lies simply in the middle between them. This is visualized in the following figure as the gray line with dots on it, where positive sentiment = 150 HR bpm and negative sentiment = 0 HR bpm (the scale was adjusted; otherwise the sentiment is represented by +1 and -1). The dots represent the true records; the line is an extrapolation to the time window.
  • The heart rate is drawn with another gray line representing rapid changes over time. I applied a simple moving average (SMA) to it to get a much smoother line for further analysis, and then cut this SMA line by the extrapolated sentiment windows, where the red part represents negative sentiment and the blue part positive sentiment (see the sketch after this list).
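A minimal sketch of both steps, assuming the tweets are a time-sorted list of (timestamp, label) pairs with labels +1/-1 and the heart rate is a pandas Series indexed by time; the variable and function names are my own:

```python
import pandas as pd

def sentiment_windows(tweets):
    # extend each point-in-time sentiment to a window; the breaking point between
    # two consecutive sentiments is the midpoint of their timestamps
    windows = []
    for i, (ts, label) in enumerate(tweets):
        start = ts if i == 0 else tweets[i - 1][0] + (ts - tweets[i - 1][0]) / 2
        end = ts if i == len(tweets) - 1 else ts + (tweets[i + 1][0] - ts) / 2
        windows.append((start, end, label))
    return windows

def smooth_hr(hr_series, window=60):
    # simple moving average over the raw heart-rate records
    return hr_series.rolling(window=window, min_periods=1).mean()

# usage idea: cut the smoothed HR line by the extrapolated sentiment windows
# hr_sma = smooth_hr(hr_series)
# for start, end, label in sentiment_windows(tweets):
#     segment = hr_sma[start:end]  # plot in blue for label == +1, red for label == -1
```

The SMA window size is something to tune; with 6 - 7 HR records per minute, a window of 60 records corresponds to roughly ten minutes.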
Here is the result:
It looks really promising, but that's all so far. I have a few ideas about how to continue, but since this is a part of my third publication, I will publish it first and then link the publication itself as well as the complete code on GitHub. Sharing this was just about the idea of how to start with visualizing continuous and discrete time series together.