Friday, October 6, 2017

Chasing my own tail

Photo Taro the Shiba Inu @ Flickr

It's a depressing moment in anyone's life, chasing your own tail. What is this about in my case? I want to move my career towards a data science role. Not big data, i.e., working with Hadoop, but real data science, which is about statistics, data mining, data analysis, looking for patterns, feature extraction and so on. The issue is experience on some project.

Yes, you most likely know this pattern:

  1. There is no job without project experience.
  2. You can't get project experience without a job.

Oh, wait, maybe I can break this infinite loop. How?

Monday, March 27, 2017

Back to the basics

Photo Alban Gonzalez @ Flickr

As I have written before (There are no shortcuts), it is sometimes good to return to the basics and take it as reflection on and improvement of your knowledge. All programming languages and tools have many options for how they can be extended and used together with libraries, packages, and plugins. But do you know how they work inside, what's under the hood? How do you set the parameters properly to get the best results from such tools?

What I want to do today is continue with this idea, but not on the philosophical level as last time; this time with concrete examples. I will give you some hints which I have tried myself.

Machine learning

What can you do with machine learning? Whenever you install Python or R, you get libraries such as NLTK and Scikit-learn for Python, or e1071 for R. But do you see under the hood? Do you know how they are implemented inside, or are they just black boxes for you? Do you want to know? Here is what you can do:
  • Start by implementing the simplest machine learning methods yourself. For instance, k-Nearest Neighbors is a really good example and you can write it from scratch in half an hour (see the sketch below this list), as you can the Decision Tree method or even the Random Forest method.
  • Take a look at how it's implemented in existing code; for example, you can dig into the NLTK Python source and learn how a particular process is built.
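
To make the first bullet concrete, here is a minimal kNN sketch in plain Python, assuming Euclidean distance and a simple majority vote; the function names are mine, not from any library:

    import math
    from collections import Counter

    def euclidean(a, b):
        # Straight-line distance between two equal-length feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_predict(train, labels, query, k=3):
        # Take the k training points closest to the query...
        neighbors = sorted(zip(train, labels), key=lambda p: euclidean(p[0], query))[:k]
        # ...and let their labels vote; the majority wins
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Toy usage with two 2D classes
    train = [(1, 1), (1, 2), (8, 8), (9, 8)]
    labels = ["a", "a", "b", "b"]
    print(knn_predict(train, labels, (2, 1)))  # -> "a"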

Natural Language Processing (NLP)

The machine learning methods used for NLP are similar to the ones you already know from the previous study of the basics (kNN, Decision Tree, Random Forest), or they are quite extensive to implement from scratch. The same goes for n-grams and textual document models (a tiny n-gram sketch follows below). I will return to this topic when I have more experience to share, but if you search, you will definitely find similar sources about other topics of interest such as Big Data, Visualization, Statistics, etc.
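
As a small illustration of the n-gram idea, here is a minimal sketch in plain Python, assuming simple whitespace tokenization; the function name is mine:

    def ngrams(tokens, n):
        # Slide a window of length n over the token list
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(ngrams("the quick brown fox".split(), 2))
    # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]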

Thursday, March 23, 2017

The luxury of study compared to work

On the way home from work, I realized the luxury which I have during my studies. I don't need to deliver a final and functional product, I am not tied up with deadlines and budget, and I am not locked into a technology by a customer's decision.

I can switch technologies as I need and wish. I can play with them, experiment with them. Commit the output to a version control repository or not. Freely publish the data and code, share the ideas on the blog, and finally write it all up, wrap it in a conference or magazine template, and publish.

And when I realized this, I wanted to share the approach behind my last two publications, because they are connected and together they form an end-to-end process description, with two publications and several GitHub repositories as the output.

One picture is better than a thousand words, so let's imagine the following situation describing what I have done during the last year:


Sunday, March 19, 2017

Continuous and discrete time series visualized together

The second publication for GigaScience is nearly finished and the deadline for the third publication is approaching really quickly, so I needed to start with the data analysis. Because I had only a few ideas about how to analyze the experiment data, I asked a colleague from the university for advice and sent her the data. That was a few months ago.

Recently I got back her ideas and notes, along with a recommendation for a book about time series analysis. As usual, it's good to start with several visualizations and to move from simple things to complex ones. So, I want to share some notes about it.

Data

As the output from the two parts of the experiment, two pairs of datasets were recorded:
  • Experiment #1 with Fitbit Charge HR
    • 1029 tweets with an average of 20.56 tweets per day,
    • 411 799 HR records with a frequency of 6-7 records per minute
  • Experiment #2 with Peak Basis
    • 1017 tweets with an average of 20.32 tweets per day,
    • 69 909 HR records with a frequency of 1 record per minute
All the tweets are just records of sentiment, labeled by the experiment participant with the hashtags #p for positive and #n for negative sentiment. As a part of the publication, the sentiment will be extracted via machine learning methods and compared to the human evaluation.
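
As a minimal sketch of how such labels could be read out of raw tweet text (the #p/#n convention is from the experiment; the code itself is my illustration, not the publication's actual pipeline):

    def tweet_sentiment(text):
        # The experiment marks each tweet with #p (positive) or #n (negative)
        words = text.split()
        if "#p" in words:
            return 1
        if "#n" in words:
            return -1
        return None  # unlabeled tweet

    print(tweet_sentiment("great run today #p"))   # 1
    print(tweet_sentiment("stuck in traffic #n"))  # -1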

Visualization

I took the data from experiment #1, since it has more HR records, and did the following data wrangling:

  • Tweets, or rather the evaluated sentiment, need to be extended so that they represent not just a point in time but a whole window. For this, we need to define a "breaking point" between two consecutive sentiment records; it lies simply at the midpoint between them. This is visualized in the following figure as the gray line with dots on it, where positive sentiment = 150 bpm of HR and negative sentiment = 0 bpm of HR (a scale adjustment; otherwise sentiment is represented by +1 and -1). The dots represent the true records, the line is the extrapolation to the time window.
  • Heart rate is drawn with another gray line and changes rapidly over time. I applied the simple moving average (SMA) method to it to get a much smoother line for further analysis, and also cut this SMA line by the extrapolated sentiment windows, where the red parts represent negative sentiment and the blue parts positive sentiment. A sketch of both steps follows below this list.
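Here is a minimal sketch of both steps with pandas; the file and column names are my assumptions about the dataset, and the SMA window size is just one plausible choice:

    import numpy as np
    import pandas as pd

    # Hypothetical file and column names; the real dataset's schema may differ.
    hr = pd.read_csv("hr.csv", parse_dates=["time"])          # columns: time, bpm
    tweets = pd.read_csv("tweets.csv", parse_dates=["time"])  # columns: time, sentiment (+1/-1)

    # Smooth the rapidly changing heart rate with a simple moving average;
    # at 6-7 records per minute, a 60-sample window is roughly 10 minutes.
    hr["sma"] = hr["bpm"].rolling(window=60, center=True).mean()

    # "Breaking points" between consecutive sentiment records: the time midpoints.
    t = tweets["time"].values
    breaks = t[:-1] + (t[1:] - t[:-1]) / 2

    # Extrapolate each sentiment to its whole window: every HR sample gets the
    # sentiment of the tweet window its timestamp falls into.
    hr["sentiment"] = tweets["sentiment"].values[np.searchsorted(breaks, hr["time"].values)]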
Here is the result:
It looks really promising, but that's all so far. I have a few ideas about how to continue, but since this is part of my third publication, I will publish that first and then link the publication itself as well as the complete code on GitHub. Sharing this was just about the idea of how to start visualizing continuous and discrete time series together.

Friday, October 14, 2016

There are no shortcuts

During the last 3 months I have realized how difficult it is to progress with something when you don't have enough knowledge and you are trying to go ahead without proper basics and a good background, because you don't have time to focus on them and instead jump directly into wild water and swim. What a mistake.

I was dealing with many things simultaneously. I needed to start the second publication for my PhD study, but I was still working on the second experiment which needed to be included in that publication. And of course I was studying two difficult and time-consuming courses at the same time, on Coursera and elsewhere. All of this together with a big workload at work and a push from my company to learn German. Too much to process, too much work to do.

Problem is, nobody can control the dreams you have.
There is a reason why Magical Realism was born in Colombia.
It's a country where dreams and reality are conflated.
Where, in their heads, people fly as high as Icarus.
But even Magical Realism has its limits,
and when you get too close to the sun...
your dreams may melt away.

- Narcos

I haven't flown as high as Icarus yet. I haven't fallen down and let my dreams melt away yet. But I have been close. My PhD was running away from me. I didn't finish the courses.

So, there are no shortcuts. The basic need was to organize and finish all of those tasks one by one, with proper priority and timing. This helped me out of the bad mantra: I am too busy for the basics, yet I cannot manage the complex things because I lack the basics.

And I have returned to the basics (learning a particular technique, or setting up, installing, and exploring a new technology) at least once a week. Instead of working only on my PhD or watching lots of videos from MOOC courses, I am focusing on small parts with hands-on experience.

Sunday, July 3, 2016

Heart rate and sentiment experiment design

Photo GrejGuide.dk @ Flickr
After I finished the first experiment and published the article at the HEALTHINF 2016 conference this year in Rome, I started thinking about the design of the next experiment.

As I mentioned in the improvements section of the previous paper, I tried to learn from previous mistakes and improve a lot. First, steps accumulate during the day and thus are not very independent; it would be better to use heart rate, because it is much more independent. Second, I can improve my sentiment records in both timing and evaluation. And last but not least is the sentiment extraction: instead of the supervised learning used in the previous work, I would like to use unsupervised classification.

So, let's get to the details.

Experiment Design

We are still looking for a relation between soft data (sentiment) and hard data (measurements). In the first experiment it was text recorded via Twitter, and footsteps. This time it's again text recorded via Twitter, but instead of footsteps it's heart rate, which is more independent.

These are the main objectives:
  • One month experiment (30 days)
  • 20 tweets per day (600 tweets minimum)
  • Continuous heart rate measurement (24/7)
  • Effective time for heart rate measurement between 7:00 and 23:00, i.e., 16 hours a day, with the rest used for wristband charging
  • No sleep activity monitoring
  • Steps monitoring? Perhaps.

Wednesday, June 15, 2016

Results of the fitness band experiment

As I have been progressing with my PhD study, I haven't been able to write any article because I have been busy. And it's not only the PhD that has kept me busy the whole time since January this year; I also have to take care of work and family.

Nevertheless, I should point out the results from the fitness band experiment, which I introduced when I wrote about it one and a half years ago and never continued.

What was the experiment about

To find a relation between sentiment (represented by text recorded on Twitter, just for practical reasons) and human activity represented by footsteps. Practically, it is about finding a link between soft data, the sentiment, and hard, measured data.

Reading data: GitHub implementations

I will publish the data and its processing and analysis code another time, because I don't have them on GitHub yet.

Twitter data reading implementation

What I have is an implementation of reading tweets from the Twitter API through tweepy, which is a refactored version of the original. The reason for the refactoring is that I am working on the second version of the fitness band experiment.
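
As a rough sketch of what such reading can look like with tweepy's standard cursor API (the credentials and screen name are placeholders, and the structure is my simplification, not the repository's actual code):

    import tweepy

    # Placeholders -- use your own application credentials
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth)

    # Page through a user's timeline and keep the timestamp and text of each tweet
    for tweet in tweepy.Cursor(api.user_timeline, screen_name="some_user", count=200).items():
        print(tweet.created_at, tweet.text)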

Jawbone data reading implementation

What I also have is an implementation of reading data from the Jawbone API through many different libraries. It's a small mess which I need to clean up later, when I use Jawbone again.
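
For a flavor of what reading from the Jawbone UP REST API can look like with plain requests, here is a hedged sketch; the endpoint URL and response fields are my assumptions from memory, not verified against the repository:

    import requests

    # The endpoint below is the Jawbone UP "moves" resource as I recall it;
    # treat both the URL and the response shape as assumptions to verify.
    TOKEN = "YOUR_OAUTH2_ACCESS_TOKEN"
    url = "https://jawbone.com/nudge/api/v.1.1/users/@me/moves"

    resp = requests.get(url, headers={"Authorization": "Bearer " + TOKEN})
    resp.raise_for_status()
    for item in resp.json()["data"]["items"]:
        # Each item is one day's movement summary; steps sit in its details
        print(item.get("date"), item.get("details", {}).get("steps"))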

Storytelling


Long story short

All the data was extracted and processed, and a hypothesis was defined which wasn't rejected. Unfortunately, rejection of the null hypothesis in favor of the alternative was expected, and it didn't happen. That's the result, and that's the long story short. The whole article, presented at the HEALTHINF 2016 conference in February this year (2016), is available here:



You can also follow up with the whole story in the following chapter, if you are not interested in the paper itself right now.