Photo: Alban Gonzalez @ Flickr
As I have written before (There are no shortcuts), sometimes it is good to return to the basics and treat it as a way to reflect on and improve your knowledge. Every programming language and tool offers many ways to be extended and used together with libraries, packages, and plugins. But do you know how it works inside, what's under the hood? How do you set the parameters properly to get the best results from such tools?
What I want to do today is continue with this idea, but not on the philosophical level as last time; instead, I will use concrete examples and give you some hints that I have tried myself.
Machine learning
What can you do with machine learning? Whenever you install Python or R, you get libraries such as NLTK and scikit-learn for Python, or e1071 for R. But do you see what is under the hood? Do you know how they are implemented inside, or are they just black boxes for you? Do you want to know? Here is what you can do:
- Start by implementing the simplest machine learning methods yourself. For instance, k-Nearest Neighbors is a really good example, and you can write it from scratch in half an hour (a minimal sketch follows this list); the Decision Tree method or even the Random Forest method are also worth trying.
- Take a look at how the methods are implemented in existing code; for example, you can read the NLTK Python source and learn how its authors designed a particular process.
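To make the first point concrete, here is a minimal k-Nearest Neighbors classifier in plain Python. It is a sketch with invented toy data and function names (euclidean, knn_predict), not a substitute for an optimized implementation like scikit-learn's KNeighborsClassifier:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two numeric feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    # Classify `query` by a majority vote of its k nearest training points.
    # `train` is a list of (feature_vector, label) pairs.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two small clusters in a 2D plane.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
print(knn_predict(train, (1.1, 0.9)))  # -> "a"
print(knn_predict(train, (5.1, 5.0)))  # -> "b"
```

The whole algorithm fits in two short functions, which is exactly why it is such a good first exercise.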
Natural Language Processing (NLP)
The machine learning methods used for NLP are either the same ones you already know from studying the basics (kNN, Decision Tree, Random Forest), or they are too extensive to implement from scratch. The same holds for n-grams and textual document models, so what can you do then:
- You will most likely not implement word2vec, but bag-of-words is simple to implement yourself, and with a little more effort you can also implement tf-idf (a sketch follows this list).
- This code is also available in the NLTK library, where you can take a look and, for instance, find that tf-idf is implemented with smoothing.
- Once you have implemented n-grams (see my post Computing text conditional entropy with uni- and bi-grams in R and Python), you can play further: calculate pointwise mutual information and use it to find collocations and associations between words, as in the PMI sketch below.
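For the bag-of-words and tf-idf point above, a bare-bones version can look like the sketch below. It is an illustration on made-up sentences and uses the plain, unsmoothed idf formula, whereas a library such as NLTK may apply smoothing as mentioned above:

```python
import math
from collections import Counter

def bag_of_words(doc):
    # A bag-of-words is just a word -> count mapping; word order is discarded.
    return Counter(doc.lower().split())

def tf_idf(term, doc, corpus):
    # tf: relative frequency of the term within this document.
    words = doc.lower().split()
    tf = words.count(term) / len(words)
    # idf: down-weight terms that occur in many documents of the corpus.
    df = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]
print(bag_of_words(corpus[0]))
print(tf_idf("cat", corpus[0], corpus))  # rarer term -> higher weight
print(tf_idf("the", corpus[0], corpus))  # common term -> lower weight
```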
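And once you have bigram counts, pointwise mutual information is only a few lines further. Here is a sketch (the pmi_bigrams name and the example sentence are invented for illustration) that scores adjacent word pairs by PMI(x, y) = log(p(x, y) / (p(x) p(y))):

```python
import math
from collections import Counter

def pmi_bigrams(words, min_count=1):
    # Score each adjacent word pair by pointwise mutual information.
    # High PMI means the pair co-occurs more often than chance would
    # predict, which is the signature of a collocation.
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_uni, n_bi = len(words), len(words) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # PMI overrates rare pairs, so allow a frequency cut-off
        p_xy = count / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

words = "new york is a big city and new york never sleeps".split()
# With min_count=2, only the repeated pair ("new", "york") survives.
print(pmi_bigrams(words, min_count=2))
```

The min_count cut-off matters in practice: under raw PMI, two words that each appear once and happen to be adjacent can score as high as a genuine collocation.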
I will return to this topic whenever I have more experience to share, but if you search, you will definitely find similar resources on other topics of interest to you, such as Big Data, visualization, statistics, etc.