machine learning materials

Here I am writing a collection of notes I have about data science and machine learning. I am starting with the approach used by Guilherme Silveira in a short course (in Portuguese).

Understanding data science

A good way to understand it is through a simple example. Suppose we are observing animals on a farm that has pigs and dogs. For any animal, we can easily check three features: whether it has long fur, whether it has short legs, and whether it says woof. We will use zeros and ones to encode each feature. In supervised learning we give the machine a set of samples together with the right answers; in our case, we say whether each sample is a pig or a dog.

In this notebook, we can see a very basic approach to this problem, which is a classification problem, since we want to tell whether an animal belongs to the category of dogs or of pigs. Our dataset has six animals: three pigs and three dogs. Only one of the observed pigs has long fur, all of them have short legs, and one was observed woofing: pigs = [[0,1,0],[0,1,1],[1,1,0]]. Among the dogs, one does not have long fur, one does not have short legs, and all of them said woof: dogs = [[0,1,1],[1,0,1],[1,1,1]]. In the given notebook we trained our machine and asked it to predict what a [1,1,1] animal should be.
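The notebook trains an actual library classifier; as a minimal pure-Python sketch of the same idea, here is a nearest-neighbour rule (a substitute for illustration only, not the classifier the course uses) applied to the pigs and dogs data:

```python
from math import dist  # Euclidean distance, Python 3.8+

# Features per animal: [has long fur, has short legs, says woof]
pigs = [[0, 1, 0], [0, 1, 1], [1, 1, 0]]
dogs = [[0, 1, 1], [1, 0, 1], [1, 1, 1]]

train_x = pigs + dogs
train_y = ["pig"] * 3 + ["dog"] * 3

def predict(sample):
    # 1-nearest neighbour: the label of the closest training sample wins.
    best = min(zip(train_x, train_y), key=lambda xy: dist(xy[0], sample))
    return best[1]

print(predict([1, 1, 1]))  # "dog": [1,1,1] matches a known dog exactly
```

Since [1,1,1] appears verbatim among the dogs, its nearest neighbour is at distance zero and the sketch labels it a dog.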

In the next notebook, we started with the same dataset and trained the machine the same way. But then we gave it a small dataset of new observations for which we know the right category of each sample. The objective was to calculate the accuracy of our model.
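Accuracy here is just the fraction of new observations the model labels correctly. A minimal sketch (the labels below are made up for illustration):

```python
def accuracy(predictions, true_labels):
    # Fraction of predictions that match the known right answers.
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)

# Hypothetical model output vs. the known categories of four new animals.
preds = ["pig", "dog", "dog", "pig"]
truth = ["pig", "dog", "pig", "pig"]
print(accuracy(preds, truth))  # 0.75, i.e. 3 of 4 correct
```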

This simple application already shows the potential of machine learning. Here it is a simple categorization between two species of animals, but it could be any animal whose exact species is sometimes hard to tell, even for a human. Or it could be something else entirely: telling whether an email is potentially risky, whether a patient has some disease, or whether a customer is likely to buy a product.

Working with a bit larger dataset

In this other notebook we use pandas to retrieve a dataset from a gist file. The data is in CSV format and simulates records of customers that have visited a website. It combines information about each page a customer has visited with whether that customer bought a product or not. The idea is to build a model that predicts, from the pages a user visits, whether he or she is likely to buy a product.

It is not too different from the example of pigs and dogs; we just have a larger dataset. The point of this example is to show that we need to split our dataset into train and test sets in order to calculate the model's accuracy.
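The notebook uses a library splitter for this; the idea can be sketched in pure Python as follows, where `test_ratio` and `seed` are illustrative parameter names of my own choosing:

```python
import random

def train_test_split(samples, labels, test_ratio=0.25, seed=42):
    # Shuffle indices reproducibly, then carve off the test portion.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_ratio))
    train_idx, test_idx = indices[:cut], indices[cut:]
    train_x = [samples[i] for i in train_idx]
    test_x = [samples[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y

# Eight toy samples: 6 go to training, 2 are held out for testing.
x = [[i] for i in range(8)]
y = [i % 2 for i in range(8)]
train_x, test_x, train_y, test_y = train_test_split(x, y)
print(len(train_x), len(test_x))  # 6 2
```

Fixing the seed keeps the split reproducible; the model never sees the held-out samples during training, so the accuracy computed on them is an honest estimate.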

Diving deeper on data science

The final notebook of this series is about data visualization.

This list of exercises shows a more comprehensive example. It concerns a dataset from a survey of students at a hypothetical university. The resolution using Python and R is available. Another list has more data exploration.

I did some coding on my own, based on the inspiring works published at covid-19 dashboards. Here is a previous study comparing trajectories of covid-19 cases and deaths by Brazilian state using the Brazilian Ministry of Health's original data. Unfortunately I have abandoned this version because I did not find a way to do automated updates: in order to see updated data it is necessary to fork the repository and update the file arquivo_geral.csv manually. This latest version is the same comparison of trajectories of covid-19 cases and deaths by Brazilian state, with some extra data adding mean, median and total. It uses a permanent link to get new data provided by brasil.io, which should work for a while. There is also a second study using a linear scale just for new cases. It is "almost" using fastpages as suggested in the contributing documentation, but it does not bring any relevant contribution yet. I was planning to show balloons or some special mark for the exact date when a given government action was taken, trying to check for a causal relation. Another nice thing to do, which should be easy, is to use a different colour palette for each region.

Some other references

Python

Statistical databases

Famous quotes
