Today we’re gonna learn how to prepare a dataset before feeding it into our machine learning model. Our example code will predict whether someone is meditating or not by training on a dataset of brain scans.

Data is raw information. It’s a representation of both human and machine observations of the world.

Everything can be represented as data. All signs, art, literature, all of it can be represented as 1s and 0s on a computer. When we enter a virtual world, we are literally surrounded by data, since it is the fundamental building block of everything we see. And when we observe something physical in real life, it becomes data in our brain. Unless our universe is a simulation, in which case it was data all along.

If you don’t work at a tech giant, how are you supposed to get data? That brings us to step one in preparing data:

Deciding the right kind of data to use.

Select the Data

The dataset you use entirely depends on the problem you are trying to solve. If I wanna build a chatbot that comes up with new, innovative product ideas, I’m not gonna use a dataset of Tim Cook dialogue. Data is a means to an end, and the good news is there is a public dataset for almost any topic you can think of. A couple of sites I like to use to find cool datasets are:

Kaggle, since I love the format of their website and how they explain each of their datasets in detail. Also, the datasets subreddit is great for requesting datasets you want. And there is an awesome list of datasets in the readme of a GitHub repo that I’ll leave in the attachments below.

Google’s advanced search feature is also super helpful. Usually combining a few keywords with the word “data” or “database” is enough to find what we need. And to make it easier, we can specify the type of file we want, like .csv, and the type of domain, like .edu or .gov.
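For example, a couple of hypothetical queries using the filetype: and site: operators:

```
meditation EEG data filetype:csv
brainwave database site:edu
```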

Usually a website has an API that makes it easier to get the data you need. But if it doesn’t, you can use a library like BeautifulSoup to take a raw HTML webpage and just scrape the data directly. DIY data… DIYD.
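As a minimal sketch of that DIY approach (the URL and table structure here are made up for illustration), scraping a page with requests and BeautifulSoup might look like this:

```python
import csv

import requests
from bs4 import BeautifulSoup

# Hypothetical page containing an HTML table of data; swap in a real URL.
url = "https://example.com/meditation-study"
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Pull every row out of the first table on the page and save it as a CSV.
with open("scraped_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in soup.find("table").find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        writer.writerow(cells)
```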

Once we decide the type of data we want, our second step is to process it.

Process Data

We are gonna write a function to extract data from a brain scan dataset, then feed that data into a single-layer neural network created in TensorFlow. The network will create a separator line between the two classes so that, given a new person’s brain scan data, it can predict whether they are meditating or not.

Let’s take a look at our data.

This is a list of neurological measurements collected via an EEG device for a set of human volunteers. There are two possible classes: either meditating, represented by a 1, or not meditating, represented by a 0. And there are three features for this data: a measure of mental focus, a measure of calmness, and the volunteer’s gender. We want to format our data properly.
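To make that concrete, a few rows of the CSV might look something like this (these values are made up; the class label comes first, followed by focus, calmness, and gender):

```
1,7.2,8.1,male
0,3.4,2.2,female
1,6.8,7.5,female
```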

Data can come in the form of a text file, a relational database, or, like what we have, a CSV. And there are libraries, like pandas, that can convert pretty much any common file type to another, so make sure you have your data formatted into a file type that you feel most comfortable with. Once it is in the right format, we want to clean the data.
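For example, a quick sketch of converting a spreadsheet into a CSV with pandas (the file names here are hypothetical):

```python
import pandas as pd

# Read a hypothetical Excel file and write it back out as a CSV.
df = pd.read_excel("brain_scans.xlsx")
df.to_csv("brain_scans.csv", index=False)
```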

Sometimes we have instances in our data that are incomplete. We can iterate through each row and delete those instances by checking whether any value is empty. We should also decide which features to use.
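Here’s a minimal sketch of that cleaning pass, assuming our data lives in a hypothetical file named brain_scans.csv:

```python
# Keep only the instances where every value is non-empty.
with open("brain_scans.csv") as f:
    lines = f.read().splitlines()

clean_lines = []
for line in lines:
    values = line.split(",")
    if all(value.strip() != "" for value in values):
        clean_lines.append(line)
```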

Deciding Features

Deciding which features are important is one of the key parts of data science. If we don’t use the right features, our model will make bad predictions. We only want to use features that are relevant to our problem. For example, in our case the gender of the volunteers has nothing to do with their meditative state, so we can totally disregard that feature.

So let’s first create two arrays. One array will hold our class labels, and the other will hold our features. We can iterate through every line in our CSV file using a for loop. We define a row, which is a single instance, as an array of values by splitting the line on the comma separator. Using this row, we can first get the associated class label by retrieving the first value in the row array, converting it into an int, and then adding it to our labels array. Now we can do the same thing for our features array.

We will take each feature and convert it to a float, since we want precision in our values. Our features array is now an array of arrays; the full extraction function is sketched below.
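Here’s a minimal version of that function, assuming the cleaned file is named brain_scans.csv with the label in the first column (and skipping gender, the feature we decided to disregard):

```python
def extract_data(filename="brain_scans.csv"):
    labels = []
    features = []
    with open(filename) as f:
        for line in f:
            # A row is a single instance: label first, then the features.
            row = line.strip().split(",")
            # The first value is the class label: 1 = meditating, 0 = not.
            labels.append(int(row[0]))
            # Convert each feature to a float; take focus and calmness,
            # skipping gender in the last column.
            features.append([float(value) for value in row[1:3]])
    return labels, features
```

Now that we’ve pulled our data from the dataset file into memory, we’ve arrived at the last step.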

Transforming Data

One possible transformation is decomposition. Sometimes we have features that are too complex, like a date. If we’re trying to predict which day in October is most likely to get rainfall this year, we don’t really need the month and the year. If we decompose that feature into just the day of the month, that will make our model more accurate.
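For instance, a quick sketch using Python’s standard datetime module:

```python
from datetime import datetime

# Decompose a full date down to just the day of the month.
date = datetime.strptime("2017-10-14", "%Y-%m-%d")
day_of_month = date.day  # 14 is all our rainfall model needs
```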

Since we are satisfied with the features and class labels we have, we will perform the only transformation we need: we’ll transform them into vectors.

Vectors are numerical representations of features. All features can be represented as vectors: words, images, videos, all of it. We can take these vectors and feed them into our neural network directly.

We’ll convert our array of arrays into a 2D matrix using NumPy’s matrix function and set the type to float. This is a matrix of feature vectors; each vector contains the list of features for one instance. We also want to transform our class label array into a NumPy array, because a NumPy array can easily be converted into a one-hot matrix. Then we’ll return our fully processed feature matrix and one-hot label matrix.
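A sketch of that final transformation, continuing from the extract_data function above (the one-hot conversion here uses np.eye, one common way to do it):

```python
import numpy as np

labels, features = extract_data()

# 2D matrix of feature vectors, one row per instance.
feature_matrix = np.matrix(features).astype(np.float32)

# NumPy array of labels, then one-hot: 0 -> [1, 0], 1 -> [0, 1].
label_array = np.array(labels)
onehot_labels = np.eye(2)[label_array]
```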

So what is one-hot encoding?

One-hot encoding is a process by which categorical variables are converted into a form that ML algorithms can use to make better predictions. In simple terms, each class gets its own binary slot in a vector, so every label gets a numerical representation that differentiates it without giving any single class priority; hence, a one-hot matrix.
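Concretely, for our two classes (a tiny made-up example):

```python
import numpy as np

labels = np.array([1, 0, 1])  # meditating, not, meditating
onehot = np.eye(2)[labels]
# array([[0., 1.],
#        [1., 0.],
#        [0., 1.]])
```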

Once we have our data processed, we want to feed it into our graph. In TensorFlow, the placeholder object is considered a gateway for data into our computation graph, so we want to initialize two placeholders: one for our class labels and one for our associated feature vectors. When we finally run our training step using the run function, we can feed our data into the graph using the feed dictionary parameter: the label placeholder gets the labels and the feature placeholder gets the features. When we run our model, it will show the classification line that is created to separate the meditating from the non-meditating people, and if we feed it a new instance, it will classify it just like that.
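Here’s a minimal sketch of that wiring, assuming the TensorFlow 1.x-style API (the toy data and variable names are hypothetical; in the real pipeline, feature_matrix and onehot_labels would come from our processing function):

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assuming the TF 1.x placeholder API

tf.disable_v2_behavior()

# Placeholders are the gateways for data into the computation graph;
# None lets us feed any number of instances at once.
features = tf.placeholder(tf.float32, shape=[None, 2])  # focus, calmness
labels = tf.placeholder(tf.float32, shape=[None, 2])    # one-hot classes

# A single-layer network: one weight matrix and one bias vector.
W = tf.Variable(tf.zeros([2, 2]))
b = tf.Variable(tf.zeros([2]))
logits = tf.matmul(features, W) + b

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Toy data standing in for our processed feature and label matrices.
feature_matrix = np.array([[7.2, 8.1], [3.4, 2.2]], dtype=np.float32)
onehot_labels = np.eye(2, dtype=np.float32)[[1, 0]]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # The feed dictionary routes each array to its placeholder.
    sess.run(train_step,
             feed_dict={features: feature_matrix, labels: onehot_labels})
```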

Conclusion

So to break it down, the steps to prepare a dataset are selecting the right data, processing it, and transforming it. You can easily find public datasets on the web via a number of sources (I’ll attach the links below), or you can use web scraping tools like BeautifulSoup to create them yourself. And once we have our data, we’ll convert it into vectors, the numerical representations that our ML model can understand.