What is machine learning?
Machine learning (ML) is a fascinating field, but to many people it remains a mystery, perhaps because it is technically challenging. So, let’s try to clear up some of that mystery by looking at the basics.
Firstly, ML is interdisciplinary in nature, drawing on techniques from fields such as computer science, mathematics, statistics and artificial intelligence. ML works with data and discovers patterns by training models. The models use algorithms that improve with experience.
An often-used definition of the ML term was given by Tom Mitchell in 1997:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” -- Tom Mitchell, Carnegie Mellon University
Let us unpack this definition with an example. In our webinar, we will look at a dataset from Kaggle containing employee information, together with a label indicating whether each employee has quit their job. The employee dataset is the experience E, predicting whether an employee will leave their job is our task T, and the accuracy of the model’s predictions is the performance measure P.
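As a minimal sketch of how E, T and P map onto code (the employee records and the prediction rule below are made up purely for illustration, not taken from the Kaggle dataset):

```python
# Hypothetical illustration of Mitchell's E, T, P on the employee example.
# Experience E: a tiny, invented set of labelled employee records.
experience = [
    {"satisfaction": 0.2, "hours": 60, "quit": True},
    {"satisfaction": 0.9, "hours": 40, "quit": False},
    {"satisfaction": 0.4, "hours": 55, "quit": True},
    {"satisfaction": 0.8, "hours": 42, "quit": False},
]

# Task T: predict whether an employee quits. A naive hand-written rule
# stands in for a trained model here.
def predict(record):
    return record["satisfaction"] < 0.5

# Performance measure P: accuracy of the predictions on the experience.
correct = sum(predict(r) == r["quit"] for r in experience)
accuracy = correct / len(experience)
print(accuracy)  # 1.0 on this toy data
```

A real model would of course learn the rule from the data instead of having it written by hand; the point is only that E, T and P are concrete, measurable things.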
Basic steps of machine learning:
1. Firstly, we need data, so we must collect it. Data can come from different sources such as databases, the internet, or files such as CSV, Excel or TXT. The greater the variety, density and volume of relevant data, the better our model can learn.
2. When we have the necessary data, we must prepare it for analysis. Assessing the quality of the data is mandatory! Furthermore, taking the time to handle issues such as missing values and outliers is important. In the end, the data should be represented as a table, usually called a feature matrix: each row is one observation (a feature vector) and each column is one feature. You want the model to be as simple as possible and to generalise well to new data, so you must find out which features give accurate predictions, or create new ones. This is called feature engineering, an important and time-consuming part of the data preparation process.
3. When the data is ready, we can start training our model. This step involves choosing appropriate algorithms for the task. The cleaned data is also split into two parts – a training set and a test set. The training set is used to train the model, while the test set is used to evaluate it.
4. To figure out how good your model is, you need to evaluate it on the test set. Since your model has never seen the test set, this gives an indication of how well it has learned. In this step you can see which algorithms perform well, and whether you need to tune the parameters of your algorithm.
5. Congrats, you have now completed one cycle of the process! If you are not satisfied with the results, you need to improve the performance of the model. This is done by going through all, or some, of the steps again, and it is the main reason why a significant amount of time should be spent on data collection and preparation.
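The steps above can be sketched end to end with scikit-learn. Note that the data here is synthetic, generated on the fly, standing in for the collection and preparation steps; in the webinar a real Kaggle dataset is used instead:

```python
# Minimal sketch of steps 1-5 using scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 1-2: in practice you collect and clean real data;
# here a synthetic feature matrix X and label vector y stand in.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Step 3: split into train and test sets, then train a model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 4: evaluate on the unseen test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# Step 5: if the accuracy is unsatisfactory, revisit the earlier steps,
# e.g. engineer new features or try a different algorithm.
```

Logistic regression is only one possible choice of algorithm; the same split-train-evaluate loop applies whichever model you pick.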
Types of machine learning tasks
Classification concerns building models that separate data into distinct classes. Our employee example is a classification task: we are separating the employees into “quit” and “not quit” classes. Other examples of classification tasks are image recognition, medical diagnosis, speech recognition and product recommendation.
Regression is the second type of prediction. It is closely related to classification, but here we want to predict a continuous variable. Think of house prices, flight delays in minutes, time to failure of mechanical components, or the number of bikes rented at a specific time.
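As a small illustration of regression, a linear model fit to invented house-price data (the sizes and prices below are made up and deliberately follow a perfect straight line):

```python
# Toy regression: predict a continuous value (price) from a feature (size).
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: house sizes (m^2) and prices (in thousands),
# constructed so that price = 3 * size exactly.
sizes = np.array([[50], [70], [90], [110], [130]])
prices = np.array([150, 210, 270, 330, 390])

model = LinearRegression()
model.fit(sizes, prices)

# Unlike classification, the output is a continuous number, not a class.
predicted = model.predict([[100]])[0]
print(round(predicted))  # 300 for this perfectly linear toy data
```

Real price data is of course noisy and multi-dimensional, but the interface is the same: fit on known examples, then predict a number for a new one.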
Both classification and regression are forms of so-called supervised learning. The model is trained on a set of examples where you know the correct answers, called labels. When the algorithm has “learned” the pattern, you can use the model on new, unseen examples. A spam filter is an example of supervised machine learning: it learns to detect malicious mails by training on a set of emails which one knows are spam and not spam. It can then reach an accurate conclusion when a new, unknown email is presented.
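The spam filter idea can be sketched in a few lines with scikit-learn. The emails and labels below are invented, and a naive Bayes classifier on word counts is just one simple choice of model:

```python
# Toy spam filter: supervised learning from labelled example emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training emails with known labels (1 = spam, 0 = not spam).
emails = [
    "win money now", "free prize claim now", "cheap pills offer",
    "meeting agenda attached", "lunch tomorrow?", "project status update",
]
labels = [1, 1, 1, 0, 0, 0]

# Turn the text into a feature matrix of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB().fit(X, labels)

# Classify a new, unseen email.
new = vectorizer.transform(["claim your free money"])
print(model.predict(new)[0])
```

With only six training emails this is a caricature of a real spam filter, but the supervised pattern is the same: labelled examples in, a classifier for unseen examples out.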
Clustering is used to analyse data that has no labels, meaning we do not know the answer for each observation. Imagine not knowing whether each employee left or not. Clustering then tries to find patterns in the data and group similar observations together.
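A brief sketch of clustering with k-means, one common clustering algorithm. The points below are invented and deliberately fall into two obvious groups:

```python
# Toy clustering: no labels, just grouping similar points together.
import numpy as np
from sklearn.cluster import KMeans

# Six unlabelled points forming two obvious groups.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# The algorithm assigns a cluster number to each point; the first three
# points end up in one cluster and the last three in the other.
print(kmeans.labels_)
```

Note that we had to tell k-means how many clusters to look for; with real, messy data, choosing that number is part of the analysis.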
Association rule learning is most easily explained by market basket analysis. If a customer buys onions and potatoes, then they are very likely to also buy hamburger meat. This is just one example, but it indicates how this type of learning works. Of course, you cannot rearrange the shelves in a physical store that easily, but within e-commerce this type of analysis can be very valuable.
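The core idea can be sketched with plain Python by measuring how often items co-occur in baskets. The baskets below are made up, and real association rule mining (e.g. the Apriori algorithm) computes support and confidence for many candidate rules at once; here we check a single rule by hand:

```python
# Confidence of one association rule over made-up shopping baskets.
baskets = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
]

# Rule: {onions, potatoes} -> hamburger meat.
# Confidence = baskets containing all three / baskets containing the left side.
both = sum(1 for b in baskets if {"onions", "potatoes", "hamburger meat"} <= b)
antecedent = sum(1 for b in baskets if {"onions", "potatoes"} <= b)
confidence = both / antecedent
print(confidence)  # 2 of the 3 onion-and-potato baskets also contain hamburger meat
```

A rule is interesting when its confidence is high and the left-hand side occurs often enough (its support) to matter commercially.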
Clustering and association rule learning are forms of unsupervised learning. Here, there is no desired output, so the model must find patterns from the observations themselves, as opposed to supervised learning, where patterns are learned from labelled examples.
Why is it so popular today?
What makes machine learning great today is that we have an abundance of data, computation is cheap, and the tools needed for the analyses are getting better and easier to use. Programming languages such as Python and R are relatively easy to learn, and people around the world are creating interesting machine learning packages. This makes it possible to quickly build models that analyse bigger and more complex datasets and deliver accurate results faster. By building predictive models, organizations have a better shot at identifying opportunities and avoiding unforeseen risks.
Watch our machine learning webinar!
In the webinar, we will explain and go through the steps of machine learning, such as training and evaluating models. We will also look at Microsoft cloud technology and how machine learning can be performed using drag-and-drop functionality in Azure. A Kaggle dataset will be analysed in Azure Machine Learning, and we will deploy the model for use in production.
Click HERE to watch the webinar on-demand now.
Emir is Innofactor's Data Analyst working mainly with machine learning and analytics using Microsoft Cloud technology. He is proficient with the programming languages R and Python and works with big data frameworks such as Hadoop, Spark and R Server. Prior to joining Innofactor he worked within the offshore industry with statistical computing of environmental loads and analyzing large datasets from advanced structural models.