A guide to seeing the wood for the decision trees
Machine learning models can be used to improve efficiency, identify risks or spot new opportunities, and they have applications across many different sectors. They either predict an exact value (e.g. next week’s sales) or predict a grouping – for example, in a risk portfolio, whether a customer is high risk, medium risk or low risk.
It’s worth noting that machine learning doesn’t work well on all problems. In cases where the pattern is new and hasn’t been seen many times before, or where there isn’t enough data available, a machine learning model will not fare quite so well. Additionally, while there are techniques available to support a wide variety of use cases, there still remains a need for human validation, sense-checking and domain knowledge.
That being said, we can make a start in looking at what can be achieved by tackling one of the cases described above; let’s see what’s involved in the process of turning a set of customer data into a risk level prediction by a basic application of a machine learning technique.
Let’s get classy
To do this, we can use a classification model – predicting which classification, or group, each item belongs to. One type of algorithm that does this well is the random forest. This type of model is based on the decision tree, a method that works by splitting the data for a set of subjects (in this case customers) by its different variables (the information about the customers), and keeps splitting until each record has been placed into a particular category. A random forest is a collection of such trees. Using multiple trees reduces the risk of over-fitting (a situation where a model works very well for the particular set of data used to train it, but doesn’t work so well for subsequent data sets).
Creating something as complicated as this may seem like a daunting prospect. But the good news is that many languages have libraries that are pre-built to create this type of model. In this instance, I’m using the Python library scikit-learn (along with the libraries pandas and numpy, which are helpful for managing the data sets).
Before continuing, make sure you have Python installed (I’m using Python 2) and that you install the three packages mentioned above. Do this in the terminal:
pip install pandas (and the same for numpy and scikit-learn – note that the package is imported under the name sklearn).
The examples shown are in Jupyter notebook, which is an interface commonly used by Data Scientists during development. The same snippets will work directly in the Python console or any other Python IDE.
The import statements make the libraries available for your current session. Then go on to load data from your csv file into a data frame (this is a particular style of data grid used by pandas) and then add in your header names.
Now the data is held within the pandas dataframe (df), shown below by selecting the top five lines as a sample.
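The original snippets aren’t reproduced here, but a minimal sketch of this loading step might look like the following. The column names (num_devices, tenure_months, risk_label) and the inline CSV are illustrative assumptions, with io.StringIO standing in for a file on disk:

```python
import io
import pandas as pd

# Inline CSV stands in for the customer file on disk; the columns here
# (num_devices, tenure_months, risk_label) are illustrative assumptions.
csv_data = io.StringIO(
    "3,12,high\n"
    "1,48,low\n"
    "2,24,med\n"
)

# Load the raw file, then add the header names afterwards.
df = pd.read_csv(csv_data, header=None)
df.columns = ["num_devices", "tenure_months", "risk_label"]

# Sanity check: select the top five lines as a sample.
print(df.head())
```

With a real file, the io.StringIO object would simply be replaced by the csv file’s path.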
To prepare the model to make predictions, it needs to be “trained”. That is, it is shown a set of data that already has classifications associated with it. From that, the model learns about the relationship between the information it is given on the data subjects (in this case the customers) and the labels that are associated with them (whether the customers are high, medium or low risk).
In the case of a random forest model, relationships are found by partitioning, or splitting, the data by the features of the data set. For example, separating by the number of devices used would split those records with an answer of one device away from those with an answer of two devices (there may be more than two groups, depending on the cardinality of the data set). Further splitting is done using different pieces of information, until a decision can be made as to which final category the record falls into (e.g. the risk level in this instance).
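To make the idea of a single split concrete, here is a toy sketch (with made-up data and column names) of partitioning records by one feature, which is what a single decision-tree node effectively does:

```python
import pandas as pd

# Made-up records, illustrative only.
df = pd.DataFrame({
    "num_devices": [1, 2, 1, 2, 3],
    "risk_label": ["low", "med", "low", "high", "high"],
})

# Partition the records on one feature, giving one group per distinct
# value; a tree keeps splitting each group on further features until a
# category can be assigned.
groups = {value: rows for value, rows in df.groupby("num_devices")}
print(sorted(groups))  # one group per distinct device count
```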
Into the woods…
Once the model has been trained, it is then tested on extra data that it hasn’t yet seen. The new data has the original labels removed, and the model is asked to predict the values on its own.
To achieve this, the data set needs to be split in two. One part of the data set is used for training, and the other part is used for testing. The below sections of code achieve this by randomly assigning a value between 1 and 100 to each observation, and separating those rows with random numbers below 70 to be training and the rest to be testing. So roughly 70 per cent of the data will be used for training. Printing out a count of values in each data set will show that this has worked.
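A sketch of that split, under the assumption of a synthetic customer frame standing in for the df loaded earlier (the column names are again illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the customer dataframe loaded earlier.
np.random.seed(42)
df = pd.DataFrame({
    "num_devices": np.random.randint(1, 4, size=200),
    "tenure_months": np.random.randint(1, 61, size=200),
    "risk_label": np.random.choice(["high", "med", "low"], size=200),
})

# Assign every row a random number between 1 and 100, then keep the rows
# below 70 for training and the rest for testing (roughly a 70/30 split).
df["random_split"] = np.random.randint(1, 101, size=len(df))
train_df = df[df["random_split"] < 70]
test_df = df[df["random_split"] >= 70]

# Print counts of each data set to confirm the split worked.
print(len(train_df), len(test_df))
```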
Now it’s time to get the training set ready for the model. A variable is created to hold references to the features (the information that helps determine the eventual category) and another variable to hold the categories themselves.
Firstly, create the variable for the categories. The variable train_labels in the example below holds the contents of the risk_label column from the data set. These are currently the risk levels “high”, “med” or “low”, but are changed into numbers (0, 1, 2) using a function called “factorize”.
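A minimal sketch of the encoding step, with a small stand-in for the training split created earlier:

```python
import pandas as pd

# Stand-in for the training split; risk_label is the assumed column name.
train_df = pd.DataFrame({"risk_label": ["high", "low", "med", "high", "low"]})

# factorize maps each distinct label to an integer code, in order of
# first appearance, and also returns the index of matching labels.
train_labels, label_index = pd.factorize(train_df["risk_label"])

print(train_labels)       # integer codes: [0 1 2 0 1]
print(list(label_index))  # the labels those codes stand for
```

Keeping hold of the returned label index is useful later, as it lets the integer predictions be decoded back into their textual labels.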
Next, the names of the features are captured in a separate variable; columns_for_features in the example below. At the same time, the random forest classifier is created and stored in its own variable.
Now everything is ready to train the model. The classifier has a function, fit, which is passed the training part of the data set (train_df), is told which columns to pay attention to, and is also passed the training labels, or categories, that are already available.
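Those steps might be sketched as follows – the training data, column names and the classifier variable name clf are all assumptions standing in for the article’s originals:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training split prepared earlier.
train_df = pd.DataFrame({
    "num_devices": [1, 2, 3, 1, 2, 3, 1, 2],
    "tenure_months": [48, 24, 6, 50, 30, 3, 44, 20],
    "risk_label": ["low", "med", "high", "low", "med", "high", "low", "med"],
})

# Names of the feature columns, and the encoded training labels.
columns_for_features = ["num_devices", "tenure_months"]
train_labels, label_index = pd.factorize(train_df["risk_label"])

# Create the random forest classifier, then fit it on the feature
# columns and the matching labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_df[columns_for_features], train_labels)
```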
So now the model should be trained. It will have worked out how the “features” are associated with the “labels” and be in a position to determine future labels for data when only the features are available.
Using the part of the data set that was reserved for testing, the model can be tested to see how well it performs. The classifier has a function called predict, which is passed the features data from the test_df data set that was prepared earlier. The output of this is a set of integers (0, 1, 2) which represent the labels ('high', 'med', 'low'). These are the categories that have been predicted by the model.
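A self-contained sketch of the prediction step (repeating a tiny version of the training above, with assumed data and names, so it runs on its own):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Minimal stand-in for the earlier training step.
train_df = pd.DataFrame({
    "num_devices": [1, 2, 3, 1, 2, 3],
    "tenure_months": [48, 24, 6, 50, 30, 3],
    "risk_label": ["low", "med", "high", "low", "med", "high"],
})
columns_for_features = ["num_devices", "tenure_months"]
train_labels, label_index = pd.factorize(train_df["risk_label"])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_df[columns_for_features], train_labels)

# Unseen test rows with the labels withheld; predict returns the
# integer codes that factorize assigned to each risk level.
test_df = pd.DataFrame({"num_devices": [3, 1], "tenure_months": [5, 47]})
predictions = clf.predict(test_df[columns_for_features])
print(predictions)
```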
This is very exciting, but not too meaningful just yet. A couple of quick steps will decode the values back to their textual labels and then compare the categories that the model came up with to the original labels that were in the test data set.
The grid below displays the number in each of the real groups, compared to the predicted groups. What is shown here is that for the 10 observations that were high risk, the model predicted nine of them to be high and thought that one of them was medium risk. For the 18 that were low risk, the model predicted them all perfectly. And finally for the 10 that were medium risk, the model predicted seven of them correctly and three of them were incorrectly predicted to be high risk.
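The decode-and-compare step could be sketched like this – the predictions and actual labels here are small made-up stand-ins, not the article’s actual counts:

```python
import pandas as pd

# Illustrative stand-ins: integer predictions from the model and the
# label index returned by factorize.
label_index = pd.Index(["high", "med", "low"])
predictions = [0, 0, 2, 1, 1, 2, 2]
actual = ["high", "med", "low", "med", "med", "low", "low"]

# Decode the integer codes back into their textual labels...
predicted_labels = pd.Series(label_index[predictions], name="predicted")

# ...then cross-tabulate the real categories against the predicted ones.
confusion = pd.crosstab(pd.Series(actual, name="actual"), predicted_labels)
print(confusion)
```

Each row of the resulting grid shows how the records in one real category were distributed across the predicted categories, so correct predictions sit on the diagonal.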
So this is a pretty good result. A few small steps and we were able to create a model, train it to recognise patterns in the data, and based on this training, have the model predict categories for customer data it has not seen before. A model like this could mean that rather than have agents in your company manually reviewing customer details, you could be in a position to short-cut the process and focus attention right away on the high risk clients.
In this instance the classifier predicted risk levels. But the same technique could be applied to predicting customer churn, machine breakdown, and many other business problems.
In reality a lot more time would be spent on this process, but this is a good first step to demonstrate the basic principles and show the key steps that go into it.
Additionally, we enjoyed the benefit of using a dataset that was pre-prepared. In most cases the glory of machine learning comes only after a significant amount of dirty work in getting the data in the right shape for modelling. This can include data cleaning, feature selection (choosing which data to include in the first place), transformation and formatting.
Each of these areas is a topic in its own right and deserves more than a light mention, but that’s probably enough for one day. ®