When preparing data for data science, data mining or machine learning projects you will create a data set that describes the various characteristics of the subject or case record. Each attribute will contain some descriptive information about the subject and is related to the target variable in some way.
In addition to these attributes, the data set will be enriched with various other internal/external data to complete the data set.
Some of the attributes in the data set can be grouped under the heading of Demographics. Demographic data contains attributes that explain or describe the person or event each case record is focused on. For example, if the subject of the case record is based on Customer data, this is the “Who” the demographic data (and features/attributes) will be about. Examples of demographic data include:
- Age range
- Marital status
- Number of children
- Household income
- Educational level
These features/attributes are typically readily available within your data sources, and if they aren’t then they may be available from a purchased data set.
Additional feature engineering methods are used to generate new features/attributes that express meaning in different ways. This can be done by combining features in different ways, binning, dimensionality reduction, discretization, various data transformations, etc. The list can go on.
The aim of all of this is to enrich the data set to include more descriptive data about the subject. This enriched data set will then be used by the machine learning algorithms to find the hidden patterns in the data. The richer and more descriptive the data set is, the greater the likelihood that the algorithms will detect the various relationships between the features and their values. These relationships will then be included in the created/generated model.
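As a rough illustration of a few of these methods, here is a minimal sketch using pandas and NumPy. The customer records, column names and bin boundaries are all made up for the example and are not taken from any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical customer records; names and values are illustrative only.
customers = pd.DataFrame({
    "age": [23, 37, 45, 61, 19],
    "household_income": [28000, 54000, 72000, 39000, 15000],
    "num_children": [0, 2, 3, 1, 0],
})

# Binning / discretization: turn a continuous age into an age range.
# The bin edges are an arbitrary choice for this sketch.
customers["age_range"] = pd.cut(
    customers["age"],
    bins=[0, 25, 40, 60, 120],
    labels=["<=25", "26-40", "41-60", "61+"],
)

# Deriving a new feature from an existing one.
customers["has_children"] = customers["num_children"] > 0

# Data transformation: a log transform to reduce the skew of income.
customers["log_income"] = np.log1p(customers["household_income"])

print(customers)
```

Which of these derived features are worth keeping depends on the problem; the point is only that each one expresses the same underlying information in a different way.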
Another approach to consider when creating and enriching your data set is to move beyond the descriptive features typically associated with Demographic data, to include Psychographic data.
Psychographic data is a variation on demographic data where the features describe the habits of the subject or customer. Demographics focus on the “who” while psychographics focus on the “why”. For example, a common problem with data sets is that they describe subjects/people who have things in common. In such scenarios we want to understand them at a deeper level. Psychographics allow us to do this. Examples of psychographics include:
- Lifestyle activities
- Evening activities
- Purchasing interests – quality over economy, how environmentally concerned are you
- How happy are you with work, family, etc
- Social activities and changes in these
- What attitudes you have for certain topic areas
- What are your principles and beliefs
The above gives a far deeper insight into the subject/person and helps to differentiate each subject/person from each other, when there is a high similarity between all subjects in the data set. For example, demographic information might tell you something about a person’s age, but psychographic information will tell you that the person is just starting a family and is in the market for baby products.
I’ll close with this. Consider the various types of data gathering that companies like Google, Facebook, etc. perform. They gather lots of different types of data about individuals. This allows them to build up a complete and extensive profile of all activities for individuals. They can use this to deliver more accurate marketing and advertising. For example, Google gathers data about what places you visit throughout the day, your search results, and lots of other activities. They can do a lot with this data. But now they own Fitbit. Think about what they can do with that data, particularly when combined with all the other data they have about you. What if they had access to your medical records too! Go Google this! You will find articles about them now having access to your health records. Again, combine all of the data from these different data sources. How valuable is that data?
When working with data sets for machine learning, lots of the data sets and examples we see have approximately the same number of case records for each of the possible predicted values. In this kind of scenario we are trying to perform some kind of classification, where the machine learning algorithm looks to build a model based on the input data set against a target variable. It is this target variable that contains the value to be predicted. In most cases this target variable (or feature) will contain binary values or their equivalent in categorical form, such as Yes and No, or A and B, or it may contain a small number of other possible values (e.g. A, B, C, D).
For the classification algorithm to perform optimally and be able to predict the possible value for a new case record, it will need to see enough case records for each of the possible values. This means it would be good to have approximately the same number of records for each value (there are many ways to overcome this and these are outside the scope of this post). But most data sets, and those that you will encounter in real life work scenarios, are never balanced, as in having a 50-50 split. What we typically encounter might be a 90-10, 98-2, etc. type of split. These data sets are said to be imbalanced.
The image above gives examples of two approaches for creating a balanced data set. The first is under-sampling. This involves reducing the number of case records in the majority class to match the number of case records in the minority class. The problems with this are that the resulting data set may be too small to be meaningful, and that the case records removed could contain important records and scenarios that the model will need to know about.
The second example is creating a balanced data set by increasing the number of records in the minority class. There are a few approaches to doing this. The first approach is to create duplicate records from the minority class until the number of case records is approximately the same for each class. This is the simplest approach. The second approach is to create synthetic records that are statistically equivalent to the original data set. A commonly used technique for this is called SMOTE, Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors algorithm to generate new, synthetic data we can use for training our model. But one of the issues with SMOTE is that it will not create sample records outside the bounds of the original data set. As you can imagine this would be very difficult to do.
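To make the idea behind SMOTE a little more concrete, here is a simplified sketch of the interpolation step written with NumPy only. It is not the Imbalanced-Learn implementation; the function name `smote_like_samples`, its parameters and the toy data are all assumptions made for illustration.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between a minority
    record and one of its k nearest minority neighbours (needs >= 2 rows)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority records.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a record is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                         # pick a minority record
        neighbour = neighbours[base, rng.integers(k)]  # pick one of its neighbours
        gap = rng.random()                             # position on the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[neighbour] - X[base])
    return synthetic

# Toy minority class with two numeric features (illustrative values).
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5], [4.0, 5.0]])
new_rows = smote_like_samples(minority, n_new=6, k=2)
print(new_rows)
```

Because each synthetic row sits on a line segment between two existing minority records, every generated value stays within the minimum and maximum of the original minority data, which is the limitation mentioned above.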
The following examples will illustrate how to perform Under-Sampling and Over-Sampling (duplication and using SMOTE) in Python using functions from Pandas, Imbalanced-Learn and Sci-Kit Learn libraries.
NOTE: The Imbalanced-Learn library (e.g. SMOTE) requires the data to be in numeric format, as its statistical calculations are performed on these values. The pandas function get_dummies was used as a quick and simple way to generate the numeric values, although this is perhaps not the best method to use in a real project. The other sampling functions can process data sets with both string and numeric values.
Data Set: The Portuguese Banking data set, which is available on the UCI Data Set Repository and many other sites. Here are some basics of that data set.
    import warnings
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    get_ipython().run_line_magic('matplotlib', 'inline')

    bank_file = ".../bank-additional-full.csv"

    # import dataset
    df = pd.read_csv(bank_file, sep=';')

    # get basic details of df (num records, num features)
    df.shape

    # dataset is imbalanced with majority of class label as "no"
    df['y'].value_counts()
    no     36548
    yes     4640
    Name: y, dtype: int64
    # print bar chart
    df.y.value_counts().plot(kind='bar', title='Count (target)');
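The examples that follow compare y against 0 and 1 and use a data frame called df_new, neither of which is created in the code shown. Here is a minimal sketch of a preparation step that would produce them, assuming the target column y holds the strings 'yes' and 'no'. The helper name `prepare_bank_data` and the tiny stand-in data frame are hypothetical and only there to make the sketch runnable.

```python
import pandas as pd

def prepare_bank_data(df):
    """Return (df_encoded, df_numeric) for the sampling examples.

    Assumption: the target column 'y' holds the strings 'yes' / 'no'.
    """
    df_encoded = df.copy()
    df_encoded["y"] = df_encoded["y"].map({"no": 0, "yes": 1})
    # One-hot encode the remaining string columns so that
    # Imbalanced-Learn only sees numeric values.
    df_numeric = pd.get_dummies(df_encoded)
    return df_encoded, df_numeric

# Tiny stand-in for the bank data, for illustration only.
sample = pd.DataFrame({
    "age": [30, 45, 52, 28],
    "marital": ["single", "married", "married", "single"],
    "y": ["no", "no", "yes", "no"],
})
df_small, df_new_small = prepare_bank_data(sample)
print(df_new_small.columns.tolist())
```

On the real data the equivalent call would be something like `df, df_new = prepare_bank_data(df)`, run before the sampling examples.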
Example 1a – Down/Under sampling the majority class y=0 (using random sampling)
    # Assumes the target y has already been encoded as 0 ("no") and 1 ("yes")
    count_class_0, count_class_1 = df.y.value_counts()

    # Divide by class
    df_class_0 = df[df['y'] == 0]  # majority class
    df_class_1 = df[df['y'] == 1]  # minority class

    # Sample majority class (y=0) to have same number of records as minority class (y=1)
    df_class_0_under = df_class_0.sample(count_class_1)

    # join the dataframes containing y=1 and y=0
    df_test_under = pd.concat([df_class_0_under, df_class_1])

    print('Random under-sampling:')
    print(df_test_under.y.value_counts())
    print("Num records = ", df_test_under.shape[0])
    df_test_under.y.value_counts().plot(kind='bar', title='Count (target)');

    Random under-sampling:
    1    4640
    0    4640
    Name: y, dtype: int64
    Num records =  9280
Example 1b – Down/Under sampling the majority class y=0 using imblearn
    from imblearn.under_sampling import RandomUnderSampler

    # df_new is the numeric version of df (see the NOTE above on get_dummies)
    X = df_new.drop('y', axis=1)
    Y = df_new['y']

    rus = RandomUnderSampler(random_state=42, replacement=True)
    X_rus, Y_rus = rus.fit_resample(X, Y)

    df_rus = pd.concat([pd.DataFrame(X_rus), pd.DataFrame(Y_rus, columns=['y'])], axis=1)

    print('imblearn under-sampling:')
    print(df_rus.y.value_counts())
    print("Num records = ", df_rus.shape[0])
    df_rus.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 1a]
Example 1c – Down/Under sampling the majority class y=0 using Sci-Kit Learn
    from sklearn.utils import resample

    print("Original Data distribution")
    print(df['y'].value_counts())

    # Down-sample majority class
    down_sample = resample(df[df['y']==0],
                           replace=True,                        # sample with replacement
                           n_samples=df[df['y']==1].shape[0],   # to match minority class
                           random_state=42)                     # reproducible results

    # Combine minority class with down-sampled majority class
    train_downsample = pd.concat([df[df['y']==1], down_sample])

    # Display new class counts
    print('Sci-Kit Learn : resample : Down Sampled data set')
    print(train_downsample['y'].value_counts())
    print("Num records = ", train_downsample.shape[0])
    train_downsample.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 1a]
Example 2a – Over sampling the minority class y=1 (using random sampling)
    # Reuses df_class_0, df_class_1 and count_class_0 from Example 1a
    df_class_1_over = df_class_1.sample(count_class_0, replace=True)
    df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

    print('Random over-sampling:')
    print(df_test_over.y.value_counts())
    df_test_over.y.value_counts().plot(kind='bar', title='Count (target)');

    Random over-sampling:
    1    36548
    0    36548
    Name: y, dtype: int64
Example 2b – Over sampling the minority class y=1 using SMOTE
    from imblearn.over_sampling import SMOTE

    # df_new is the numeric version of df (see the NOTE above on get_dummies)
    print(df_new.y.value_counts())
    X = df_new.drop('y', axis=1)
    Y = df_new['y']

    sm = SMOTE(random_state=42)
    X_res, Y_res = sm.fit_resample(X, Y)

    df_smote_over = pd.concat([pd.DataFrame(X_res), pd.DataFrame(Y_res, columns=['y'])], axis=1)

    print('SMOTE over-sampling:')
    print(df_smote_over.y.value_counts())
    df_smote_over.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 2a]
Example 2c – Over sampling the minority class y=1 using Sci-Kit Learn
    from sklearn.utils import resample

    print("Original Data distribution")
    print(df['y'].value_counts())

    # Divide by class
    train_negative = df[df['y']==0]  # majority class
    train_positive = df[df['y']==1]  # minority class

    # Upsample minority class
    train_positive_upsample = resample(train_positive,
                                       replace=True,                        # sample with replacement
                                       n_samples=train_negative.shape[0],   # to match majority class
                                       random_state=42)                     # reproducible results

    # Combine majority class with upsampled minority class
    train_upsample = pd.concat([train_negative, train_positive_upsample])

    # Display new class counts
    print('Sci-Kit Learn : resample : Up Sampled data set')
    print(train_upsample['y'].value_counts())
    train_upsample.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 2a]
Over the past few weeks or months (maybe even years) I’ve had several conversations with various people about why Data Science (or whatever you want to call it) projects fail or never really get started.
Before we go any further perhaps we need to define what ‘fail’ means in these conversations. Typically fail means that the project doesn’t deliver what was hoped for, it got bogged down in some technical or political issues, it did not deliver useful results, and more typically it is run once (or a couple of times) and never run again. You get the idea.
The following points outline some of the most typical reasons why Data Science projects fail. This is not an exhaustive list.
- We need Big Data: It seems like everything that you read says you need Big Data for your data science project. Firstly, what big data means to one person or company can be very different to what it means for another person/company. One possible definition is that it might include all the various social media and log type of data. If you don’t have all of this data then no big deal. You can still do data science projects. You have lots and lots of other data: the data that you generate every day for the general running of your business. You can use that. If you have some history of this data going back over a few months or a couple of years then even better (and most of you will say Yes I have that data). Work with the data that you already have, that you already understand, that you are already using, etc. and use that data to see if you can gain extra insights that will have some value to your business (it needs to have value, otherwise what’s the point). Some people call this everyday type of data you have ‘Small Data’. Big Data or Small Data are really bad terms. It is just Data. Let us work with data we already have and incrementally add in newer data (from your typical ‘Big Data’ sources) with each iteration of the data science project.
- We need Big Technology: This kind of follows on from the mistake of believing we need Big Data to do our data science projects. Most companies will be working with the data that they already have, and will have various technology solutions in place to manage this data. So do we really need Big Data Technology solutions for our Data Science projects, technologies like Hadoop and everything that goes along with it? The simple answer is ‘No You Don’t’. Now don’t get me wrong. These technologies are important when it comes to managing Big Data, but you don’t need these to perform your data science projects. Many, many companies both large and small are performing data science projects using their existing technology solutions and have perhaps just added some analytics tools to support their project using the data that they are already managing. Most companies have databases to store and manage their data. You can use your analytics software to work with the data in these databases to analyse, model and predict. Any results that are produced can be easily integrated back into these databases and the results can then be used by various groups within your organisation. Use the technologies you have, that you understand, that you can use to the max, supplemented with some newer analytics software that works with all of these for your data science projects. (An example: one project I’ve worked on was for a retail organisation in one of the largest countries in the world. I was working with 3 years of sales data. Is this big data? I was able to use my laptop to perform advanced analytics on all their data.)
- Old School Data Science: Give me all your data, I’ll analyse it and tell you what is happening. Unfortunately this kind of phrase is still very common. These phrases were common, and considered out of date, 20 years ago when I worked on my first data science project (it wasn’t called data science back then). If you do come across someone saying this to you, I would question their ability to deliver anything. If it was me, I would just say ‘No thank you’, and move on to someone else. You as a company will already know a lot of what is happening in your business, what data is currently being used for and any potential areas where you know advanced analytics and data science can help. You will know what the focus areas should be and how good or not your data is. You need someone who can help you to identify the key areas and what data science techniques can be used to help you to gain (a possible) greater insight into what is happening.
- No clear objective or business question/problem and no measurable outcomes: In a way this is very similar to the previous point. You don’t get into your car each morning and start driving, with the eventual hope that you arrive at work on time. No, you plan what you want to do (get to work), how you are going to get there (using your car) and when you want to get there by (your work start time). Using these you then plan out the best route to take to get to work, in the most efficient way you can, using your knowledge and experience of the road network, supplemented by traffic reports and making adjustments as necessary, to ensure that you get to work on time. This is exactly the same for data science projects. You need a good clear objective that can be broken down into distinct problems, each of which will require a specific set of advanced analytics to generate a measurable outcome. The measurable outcomes should allow you to measure if the advanced analytics actually gives you a valuable return. For example, if you predict that you can increase sales by 3%, this sounds good. But if the cost of implementing the solution is eating into any of the profit generated then you might decide that this solution is not worth continuing with.
- Not productionalising the outcomes: This point follows on from the previous two points. A lot of what you read and a lot of what I’ve seen is that Data Science looks at discovering some new (and actionable) insights. But that is where the discussion ends, as if a report is produced that makes a recommendation or a list of customers to target, and that is it. What happens to your data science project then? It usually gets canned, or you might be told that we will come back to it in a few months (and possibly a year) from now. This is not what you really want. Why? Because when you finally remember to come back to review the project and to do another run, the people who were involved in the original project have moved on or are not available. It then becomes too difficult to start over again and that is when the data science project fails. I’ve used the word ‘productionalising’ (is that a real word?). What I mean by that is that we need to take our data science project and build it into our everyday applications and processes. For example, say we build a customer risk model for loans in a bank. This should be built into the application that captures the loan application by the customer. That way when the bank employee is entering the loan application they can be given live feedback. They can then use this live feedback to address any issues with the customer. What can be typical is that this is discovered some weeks later when the loan has already been approved. We need to automate the use of our data science work. Another example is fraud detection. I know of several companies who have fraud detection measures in place. It can take them 4-6 weeks to identify a potential fraud case that needs investigation. Using data science and building this into their transaction monitoring systems they can now detect potential fraud cases in near real time (no big data architectures being used). By automating it we get a quicker response and can take actions at the right time.
The quicker we can react the more money we can make or save. This is an area that a lot of companies are now focusing on when they are looking at data science projects, as this is the way that they can get a quicker return on their investment in their data science projects.
- Very little senior management support: I think most data science projects are supported by senior management to some extent. The more successful the data science project, the more involved the senior managers are and the more they understand of what these projects can potentially deliver. But with the ever changing and evolving world of IT most of the senior managers are very focused on the here and now, keeping the lights on, making sure their day-to-day applications are up and running, the backups and recovery processes are in place (and tested), and future proofing their applications. It is well known that very little time and resources (human and money) are available for adding new functionality. Most of what I’ve mentioned is very IT related and perhaps the IT managers are not the most suitable people to sponsor data science projects. I’ve already given some of the reasons, but sometimes IT can get a bit caught up with the technology and trying to use the newest thing. Some of the most successful projects I’ve worked on have had senior managers from a business function as sponsors. They will not be focused on the technology but on the processes around the data science project and how the outputs of the data science project can be used. The more focused they are on this the more successful the project will be. They will then act as the key to informing (and selling) the rest of the business on the success of the project. This in turn creates more and more data science projects and will keep you busy for a long time to come.
- Ticking the box: Unfortunately I’ve seen this in way too many companies. Board level or the senior management team have heard about data science and all the magic that it can produce. The message is then passed down through the organisation that we need to be doing more and more of this. A business unit is chosen for the pilot project. The pilot is completed, successfully, and the good news message is fed back up the ladder. But that is when enthusiasm ends. We have done a data science project, it was successful and now let’s move on to the next thing. I’ve seen pilot or POC projects that have proven to potentially save $10+M a year with a cost of $100K per year, being canned. Yes, I’ve been told this is fantastic, this is beyond our wildest dreams. Only for nothing else to happen.
- The data is no good: You need data, you need historical data. The more you have the more useful it will be for the data science project. But what if the data is of poor quality? How can this happen? Well it can happen very frequently. You may have applications that are poorly designed, that have a very poor data model, the staff are not trained correctly to ensure that good data gets entered, etc. The list could go on and on. It is one thing for an application to capture data, but if that data cannot be used for any meaningful purposes then it has very little value. Some companies have people hired to constantly inspect the data, assess the quality of the data and feed back ideas on how to improve the quality of the data captured by the applications and also by the people inputting the data. Without good quality data there is very little a data science project can do to magically convert it into good quality data. I’ve been in the situation where >90% of the data was unusable. We gave them a list of the improvements they needed to make and told them to only come back to us when they had completed these and had at least 6 months of good quality data. We might be able to do something then. We never heard from them again. Also I get to talk to a lot of start ups who want to have data science built in from day one. These have very little ‘real’ data. Again I get to tell them to come back to me when they have 6 months of data.
- Too much focus on descriptive analytics: Although descriptive analytics is an important step in the early stages of all data science projects, there is still a huge number of consulting and product companies who are promoting this as a data science project. Like I said, descriptive analytics is an important step, but it doesn’t end there. It is just the beginning. When selecting a consulting or product company to partner with on your data science projects you need to ensure that they are offering more than just descriptive analytics. In a similar way to what I’ve mentioned in the points above, you need to look at how you can make use of these descriptive analytics and share them with the wider community in your company. But you also need to have some control over the proliferation of various visualisation tools. Descriptive analytics and visualisation is not data science or a major output of data science. It is only one part of a data science project, and far more valuable outputs from a data science project can be achieved by using one or more of the advanced analytics methods that are available to you.
- Ignoring your BI/DW: Unfortunately when it comes to a lot of data science projects you have two very different approaches to working with the data. One approach seems to be that we will look at your data that is available in the transactional databases (and other data sources), we will then look at how to integrate and clean this data before getting on to the fun stuff of exploring and then performing the advanced analytics. This approach completely ignores the BI team and any data warehouse that might exist. If a data warehouse already exists then it probably contains all or most of the data you are going to use. Therefore you can avoid all that time spent integrating and cleaning the data. The data warehouse will have this done for you. Plus the data warehouse will have a lot more data than what the current transactional databases will contain. Please, Please, Please use the data in the data warehouse and you will find that you will save a lot of time on your data science project. In addition to the time saved you will have a lot more (possibly years of) data to work with. I always try to work with data warehouse data. When I do I can go back 5 years and build predictive models from back then. I can then roll these through various time periods and can easily measure how good the level of prediction is. I also get to see if there are any changes in the data and how they affect the models. Plus I also get to see how the various algorithms and their associated models change and evolve over time. This allows me to demonstrate to the customer how the use of data science and predictive models works with their data over the past 5 years. This builds up confidence with the customer on what is being done and what can be achieved. In one case I was able to demonstrate that if they had implemented my solution 5 years ago, they would have saved $40+M in that time period. If I didn’t use the data warehouse I wouldn’t have been able to prove this.
Needless to say the customer was very happy.
- Make up of team is wrong: You don’t need a team of PhDs: There has been lots written about what the make-up of skills in your data science team should be. A few years ago all the talk was that you need to have people with PhDs in maths, stats or related areas. Plus all you needed to do was to hire one of these. We all know that this is not true but was part of the rubbish that people were talking about. We all know that you really need a team of people and perhaps you already have some of these people employed in your company. You have database people, you have ETL people, you have data integration people, you have data analysts, you have project managers, you have business analysts, you have domain experts, etc. How many of those people have PhDs or require a PhD to do their job? But perhaps you don’t have people with the skills of applying advanced analytic techniques to your data and business problems. Perhaps it is these people who you really need the most. Do these people really need to have a PhD? No they don’t. You need someone who knows and understands the various techniques and most importantly how to use these to solve business problems. All too often people try to show off about using a particular technique or parameter setting, or a particular formula, or graphic technique, or using a certain language over another, or what library or package is the best. Don’t engage in this. Look for people that can apply the correct technique or combination of techniques to your business problems. But despite what I said in the first two points, as your data management requirements grow you are going to need some additional people with big data technology skills.
- Communication: Being able to explain what data science can do, what it is producing and relating that back to the business. Being able to work with the management team, end users and all involved to show and explain what the data science project can do to support their work, and how. Most technical people are not good at this. But some people are, and these are a very valuable resource as part of your data science team, or as keen supporters of what data science can do and how it can be used to help the business develop new and interesting actionable insights.
- The output is not a report => You need to operationalise/productionalise the data science project: See the point above on productionalising your data science work. The outputs should not be a report or a list of some form. With proper planning data science can become central to all the operational systems in your company. It can help you make better and quicker decisions on how you interact with your customers, improve the efficiencies of your processes, etc. The list goes on and on. All data science projects are cyclical in nature. For example, you develop a churn prediction system. You use this to interact with your customers. You are trying to change or alter their behaviour and this in turn changes them as a customer. This in turn affects the churn prediction system. It will no longer be as effective. So you will need to update it on a semi-regular basis. This could be every 3, 4, 6, or 12 months. It all depends. You can build checks into your productionalised data science projects to detect when the predictive models need updating. This in turn helps your data science team to be more productive, with quicker turnaround times for each iteration. Also with each iteration you can look to see if new data is available for you to include and use. Maybe at this point some of your big data sources are coming online with some useful data.
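As one possible shape for the kind of check mentioned in the last point, here is a minimal sketch in plain Python that compares a model's recent accuracy against the accuracy it had when it was deployed. The function name `needs_retraining`, the 5-point tolerance and the minimum record count are assumptions chosen for illustration, not recommended values.

```python
def needs_retraining(baseline_accuracy, recent_outcomes,
                     tolerance=0.05, min_records=100):
    """Flag a model for updating when its recent accuracy has dropped.

    recent_outcomes is an iterable of (predicted, actual) pairs collected
    after deployment. The thresholds are illustrative assumptions.
    """
    pairs = list(recent_outcomes)
    if len(pairs) < min_records:
        return False  # not enough evidence yet to make a call
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    recent_accuracy = correct / len(pairs)
    return recent_accuracy < baseline_accuracy - tolerance

# Hypothetical churn model that scored 90% accuracy when it was deployed.
still_good = [(1, 1)] * 110 + [(1, 0)] * 10   # about 92% correct
degraded = [(1, 1)] * 80 + [(1, 0)] * 40      # about 67% correct

print(needs_retraining(0.90, still_good))
print(needs_retraining(0.90, degraded))
```

A check like this could run on a schedule alongside the model, so that the decision to refresh it is triggered by evidence instead of by someone remembering to look.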
So when looking to start a Data Science project it is important to know a few things before you start. The following uses the 5 W’s to try to explain these.
- what you are doing
- why you are doing it
- who it is for and what they will gain from it
- where will it be used within your applications/processes
- when you are going to commence the project and how it will fit into strategic goals of your organisation
There has been plenty written about what magic Data Science projects will produce and bring to your organisation. You need to be careful of people who only talk about the magic. You also need to understand that it may not work or deliver what you are led to believe. In all the projects I’ve worked on we have had some amazing results. But in one or two projects we have had results that were only a percentage or two better than what they were already doing.
Perhaps I need to write another blog post on ‘Why Data Science projects succeed’, and this will only be based on what I’ve experienced (in the real-world).
Like I said at the beginning, this is not an exhaustive list. There are many more and I’m sure you will have a few of your own. These are the typical reasons that I’ve come across in my 20 years of doing these kinds of projects, long before the term data science existed.