Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. It book was published as part of their Essentials Series.
During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!
2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.
The Japanese and Turkish translations will be available in a few months!
Go get the English version of the book on Amazon in print, Kindle and Audio formats.
This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenge the goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.
Go check it out.
With every data analytics and data science project, one of the first tasks to that everyone needs to do is to profile the data sets. Data profiling allows you to get an initial picture of the data set, see data distributions and relationships. Additionally it allows us to see what kind of data cleaning and data transformations are necessary.
Most data analytics tools and languages have some functionality available to help you. Particular the various data science/machine learning products have this functionality built-in them and can do a lot of the data profiling automatically for you. But if you don’t use these tools/products, then you are probably using R and/or Python to profile your data.
With Python you will be working with the data set loaded into a Pandas data frame. From there you will be using various statistical functions and graphing functions (and libraries) to create a data profile. From there you will probably create a data profile report.
But one of the challenges with doing this in Python is having different coding for handling numeric and character based attributes/features. The describe function in Python (similar to the summary function in R) gives some statistical summaries for numeric attributes/features. A different set of functions are needed for character based attributes. The Python Library repository (https://pypi.org/) contains over 200K projects. But which ones are really useful and will help with your data science projects. Especially with new projects and libraries being released on a continual basis? This is a major challenge to know what is new and useful.
For example the followings shows loading the titanic data set into a Pandas data frame, creating a subset and using the describe function in Python.
import pandas as pd df = pd.read_csv("/Users/brendan.tierney/Dropbox/4-Datasets/titanic/train.csv") df.head(5)
df2 = df.iloc[:,[1,2,4,5,6,7,8,10,11]] df2.head(5)
You will notice the describe function has only looked at the numeric attributes.
One of those 200+k Python libraries is one called pandas_profiling. This will create a data audit report for both numeric and character based attributes. This most be good, Right? Let’s take a look at what it does.
For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values matrix, count, heatmap and dendrogram of missing values
The first step is to install the pandas_profiling library.
pip3 install pandas_profiling
Now run the pandas_profiling report for same data frame created and used, see above.
import pandas_profiling as pp df2.profile_report()
The following images show screen shots of each part of the report. Click and zoom into these to see more details.
When working with data sets for machine learning, lots of these data sets and examples we see have approximately the same number of case records for each of the possible predicted values. In this kind of scenario we are trying to perform some kind of classification, where the machine learning model looks to build a model based on the input data set against a target variable. It is this target variable that contains the value to be predicted. In most cases this target variable (or feature) will contain binary values or equivalent in categorical form such as Yes and No, or A and B, etc or may contain a small number of other possible values (e.g. A, B, C, D).
For the classification algorithm to perform optimally and be able to predict the possible value for a new case record, it will need to see enough case records for each of the possible values. What this means, it would be good to have approximately the same number of records for each value (there are many ways to overcome this and these are outside the score of this post). But most data sets, and those that you will encounter in real life work scenarios, are never balanced, as in having a 50-50 split. What we typically encounter might be a 90-10, 98-2, etc type of split. These data sets are said to be imbalanced.
The image above gives examples of two approaches for creating a balanced data set. The first is under-sampling. This involves reducing the class that contains the majority of the case records and reducing it to match the number of case records in the minor class. The problems with this include, the resulting data set is too small to be meaningful, the case records removed could contain important records and scenarios that the model will need to know about.
The second example is creating a balanced data set by increasing the number of records in the minority class. There are a few approaches to creating this. The first approach is to create duplicate records, from the minor class, until such time as the number of case records are approximately the same for each class. This is the simplest approach. The second approach is to create synthetic records that are statistically equivalent of the original data set. A commonly technique used for this is called SMOTE, Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors algorithm to generate new and synthetic data we can use for training our model. But one of the issues with SMOTE is that it will not create sample records outside the bounds of the original data set. As you can image this would be very difficult to do.
The following examples will illustrate how to perform Under-Sampling and Over-Sampling (duplication and using SMOTE) in Python using functions from Pandas, Imbalanced-Learn and Sci-Kit Learn libraries.
NOTE: The Imbalanced-Learn library (e.g. SMOTE)requires the data to be in numeric format, as it statistical calculations are performed on these. The python function get_dummies was used as a quick and simple to generate the numeric values. Although this is perhaps not the best method to use in a real project. With the other sampling functions can process data sets with a sting and numeric.
Data Set: Is the Portuaguese Banking data set and is available on the UCI Data Set Repository, and many other sites. Here are some basics with that data set.
import warnings import pandas as pd import numpy as np import matplotlib.pyplot as plt get_ipython().magic('matplotlib inline') bank_file = ".../bank-additional-full.csv" # import dataset df = pd.read_csv(bank_file, sep=';',) # get basic details of df (num records, num features) df.shape
df['y'].value_counts() # dataset is imbalanced with majority of class label as "no".
no 36548 yes 4640 Name: y, dtype: int64
#print bar chart df.y.value_counts().plot(kind='bar', title='Count (target)');
Example 1a – Down/Under sampling the majority class y=1 (using random sampling)
count_class_0, count_class_1 = df.y.value_counts() # Divide by class df_class_0 = df[df['y'] == 0] #majority class df_class_1 = df[df['y'] == 1] #minority class # Sample Majority class (y=0, to have same number of records as minority calls (y=1) df_class_0_under = df_class_0.sample(count_class_1) # join the dataframes containing y=1 and y=0 df_test_under = pd.concat([df_class_0_under, df_class_1]) print('Random under-sampling:') print(df_test_under.y.value_counts()) print("Num records = ", df_test_under.shape) df_test_under.y.value_counts().plot(kind='bar', title='Count (target)');
Random under-sampling: 1 4640 0 4640 Name: y, dtype: int64 Num records = 9280
Example 1b – Down/Under sampling the majority class y=1 using imblearn
from imblearn.under_sampling import RandomUnderSampler X = df_new.drop('y', axis=1) Y = df_new['y'] rus = RandomUnderSampler(random_state=42, replacement=True) X_rus, Y_rus = rus.fit_resample(X, Y) df_rus = pd.concat([pd.DataFrame(X_rus), pd.DataFrame(Y_rus, columns=['y'])], axis=1) print('imblearn over-sampling:') print(df_rus.y.value_counts()) print("Num records = ", df_rus.shape) df_rus.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 1a]
Example 1c – Down/Under sampling the majority class y=1 using Sci-Kit Learn
from sklearn.utils import resample print("Original Data distribution") print(df['y'].value_counts()) # Down Sample Majority class down_sample = resample(df[df['y']==0], replace = True, # sample with replacement n_samples = df[df['y']==1].shape, # to match minority class random_state=42) # reproducible results # Combine majority class with upsampled minority class train_downsample = pd.concat([df[df['y']==1], down_sample]) # Display new class counts print('Sci-Kit Learn : resample : Down Sampled data set') print(train_downsample['y'].value_counts()) print("Num records = ", train_downsample.shape) train_downsample.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 1a]
Example 2 a – Over sampling the minority call y=0 (using random sampling)
df_class_1_over = df_class_1.sample(count_class_0, replace=True) df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0) print('Random over-sampling:') print(df_test_over.y.value_counts()) df_test_over.y.value_counts().plot(kind='bar', title='Count (target)');
Random over-sampling: 1 36548 0 36548 Name: y, dtype: int64
Example 2b – Over sampling the minority call y=0 using SMOTE
from imblearn.over_sampling import SMOTE print(df_new.y.value_counts()) X = df_new.drop('y', axis=1) Y = df_new['y'] sm = SMOTE(random_state=42) X_res, Y_res = sm.fit_resample(X, Y) df_smote_over = pd.concat([pd.DataFrame(X_res), pd.DataFrame(Y_res, columns=['y'])], axis=1) print('SMOTE over-sampling:') print(df_smote_over.y.value_counts()) df_smote_over.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 2a]
Example 2c – Over sampling the minority call y=0 using Sci-Kit Learn
from sklearn.utils import resample print("Original Data distribution") print(df['y'].value_counts()) # Upsample minority class train_positive_upsample = resample(df[df['y']==1], replace = True, # sample with replacement n_samples = train_zero.shape, # to match majority class random_state=42) # reproducible results # Combine majority class with upsampled minority class train_upsample = pd.concat([train_negative, train_positive_upsample]) # Display new class counts print('Sci-Kit Learn : resample : Up Sampled data set') print(train_upsample['y'].value_counts()) train_upsample.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 2a]