Analytics

Automated Data Visualizations in Python

Posted on Updated on

Creating data visualizations in Python can be a challenge. For some it an be easy, but for most (and particularly new people to the language) they always have to search for the commands in the documentation or using some search engine. Over the past few years we have seem more and more libraries coming available to assist with many of the routine and tedious steps in most data science and machine learning projects. I’ve written previously about some data profiling libraries in Python. These are good up to a point, but additional work/code is needed to explore the data to suit your needs. One of these Python libraries, designed to make your initial work on a new data set easier is called AutoViz. It’s good to see there is continued development work on this library, which can be really help for creating initial sets of charts for all the variables in your data set, plus it has some additional features which help to make it very useful and cuts down on some of the additional code you might need to write.

The outputs from AutoViz are very extensive, and are just too long to show in this post. The images below will give you an indication of what if typically generated. I’d encourage you to install the library and run it on one of your data sets to see the full extent of what it can do. For this post, I’ll concentrate on some of the commands/parameters/settings to get the most out of AutoViz.

Firstly there is the install via pip command or install using Anaconda.

pip3 install autoviz

For comparison purposes I’m going to use the same data set as I’ve used in the data profiling post (see above). Here’s a quick snippet from that post to load the data and perform the data profiling (see post for output)

import pandas as pd
import pandas_profiling as pp

#load the data into a pandas dataframe
df = pd.read_csv("/Users/brendan.tierney/Downloads/Video_Games_Sales_as_at_22_Dec_2016.csv")

We can not skip to using AutoViz. It supports the loading and analzying of data sets direct from a file or from a pandas dataframe. In the following example I’m going to use the dataframe (df) created above.

from autoviz import AutoViz_Class

AV = AutoViz_Class()
df2 = AV.AutoViz(filename="", dfte=df)  #for a file, fill in the filename and remove dfte parameter

This will analyze the data and create lots and lots of charts for you. Some are should in the following image. One the helpful features is the ‘data cleaning improvements’ section where it has performed a data quality assessment and makes some suggestions to improve the data, maybe as part of the data preparation/cleaning step.

There is also an option for creating different types of out put with perhaps the Bokeh charts being particularly useful.

  • chart_format='bokeh': interactive Bokeh dashboards are plotted in Jupyter Notebooks.
  • chart_format='server', dashboards will pop up for each kind of chart on your web browser.
  • chart_format='html', interactive Bokeh charts will be silently saved as Dynamic HTML files under AutoViz_Plots directory
df2 = AV.AutoViz(filename="", dfte=df, chart_format='bokeh')

The next example allows for the report and charts to be focused on a particular dependent or target variable, particular in scenarios where classification will be used.

df2 = AV.AutoViz(filename="", dfte=df, depVar="Platform")

A little warning when using this library. It can take a little bit of time for it to run and create all the different charts. On the flip side, it save you from having to write lots and lots of code!

Advertisement

CAO Points 2022 – Grade inflation, deflation or in-line

Posted on Updated on

Last week I wrote a blog post analysing the Leaving Cert results over the past 3-8 years. Part of that post also looked at the claim from the Dept of Education saying the results in 2022 would be “in-line on aggregate” with the results from 2021. The outcome of the analysis was grade deflation was very evident in many subjects, but when analysed and profiled at a very high level, they did look similar.

I didn’t go into how that might impact on the CAO (Central Applications Office) Points. If there was deflation in some of the core and most popular subjects, then you might conclude there could be some changes in the profile of CAO Points being awarded, and that in turn would have a small change on the CAO Points needed for a lot of University courses. But not all of them, as we saw last week, the increased number of students who get grades in the H4-H7 range. This could mean a small decrease in points for courses in the 520+ range, and a small increase in points needed in the 300-500-ish range.

The CAO have published the number of students of each 10 point range. I’ve compared the 2022 data, with each year going back to 2015. The following table is a high level summary of the results in 50 point ranges.

An initial look at these numbers and percentages might look like points are similar to last year and even 2020. But for 2015-2019 the similarity is closer. Again looking back at the previous blog post, we can see the results profiles for 20215-2019 are broadly similar and does indicate some normalisation might have been happening each year. The following chart illustrate the percentage of students who achieved points in each range.

From the above we can see the profile is similar across 2015-2019, although there does seem to be a flattening of the curve between 2015-2016!

Let’s now have a look at 2019 (the last pre-coivd year), 2021 and 2022. This will allow use to compare the “inflated” years to the last “normal” year.

This chart clearly shows a shifting of the profile to the left for the red line which represents 2022. This also supports my blog post last week, and that the Dept of Education has started the process of deflating marks.

Based on this shifting/deflating of marks, we could see the grade/CAO Points profiles reverting back to almost 2019 profile by 2025. For students sitting the Leaving Cert in 2023, there will be another shift to the left, and with another similar shift in 2024. In 2024, the students will be the last group to sit the Leaving Cert who were badly affected during the Covid years. Many of them lost large chunks on school and many didn’t sit the Junior Cert. I’d predict 2025 will see the first time the marks/points profiles will match pre-covid years.

For this analysis I’ve used a variety of tools including Excel, Python and Oracle Analytics.

The Dataset used can be found under Dataset menu, and listed as ‘CAO Points Profiles 2015-2022’. Also, check out the Leaving Certificate 2015-2022 dataset.

Leaving Cert 2022 Results – Inflation, deflation or in-line!

Posted on Updated on

The Leaving Certificate 2022 results are out. Up and down the country there are people who are delighted with their results, while others are disappointed, and lots of other emotions.

The Leaving Certificate is the terminal examination for secondary education in Ireland, with most students being examined in seven subjects, with their best six grades counting towards their “points”, which in turn determines what university course they might get. Check out this link for learn more about the Leaving Certificate.

The Dept of Education has been saying, for several months, this results this year (2022) will be “in-line on aggregate” with the results from 2021. There has been some concerns about grade inflation in 2021 and the impact it will have on the students in 2022 and future years. At some point the Dept of Education needs to address this grade inflation and look to revert back to the normal profile of grades pre-Covid.

Let’s have a look to see if this is true, and if it is true when we look a little deeper. Do the aggregate results hide grade deflation in some subjects.

For the analysis presented in this blog post, I’ve just looked at results at Higher Level across all subjects, and for the deeper dive I’ll look at some of the most popular subjects.

Firstly let’s have a quick look at the distribution of grades by subject for 2022 and 2021.

Remember the Dept of Education said the 2022 results should be in-line with the results of 2021. This required them to apply some adjustments, after marking the exam scripts, to give an updated profile. The following chart shows this comparison between the two years. On initial inspection we can see it is broadly similar. This is good, right? It kind of is and at a high level things look broadly in-line and maybe we can believe the Dept of Education. Looking a little closer we can see a small decrease in the H2-H4 range, and a slight increase in the H5-H8.

Let’s dive a little deeper. When we look at the grade profile of students in 2021 and 2022, How many subjects increased the number of students at each grade vs How many subjects decreased grades vs How many approximately stated the same. The table below shows the results and only counts a change if it is greater than 1% (to allow for minor variations between years).

This table in very interesting in that more subjects decreased their H1s, with some variation for the H2-H4s, while for the lower range of H5-H7 we can see there has been an increase in grades. If I increased the margin to 3% we get a slightly different results, but only minor changes.

“in-line on aggregate” might be holding true, although it appears a slight increase on the numbers getting the lower grades. This might indicate either more of an adjustment to weaker students and/or a bit of a down shifting of grades from the H2-H4 range. But at the higher end, more subjects reduced than increase. The overall (aggregate) numbers are potentially masking movements in grade profiles.

Let’s now have a look at some of the core subjects of English, Irish and Mathematics.

For English, it looks like they fitted to the curve perfectly! keeping grades in-line between the two years. Mathematics is a little different with a slight increase in grades. But when you look at Irish we can see there was definite grade deflation. For each of these subjects, the chart on the left contains four years of data including 2019 when the last “normal” leaving certificate occurred. With Irish the grade profile has been adjusted (deflated) significantly and is closer to 2019 profile than it is to 2021. There was been lots and lots of discussions nationally about how and when grades will revert to normal profile. The 2022 profile for Irish seems to show this has started to happen in this subject, which raises the question if this is occurring in any other subjects, and is hidden/masked by the “in-line on aggregate” figures.

This blog post would become just too long if I was to present the results profile for each of the 42+ subjects.

Let’s have a look as two of the most common foreign languages, French and Spanish.

Again we can see some grade deflation, although not to be same extent as Irish. For both French and Spanish, we have reduced numbers for the H2-H4 range and a slight increase for H5-H7, and shift to the left in the profile. A slight exception is for those getting a H1 for both subjects. The adjustment in the results profile is more pronounced for French, and could indicate some deflation adjustments.

Next we’ll look at some of the science subjects of Physics, Chemistry and Biology.

These three subjects also indicate some adjusts back towards the pre-Covid profile, with exception of H1 grades. We can see the 2022 profile almost reflect the 2019 profile (excluding H1s) and for Biology appears to be at a half way point between 2019 and 2022 (excluding H1s)

Just one more example of grade deflation, and this with Design, Communication and Graphics (or DCG)

Yes there is obvious grade deflation and almost back to 2019 profile, with the exception of H1s again.

I’ve mentioned some possible grade deflation in various subjects, but there are also subjects where the profile very closely matches the 2021 profile. We have seen above English is one of those. Others include Technology, Art and Computer Science.

I’ve analyzed many more subjects and similar shifting of the profile is evident in those. Has the Dept of Education and State Examinations Commission taken steps to start deflating grades from the highs of 2021? I’d said the answer lies in the data, and the data I’ve looked at shows they have started the deflation process. This might take another couple of years to work out of the system and we will be back to “normal” pre-covid profiles. Which raises another interesting question, Was the grade profile for subjects, pre-covid, fitted to the curve? For the core set of subjects and for many of the more popular subjects, the data seems to indicate this. Maybe the “normal” distribution of marks is down to the “normal” distribution of abilities of the student population each year, or have grades been normalised in some way each year, for years, even decades?

For this analysis I’ve used a variety of tools including Excel, Python and Oracle Analytics.

The Dataset used can be found under Dataset menu, and listed as ‘Leaving Certificate 2015-2022’. An additional Dataset, I’ll be adding soon, will be for CAO Points Profiles 2015-2022.

Data Science (The MIT Press Essential Knowledge series) – available in English, Korean and Chinese

Posted on Updated on

Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. It book was published as part of their Essentials Series.

During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!

2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.

The Japanese and Turkish translations will be available in a few months!

Go get the English version of the book on Amazon in print, Kindle and Audio formats.

https://amzn.to/2qC84KN

This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenge the goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.

Go check it out.

Amazon.com.          Amazon.co.uk

Screenshot 2020-02-05 11.46.03

Data Profiling in Python

Posted on Updated on

With every data analytics and data science project, one of the first tasks to that everyone needs to do is to profile the data sets. Data profiling allows you to get an initial picture of the data set, see data distributions and relationships. Additionally it allows us to see what kind of data cleaning and data transformations are necessary.

Most data analytics tools and languages have some functionality available to help you. Particular the various data science/machine learning products have this functionality built-in them and can do a lot of the data profiling automatically for you. But if you don’t use these tools/products, then you are probably using R and/or Python to profile your data.

With Python you will be working with the data set loaded into a Pandas data frame. From there you will be using various statistical functions and graphing functions (and libraries) to create a data profile. From there you will probably create a data profile report.

But one of the challenges with doing this in Python is having different coding for handling numeric and character based attributes/features. The describe function in Python (similar to the summary function in R) gives some statistical summaries for numeric attributes/features. A different set of functions are needed for character based attributes. The Python Library repository (https://pypi.org/) contains over 200K projects. But which ones are really useful and will help with your data science projects. Especially with new projects and libraries being released on a continual basis? This is a major challenge to know what is new and useful.

For example the followings shows loading the titanic data set into a Pandas data frame, creating a subset and using the describe function in Python.

import pandas as pd

df = pd.read_csv("/Users/brendan.tierney/Dropbox/4-Datasets/titanic/train.csv")

df.head(5)

Screenshot 2019-11-22 16.58.39

df2 = df.iloc[:,[1,2,4,5,6,7,8,10,11]]
df2.head(5)

Screenshot 2019-11-22 16.59.30

df2.describe()

Screenshot 2019-11-22 17.00.17

You will notice the describe function has only looked at the numeric attributes.

One of those 200+k Python libraries is one called pandas_profiling. This will create a data audit report for both numeric and character based attributes. This most be good, Right?  Let’s take a look at what it does.

For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values matrix, count, heatmap and dendrogram of missing values

The first step is to install the pandas_profiling library.

pip3 install pandas_profiling

Now run the pandas_profiling report for same data frame created and used, see above.

import pandas_profiling as pp

df2.profile_report()

The following images show screen shots of each part of the report. Click and zoom into these to see more details.

Screenshot 2019-11-22 17.29.00Screenshot 2019-11-22 17.29.46

Screenshot 2019-11-22 17.30.57Screenshot 2019-11-22 17.31.32

Screenshot 2019-11-22 17.31.57Screenshot 2019-11-22 17.32.31

Screenshot 2019-11-22 17.33.02

 

Managing imbalanced Data Sets with SMOTE in Python

Posted on Updated on

When working with data sets for machine learning, lots of these data sets and examples we see have approximately the same number of case records for each of the possible predicted values. In this kind of scenario we are trying to perform some kind of classification, where the machine learning model looks to build a model based on the input data set against a target variable. It is this target variable that contains the value to be predicted. In most cases this target variable (or feature) will contain binary values or equivalent in categorical form such as Yes and No, or A and B, etc or may contain a small number of other possible values (e.g. A, B, C, D).

For the classification algorithm to perform optimally and be able to predict the possible value for a new case record, it will need to see enough case records for each of the possible values. What this means, it would be good to have approximately the same number of records for each value (there are many ways to overcome this and these are outside the score of this post). But most data sets, and those that you will encounter in real life work scenarios, are never balanced, as in having a 50-50 split. What we typically encounter might be a 90-10, 98-2, etc type of split. These data sets are said to be imbalanced.

Screenshot 2019-05-20 15.34.14

The image above gives examples of two approaches for creating a balanced data set. The first is under-sampling. This involves reducing the class that contains the majority of the case records and reducing it to match the number of case records in the minor class. The problems with this include, the resulting data set is too small to be meaningful, the case records removed could contain important records and scenarios that the model will need to know about.

The second example is creating a balanced data set by increasing the number of records in the minority class. There are a few approaches to creating this. The first approach is to create duplicate records, from the minor class, until such time as the number of case records are approximately the same for each class. This is the simplest approach. The second approach is to create synthetic records that are statistically equivalent of the original data set. A commonly technique used for this is called SMOTE, Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors algorithm to generate new and synthetic data we can use for training our model. But one of the issues with SMOTE is that it will not create sample records outside the bounds of the original data set. As you can image this would be very difficult to do.

The following examples will illustrate how to perform Under-Sampling and Over-Sampling (duplication and using SMOTE) in Python using functions from Pandas, Imbalanced-Learn and Sci-Kit Learn libraries.

NOTE: The Imbalanced-Learn library (e.g. SMOTE)requires the data to be in numeric format, as it statistical calculations are performed on these. The python function get_dummies was used as a quick and simple to generate the numeric values. Although this is perhaps not the best method to use in a real project. With the other sampling functions can process data sets with a sting and numeric.

Data Set:  Is the Portuaguese Banking data set and is available on the UCI Data Set Repository, and many other sites. Here are some basics with that data set.

import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().magic('matplotlib inline')

bank_file = ".../bank-additional-full.csv"

# import dataset
df = pd.read_csv(bank_file, sep=';',)

# get basic details of df (num records, num features)
df.shape
df['y'].value_counts() # dataset is imbalanced with majority of class label as "no".
no     36548
yes     4640
Name: y, dtype: int64
#print bar chart
df.y.value_counts().plot(kind='bar', title='Count (target)');

Screenshot 2019-05-27 09.38.36

Example 1a – Down/Under sampling the majority class y=1 (using random sampling)

count_class_0, count_class_1 = df.y.value_counts()

# Divide by class
df_class_0 = df[df['y'] == 0] #majority class
df_class_1 = df[df['y'] == 1] #minority class

# Sample Majority class (y=0, to have same number of records as minority calls (y=1)
df_class_0_under = df_class_0.sample(count_class_1)
# join the dataframes containing y=1 and y=0
df_test_under = pd.concat([df_class_0_under, df_class_1])

print('Random under-sampling:')
print(df_test_under.y.value_counts())
print("Num records = ", df_test_under.shape[0])

df_test_under.y.value_counts().plot(kind='bar', title='Count (target)');
Random under-sampling:
1    4640
0    4640
Name: y, dtype: int64
Num records =  9280

Screenshot 2019-05-27 09.41.06

Example 1b – Down/Under sampling the majority class y=1 using imblearn

from imblearn.under_sampling import RandomUnderSampler

X = df_new.drop('y', axis=1)
Y = df_new['y']

rus = RandomUnderSampler(random_state=42, replacement=True)
X_rus, Y_rus = rus.fit_resample(X, Y)

df_rus = pd.concat([pd.DataFrame(X_rus), pd.DataFrame(Y_rus, columns=['y'])], axis=1)

print('imblearn over-sampling:')
print(df_rus.y.value_counts())
print("Num records = ", df_rus.shape[0])

df_rus.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 1a]

Example 1c – Down/Under sampling the majority class y=1 using Sci-Kit Learn

from sklearn.utils import resample

print("Original Data distribution")
print(df['y'].value_counts())

# Down Sample Majority class
down_sample = resample(df[df['y']==0],
replace = True, # sample with replacement
n_samples = df[df['y']==1].shape[0], # to match minority class
random_state=42) # reproducible results

# Combine majority class with upsampled minority class
train_downsample = pd.concat([df[df['y']==1], down_sample])

# Display new class counts
print('Sci-Kit Learn : resample : Down Sampled data set')
print(train_downsample['y'].value_counts())
print("Num records = ", train_downsample.shape[0])
train_downsample.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 1a]

Example 2 a – Over sampling the minority call y=0 (using random sampling)

df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_test_over.y.value_counts())

df_test_over.y.value_counts().plot(kind='bar', title='Count (target)');
Random over-sampling:
1    36548
0    36548
Name: y, dtype: int64

Screenshot 2019-05-27 09.46.08

Example 2b – Over sampling the minority call y=0 using SMOTE

from imblearn.over_sampling import SMOTE

print(df_new.y.value_counts())
X = df_new.drop('y', axis=1)
Y = df_new['y']

sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X, Y)

df_smote_over = pd.concat([pd.DataFrame(X_res), pd.DataFrame(Y_res, columns=['y'])], axis=1)

print('SMOTE over-sampling:')
print(df_smote_over.y.value_counts())

df_smote_over.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 2a]

Example 2c – Over sampling the minority call y=0 using Sci-Kit Learn

from sklearn.utils import resample

print("Original Data distribution")
print(df['y'].value_counts())

# Upsample minority class
train_positive_upsample = resample(df[df['y']==1],
replace = True, # sample with replacement
n_samples = train_zero.shape[0], # to match majority class
random_state=42) # reproducible results

# Combine majority class with upsampled minority class
train_upsample = pd.concat([train_negative, train_positive_upsample])

# Display new class counts
print('Sci-Kit Learn : resample : Up Sampled data set')
print(train_upsample['y'].value_counts())
train_upsample.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 2a]