With every data analytics and data science project, one of the first tasks to that everyone needs to do is to profile the data sets. Data profiling allows you to get an initial picture of the data set, see data distributions and relationships. Additionally it allows us to see what kind of data cleaning and data transformations are necessary.
Most data analytics tools and languages have some functionality available to help you. Particular the various data science/machine learning products have this functionality built-in them and can do a lot of the data profiling automatically for you. But if you don’t use these tools/products, then you are probably using R and/or Python to profile your data.
With Python you will be working with the data set loaded into a Pandas data frame. From there you will be using various statistical functions and graphing functions (and libraries) to create a data profile. From there you will probably create a data profile report.
But one of the challenges with doing this in Python is having different coding for handling numeric and character based attributes/features. The describe function in Python (similar to the summary function in R) gives some statistical summaries for numeric attributes/features. A different set of functions are needed for character based attributes. The Python Library repository (https://pypi.org/) contains over 200K projects. But which ones are really useful and will help with your data science projects. Especially with new projects and libraries being released on a continual basis? This is a major challenge to know what is new and useful.
For example the followings shows loading the titanic data set into a Pandas data frame, creating a subset and using the describe function in Python.
import pandas as pd df = pd.read_csv("/Users/brendan.tierney/Dropbox/4-Datasets/titanic/train.csv") df.head(5)
df2 = df.iloc[:,[1,2,4,5,6,7,8,10,11]] df2.head(5)
You will notice the describe function has only looked at the numeric attributes.
One of those 200+k Python libraries is one called pandas_profiling. This will create a data audit report for both numeric and character based attributes. This most be good, Right? Let’s take a look at what it does.
For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values matrix, count, heatmap and dendrogram of missing values
The first step is to install the pandas_profiling library.
pip3 install pandas_profiling
Now run the pandas_profiling report for same data frame created and used, see above.
import pandas_profiling as pp df2.profile_report()
The following images show screen shots of each part of the report. Click and zoom into these to see more details.
Everyone is doing advanced analytics. Right? Hmm
Everyone is talking about advanced analytics? Yes that is true.
Everyone is an expert in advanced analytics? This is so not true. Watch out for these Great Pretenders. You know what I mean! You know who I mean! Maybe you know some of them already? If not, watch out for these Great Pretenders!!!
Some people are going around talking about data mining, predictive analytics, advanced analytics, machine learning etc as if this is some new topic. Well it isn’t. It isn’t anything new and most of the techniques have been about for 10, 20, 30+ years.
Some people are saying you should only use language X or tool Y because. Everything else is basically rubbish.
What we do have is a wider understanding of how to use these techniques on our various data sources.
What we have is a lot more tools that allow us to perform these tasks a lot easier, at greater speed, with more functionality and without the need to fully understand the hard core maths that is going on behind the scenes.
What we have is a lot more languages to perform these tasks and to support the vast amount of work that goes into understanding the data and preparing the data.
Someone thing for all of us to watch out for, when we ready about these topics, is what kind of problem area they are addressing. The following table illustrates the three main types or categories of Analytics. These categories are Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. I think most people would agree that the Descriptive and Predictive Analytics categories are very mature at this stage. With Predictive Analytics we are perhaps still evolving in this category and a lot more work needs to be done before this this become wide spread.
Some people talk as if Predictive Analytics is some new and exciting topic. But isn’t all that new. It was been around for the past 30+ years. If you go back over the Gartner Hype Cycle that comes out every September, Predictive Analytics is no longer being shown on this graph. The last time it appeared on the Gartner Hype Cycle was back in 2013 and it was positioned on the far right of the graph in the section called Plateau of Productivity.
So Predictive Analytics is very mature and main stream. Part of the reason that it is main stream is that Predictive Analytics has allowed for a new category of Analytics to evolve and this is Automatic Analytics.
Automatic Analytics is where Advanced and Predictive Analytics has been build into our day to day applications that are used to run our business. We do not need the hard core type of data scientists to perform various analytic on our data. Instead these task, once they have been defined, can then be added to our applications to process, evaluate and make decisions all automatically. This is were we need the data scientists to be able to communicate with the business and be able to work with them to solve real world business projects. This is a different type of data scientist to the “hard” core data scientist who delves into the various statistical methods, machine learning methods, data management methods, etc.
The following table extends the table given above to include Automatic Analytics, and is my own take on how and where Automatic Analytics fits.
Every time we get an insurance quote, health insurance quote, get a “random” call from our Telco offering a free upgrade, get our loyalty card statements, get a loan from the bank, look at or buy a book on Amazon, etc. the list could go on and on, but these are all examples of how predictive analytics has been automated into our everyday business application.
But this is nothing new. When I first got into data mining/predictive analytics over 16 years ago, it was considered a common thing that certain types of companies did. What has happened in the time since and particularly in the past few years is that a lot more people are seeing the value in using it.
Before I finish off this post we can have a quick look at what Oracle has been doing in this area. They have their Advanced Analytics Option and Real-Time Decisions tools to all data scientists do their magic. But over the past X years (nobody can give me an exact number) they have been very, very active in building in lots and lots of predictive analytics into their various business applications, particularly with into with Fusion Apps and BI Apps.
A recent quote from Oracle highlights their aim with this,
” … products designed to close the gap between data scientists and businesses.“
Now with Oracle making a big push to the cloud, they are busy adding in more and more Automatic (Predictive) Analytics into their Cloud Applications. What we need from Oracle is a clearer identification of where they have done this. Plus with the migration of their Apps to the cloud, their Advanced Analytics Option is a core part of their Cloud platform. As they upgrade or add new features into their Cloud Apps, you will now be able to get the benefit of these Automatic (Predictive) Analytics as they come available.