Python Data Profiling libraries
One of the most common, and sometimes boring, tasks when working with datasets is writing code to profile the data. Most data scientists will have built a set of tools or scripts to help them with this regular and slightly boring task. As with most IT tasks, we should automate what we can so that we can spend more time on more important work, such as deriving insights and delivering value to the business, instead of repeatedly writing code to produce various statistics about the data and draw pretty pictures.
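To make the point concrete, here is a minimal sketch of the kind of hand-written profiling code I mean, using only pandas. The function name profile_frame and the sample data are made up for illustration; the libraries covered in this post produce far richer output than this.

```python
import pandas as pd


def profile_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of basic profiling statistics per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),                       # column data type
        "non_null": df.count(),                               # number of non-missing values
        "missing_pct": (df.isna().mean() * 100).round(1),     # percentage of missing values
        "unique": df.nunique(),                               # distinct non-missing values
    })


# Small made-up sample, just to show the shape of the output
sample = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "sex": ["male", "female", "female", "male"],
})
print(profile_frame(sample))
```

Writing and maintaining this sort of helper for every project is exactly the repetitive work that is worth handing over to a library.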
I’ve written previously about automating this work and about using data profiling libraries to help with it. There are lots of packages available on pypi.org and on GitHub. Below I give examples of five Python data profiling libraries, with links to their GitHubs.
Pandas Profiling is probably one of the better and more popular Python libraries for exploring data. The aim is to make it as simple as possible, using one line of code.
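# Importing pandas_profiling adds profile_report() to pandas DataFrames
import pandas_profiling as pp
# df2 is an existing pandas DataFrame
df2.profile_report()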
Following the one-line-of-code approach, Skimpy is a lightweight tool that provides summary statistics about variables in data frames. Its authors like to think of Skimpy as a super-charged version of df.describe(). Skimpy also has some automated data cleaning functions.
from skimpy import skim
skim(df)
Dataprep has multiple features, the two main ones being EDA (Exploratory Data Analysis) and Data Cleaning. Its EDA functionality is built to scale to larger data sets and provides some interactive charts.
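from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report
# Load the sample Titanic dataset that ships with Dataprep
df = load_dataset("titanic")
# Overview of every column
plot(df)
# Missing values across the whole dataset
plot_missing(df)
# Missing values in the Age column
plot_missing(df, "Age")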
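As I understand it, plot_missing(df, "Age") looks at how the missing values in one column relate to the other columns. As a rough, pandas-only illustration of that kind of question, the sketch below compares the mean of one column between rows where another column is missing and rows where it is present. The helper name compare_by_missing and the sample values are made up, and this is not how Dataprep computes its plots.

```python
import pandas as pd


def compare_by_missing(df: pd.DataFrame, missing_col: str, other_col: str) -> pd.Series:
    """Mean of other_col, split by whether missing_col is missing."""
    # Label each row as "missing" or "present" based on missing_col
    status = df[missing_col].isna().map({True: "missing", False: "present"})
    return df[other_col].groupby(status).mean()


# Made-up sample: Age is missing in the second and fourth rows
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})
fare_by_age_status = compare_by_missing(sample, "Age", "Fare")
print(fare_by_age_status)
```

If the two groups differ a lot, the missing values are probably not random, which is worth knowing before you decide how to fill them in or drop them.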
Sweetviz creates high-density visualizations to help kickstart EDA with just two lines of code. Output is a fully self-contained HTML application.
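import pandas as pd
import sweetviz as sv
df = pd.read_csv('../input/titanic/train.csv')
# Analyze the data with "Survived" as the target feature
report = sv.analyze(df, "Survived")
# Write the report out as an HTML file
report.show_html()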
Autoviz focuses on visualizing the relationships in the data. It can find the most impactful features and plot creative visualizations of them.
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz('titanic_train.csv')
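To give a feel for what "finding the most impactful features" can mean, here is a simple stand-in using only pandas: it ranks numeric columns by the absolute value of their correlation with a target column. This is my own simplified sketch and not the method Autoviz uses; the function name rank_features and the sample values are made up.

```python
import pandas as pd


def rank_features(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric columns by absolute Pearson correlation with target."""
    numeric = df.select_dtypes(include="number")
    # Correlation of every numeric column with the target, minus the target itself
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Made-up sample loosely shaped like the Titanic data
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Fare": [7.0, 70.0, 50.0, 8.0, 60.0],
    "Age": [22, 38, 26, 35, 30],
})
ranking = rank_features(sample, "Survived")
print(ranking)
```

Correlation only picks up linear relationships between numeric columns, so treat a ranking like this as a first look and not as a final answer.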
Always try to automate the boring tasks. Using one of these packages is a step towards doing that for any Data Analyst, Data Scientist, Data Engineer, Machine Learning Engineer, AI Engineer, etc.
i-BI : A new name for real BI
I’ve been working in the BI and related fields since the mid 90s. Over the past number of years I’ve gotten a little bit confused about what Business Intelligence (BI) really means. Maybe it’s just a bit of old age kicking in way too early.
It seems to me that the term Business Intelligence has been hijacked by a large number of companies and software vendors. It seems that every “reporting tool” has been re-labelled as a Business Intelligence tool without providing any real intelligence features. They are still just reporting tools. Yes, they do have some nice graphics that can be used instead of just listing numbers. But that is not Business Intelligence.
Business Intelligence goes beyond what these tools are capable of. Most of the skills and abilities for BI come from the people who are doing it, not the tools. In reality you will need to use a number of tools, or write some custom code, to help you gain the extra bit of insight into your data. The “reporting tools” can then deliver the results.
Ralph Kimball also said a long time ago that someone working in the DW/BI area needs to be half-DBA and half-MBA.
A quote that I heard recently from the Predictive Analytics World Conference was “You need to be able to ask the right question”. This is to ensure that you can frame your analytics projects correctly and be able to measure the results.
I think that this question was key back in the mid 90s when I started out in the BI field and I still think it applies to all areas of BI. The thing that we have lost in BI is the real intelligence part of it.
So I’m proposing a new name for real BI. It is intelligent-Business Intelligence (i-BI).
Let’s differentiate between BI and the real intelligent BI work.
What do I mean by intelligent BI (i-BI)? What I mean are skills in Data Warehousing, Time Series Analysis, Advanced Analytics, Data Mining, Predictive Analysis, solving or addressing real business problems, etc.
Or maybe I’m just wrong and have missed some developments in BI over the past 16+ years. Or maybe I’m becoming a bit too cynical.
What do you think?