Python Data Profiling libraries

Posted on April 4, 2022 Updated on January 22, 2024

One of the most common, and sometimes boring, task when working with datasets is writing some code to profile the data. Most data scientists will have built a set of tools/scripts to help them with this regular and slightly boring task. As with most IT tasks we should be trying to automate what we can, to allow us to spend more time on more important tasks, such as deriving insights and delivering value to the business, instead of repeatedly writing code to produce various statistics about the data and drawing pretty pictures.

I’ve written previously about automating and using some data profiling libraries to help us with this task. There are lots of packages available on pypi.og and on GitHub. Below I give examples of 5 Python Data Profiling libraries, with links to their GitHubs.

pandas_profiling

This is probably one of the better and more popular Python libraries for exploring data. The aim is to make it as simple as possible using one line of code.

pandas-profiling package naming was changed. To continue profiling data use ydata-profiling instead

import pandas_profiling as pp

df2.profile_report()

2. skimpy

Following the line line of code approach skimpy is a light weight tool that provides summary statistics about variables in data frames. They like to thing skimpy is a super-charged version of df.describe(). Skimpy also has some automated data cleaning functions.

from skimpy import skim

skim(df)

3. dataprep

Dataprep has multiple features with the two main features being EDA (Exploratory Data Analysis) and Data Cleaning. For EDA functionality, it is build to scale for larger data sets and provides some interactive charts.

from dataprep.eda import *
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report

df = load_dataset("titanic")
plot(df)

plot_missing(df)
plot_missing(df, "Age")

4. SweetViz

Sweetviz creates high-density visualizations to help kickstart EDA with just two lines of code. Output is a fully self-contained HTML application.

import pandas as pd
import sweetviz as sv

df = pd.read_csv('../input/titanic/train.csv')
report = sweetviz.analyze(df, "Survived")

5. AutoViz

Autoviz works on visualizing the relationship of the data, it can find the most impactful features and plot creative visualization.

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

df = AV.AutoViz('titanic_train.csv')

Always try to automate the boring tasks, and using one of these packages is a step towards doing for for any Data Analysts, Data Sciences, Data Engineers, Machine Learning Engineer, AI Engineer, etc.

This entry was posted in BI, Data Science, Python, statistics and tagged Data Science, EDA, Exploratory Data Analysis, Python.

2 thoughts on “Python Data Profiling libraries”

Python Data Profiling libraries – NoteForDataScience said:
April 5, 2022 at 6:32 pm

[…] Python Data Profiling libraries […]

LikeLike

Data Profiling in Python – Curated SQL said:
April 5, 2022 at 1:10 pm

[…] Brendan Tierney uses Python to look at some data: […]

Comments are closed.

Ora-lytics

By Brendan Tierney