Data Science

Irish Whiskey Distilleries Data Set

Posted on Updated on

I’ve been building some Irish Whiskey data sets, and the first of these data sets contains details of all the Whiskey Distilleries in Ireland. This page contains the following:

  • Table describing the attributes/features of the data set
  • Data set, in a scroll able region
  • Download data set in different formats
  • Map of the Distilleries
  • Subscribe to Twitter List containing these Distilleries, and some Twitter Hash Tags
  • How to send me updates, corrections and details of Distilleries I’ve missed

If you use this data set (and my other data sets) make sure to add a reference back to data set webpage. Let me know if you use the data set is an interesting way, share the details with me and I’ll share it on my blog and social media for you.

This data set will have it’s own Irish Distilleries webpage and all updates to the data set and other information will be made there. Check out that webpage for the latest version of things.

Data Set Description

Data set contains 45 Distilleries.

ATTRIBUTE NAME DESCRIPTION
Distillery Name of the Distillery
County County / Area where distillery is located
Address Full address of the distillery
EIRCODE EirCode for distillery in Ireland. Distilleries in Northern Ireland will not have an EIRCODE
NI_Postcode Post code of distilleries located in Northern Ireland
Tours Does the distillery offer tours to visitors (Yes/No)
Web_Site Web site address
Twitter The twitter name of the distillery
Lat Latitude for the distillery
Long Longitude for the distillery
Notes_Parent_Company Contains other descriptive information about the distillery, founded by, parent company, etc.

Data Set (scroll able region)

Data set contains 45 Distilleries.

DISTILLERY COUNTY ADDRESS EIRCODE NI_POSTCODE TOURS WEB_SITE TWITTER LAT LONG NOTES_PARENT_COMPANY
Ballykeefe Distillery Kilkenny Kyle, Ballykeefe, Cuffsgrange, County Kilkenny, R95 NR50, Ireland R95 NR50 Yes https://ballykeefedistillery.ie  @BallykeefeD 52.602034 -7.375774 Ging Family
Belfast Distillery Antrim Crumlin Road Goal, Crumlin Road, Belfast, BT14 6ST, United Kingdom BT14 6ST No http://www.belfastdistillery.com  @BDCIreland 54.609718 -5.941994 J&J McConnell
Blacks Distillery Cork Farm Lane, Kinsale, Co. Cork P17 XW70 No https://www.blacksbrewery.com  @BlacksBrewery 51.710969 -8.515579
Blackwater Waterford Church Road, Ballinlevane East, Ballyduff, Co. Waterford, P51 C5C6 P51 C5C6 No https://blackwaterdistillery.ie/  @BlackDistillery 52.147581 -8.052973
Boann Louth Lagavooren, Platin Rd., Drogheda, Co. Louth, A92 X593 A92 X593 Yes http://boanndistillery.ie/  @Boanndistillery 53.69459 -6.366558 Cooney Family
Bow Street Dublin Bow St, Smithfield Village, Dublin 7 D07 N9VH Yes https://www.jamesonwhiskey.com/en-IE/visit-us/jameson-distillery-bow-st  @jamesonireland 53.348415 -6.277266 Pernod Ricard
Bushmills Distillery Antrim 2 Distillery Rd, Bushmills BT57 8XH, United Kingdom BT57 8XH Yes https://bushmills.com  @BushmillsGlobal 55.202936 -6.517221
Cape Clear Cork Cape Clear Island, Knockannamaurnagh, Skibbereen, Co. Cork P81 RX70 No https://www.capecleardistillery.com/  @capedistillery 51.4509 -9.483047
Clonakilty Cork The Waterfront, Clonakilty, Co. Cork P85 EW82 Yes https://www.clonakiltydistillery.ie/  @clondistillery 51.62165 -8.8855 Scully Family
Connacht Whiskey Distillery Mayo Belleek, Ballina, Co Mayo, F26 P932 F26 P932 Yes https://connachtwhiskey.com  @connachtwhiskey 54.122131 -9.143779
Cooley Distillery Louth Dundalk Rd, Maddox Garden, Carlingford, Dundalk, Co. Louth A91 FX98 Yes 53.996544 -6.221563 Beam Suntory
Copeland Distillery Down 43 Manor Street, Donaghadee, Co Down, Northern Ireland, BT21 0HG BT21 0HG Yes https://copelanddistillery.com @CopelandDistill 54.642699 -5.532739
Dingle Distillery Kerry Farranredmond, DIngle, Co. Kerry V92 E7YD Yes https://dingledistillery.ie/  @DingleWhiskey 52.141928 -10.289287
Dublin Liberties Dublin 33 Mill Street, Dublin 8, D08 V221 D08 V221 Yes https://thedld.com  @WeAreTheDLD 53.337343 -6.276367
Echlinville Distillery Down 62 Gransha Rd, Kircubbin, Newtownards BT22 1AJ, United Kingdom BT22 1AJ Yes https://echlinville.com/  @Echlinville 54.46909 -5.509397
Glendalough Wicklow Unit 9 Newtown Business And Enterprise Centre, Newtown Mount Kennedy, Co. Wicklow, A63 A439 A63 A439 No https://www.glendaloughdistillery.com/  @GlendaloughDist 53.085011 -6.1074016 Mark Anthony Brands International
Great Northern Distillery Louth Carrickmacross Road, Dundalk, Co. Louth, Ireland, A91 P8W9 A91 P8W9 No https://gndireland.com/  @GNDistillery 54.001574 -6.40964 Teeling Family, formally of Cooley Distillery
Hinch Distillery Down 19 Carryduff Road, Boardmills, Ballynahinch, Down, United Kingdom BT27 6TZ No https://hinchdistillery.com/  @hinchdistillery 54.461021 -5.903713
Kilbeggan Distillery Westmeath Lower Main St, Aghamore, Kilbeggan, Co. Westmeath, Ireland N91 W67N Yes https://www.kilbegganwhiskey.com  @Kilbeggan 53.369369 -7.502809 Beam Suntory
Kinahan’s Distillery Dublin 44 Fitzwilliam Place, Dublin D02 P027 No https://kinahanswhiskey.com @KinahansLL Sources Whiskey from around ireland
Lough Gill Sligo Hazelwood Avenue, Cams, Co. Sligo F91 Y820 F91 Y820 Yes https://www.athru.com/  @athruwhiskey 54.255318 -8.433156
Lough Mask Mayo Drioglann Loch Measc Teo, Killateeaun, Tourmakeady, Co. Mayo F12 PK75 Yes https://www.loughmaskdistillery.com/  @lough_mask 53.611819 -9.444077 David Raethorne
Lough Ree Longford Main Street, Lanesborough, Co. Longford N39 P229 No https://www.lrd.ie  @LoughReeDistill 53.673328 -7.99043
Matt D’Arcy Down 27 St Marys St, Newry BT34 2AA, United Kingdom BT34 2AA No http://www.mattdarcys.com  @mattdarcys 54.172817 -6.339367
Midleton Distillery Cork Old Midleton Distillery, Distillery Walk, Midleton, Co. Cork.  P25 Y394 P25 Y394 Yes https://www.jamesonwhiskey.com/en-IE/visit-us/jameson-distillery-midleton  @jamesonireland 51.916344 -8.165174 Pernod Ricard
Nephin Mayo Nephin Whiskey Company, Nephin Square, Lahardane, Co. Mayo F26 W2H9 No http://nephinwhiskey.com/  @NephinWhiskey 54.029011 -9.32211
Pearse Lyons Distillery Dublin 121-122 James’s Street Dublin 8, D08 ET27 D08 ET27 Yes https://www.pearselyonsdistillery.com  @PLDistillery 53.343708 -6.289351
Powerscourt Wicklow Powerscourt Estate, Enniskerry, Co. Wicklow, A98 A9T7 A98 A9T7 Yes https://powerscourtdistillery.com/  @PowerscourtDist 53.184167 -6.190794
Rademon Estate Distillery Down Rademon Estate Distillery, Downpatrick, County Down, United Kingdom BT30 9HR Yes https://rademonestatedistillery.com  @RademonEstate 54.396039 -5.790968
Roe & Co Dublin 92 James’s Street, Dublin 8 D08 YYW9 Yes https://www.roeandcowhiskey.com 53.343731 -6.285673
Royal Oak Distillery Carlow Clorusk Lower, Royaloak, Co. Carlow R21 KR23 Yes https://royaloakdistillery.com/  @royaloakwhiskey 52.703341 -6.978711 Saronno
Scotts Irish Distillery Fermanagh Main Street, Garrison, Co Fermanagh, BT93 4ER, United Kingdom BT93 4ER No http://scottsirish.com 54.417726 -8.083534
Skellig Six 18 Distillery Kerry Valentia Rd, Garranearagh, Cahersiveen, Co. Kerry, V23 YD89 V23 YD89 Yes https://skelligsix18distillery.ie  @SkelligSix18 51.935701 -10.239549
Slane Castle Distillery Meath Slane Castle, Slane, Co. Meath C15 F224 Yes https://www.slaneirishwhiskey.com/  @slanewhiskey 53.711065 -6.562735 Brown-Forman & Conyngham Family
Sliabh Liag Donegal Line Road, Carrick, Co Donegal, F94 X9DX F94 X9DX Yes https://www.sliabhliagdistillers.com/  @sliabhliagdistl 54.6545 -8.633847
Teeling Whiskey Distillery Dublin 13-17 Newmarket, The Liberties, Dublin 8, D08 KD91 D08 KD91 Yes https://teelingwhiskey.com/  @TeelingWhiskey 53.337862 -6.277123 Teeling Family
The Quiet Man Derry 10 Rossdowney Rd, Londonderry BT47 6NS, United Kingdom BT47 6NS No http://www.thequietmanirishwhiskey.com/  @quietmanwhiskey 54.995344 -7.301312 Niche Drinks
The Shed Distillery Leitrim Carrick on shannon Road, Drumshanbo, Co. Leitrim N41 R6D7 No http://thesheddistillery.com/  @SHEDDISTILLERY 54.047145 -8.04358
Thomond Gate Distillery Limerick No https://thomondgatewhiskey.com/ @ThomondW Nicholas Ryan
Tipperary Tipperary Newtownadam, Cahir, Co. Tipperary No http://tipperarydistillery.ie/  @TippDistillery 52.358622 -7.881875
Tullamore Distillery Offaly Bury Quay, Tullamore, Co. Offaly R35 XW13 Yes https://www.tullamoredew.com  @TullamoreDEW 53.377774 -7.492944
Walsh Whiskey Distillery Carlow Equity House, Deerpark Business Park, Dublin Rd, Carlow R93 K7W4 No http://walshwhiskey.com/  @walshwhiskey 52.853417 -6.883916 Walsh Family
Waterford Distillery Waterford 9 Mary Street, Grattan Quay, Waterford City, Co. Waterford X91 KF51 No https://waterfordwhisky.com/  @waterforddram 52.264308 -7.120997 Renegade Spirits Ireland Ltd
Wayward Irish Distillery Kerry Lakeview House & Estate, Fossa Road, Maulagh, Killarney, Co. Kerry, V93 F7Y5 V93 F7Y5 No https://www.waywardirish.com  @wayward_irish 52.071045 -9.590709 O’Connell Fomily
West Cork Distillers Cork Marsh Rd, Marsh, Skibbereen, Co. Cork P81 YY31 No http://www.westcorkdistillers.com/  @WestCorkDistill 51.557804 -9.268941

Download Data Set

Irish_Whiskey_Distilleries – Excel Spreadsheet

Irish_Whiskey_Distilleries.csv – Zipped CSV file

I’ll be adding some additional formats soon.

Map of Distilleries

Here is a map with the Distilleries plotted using Google Maps.

Screenshot 2020-02-13 15.22.40

Twitter Lists & Twitter Hash Tags

I’ve created a Twitter list containing the twitter accounts for all of these distilleries. You can subscribe to the list to get all the latest posts from these distilleries

Irish Whishkey Distillery Twitter List

Have a look out for these twitter hash tags on a Friday, Saturday or Sunday night, as people from around the world share what whiskeys they are tasting that evening. Irish and Scotish Whiskies are the most common.

#FridayNightDram
#FridayNightSip
#SaturdayNightSip
#SaturdayNightDram
#SundayNightSip
#SundayNightDram

How to send me updates, corrections and details of Distilleries I’ve missed

Let me know, via the my Contact page,  if you see any errors in the data set, especially if I’m missing any distilleries.

Data Science (The MIT Press Essential Knowledge series) – available in English, Korean and Chinese

Posted on Updated on

Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. It book was published as part of their Essentials Series.

During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!

2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.

The Japanese and Turkish translations will be available in a few months!

Go get the English version of the book on Amazon in print, Kindle and Audio formats.

https://amzn.to/2qC84KN

This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenge the goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.

Go check it out.

Amazon.com.          Amazon.co.uk

Screenshot 2020-02-05 11.46.03

Scottish Whisky Data Set – Updated

Posted on Updated on

The Scottish Whiskey data set consist of tasting notes and evaluations from 86 distilleries around Scotland. This data set has been around a long time andwas a promotional site for a book, Whisky Classified: Choosing Single Malts by Flavour. Written by David Wishart of the University of Saint Andrews, the book had its most recent printing in February 2012.

I’ve been using this data set in one of my conference presentations (Planning my Summer Vacation), but to use this data set I need to add 2 new attributes/features to the data set. Each of the attributes are listed below and the last 2 are the attributes I added. These were added to include the converted LAT and LONG comparable with Google Maps and other similar mapping technology.

Attributes include:

  • RowID
  • Distillery
  • Body
  • Sweetness
  • Smoky
  • Medicinal
  • Tobacco
  • Honey
  • Spicy
  • Winey
  • Nutty,
  • Malty,
  • Fruity,
  • Floral,
  • Postcode,
  • Latitude,
  • Longitude
  • lat  — newly added
  • long  — newly added

Here is the link to download and use this updated Scottish Whisky data set.

The original website is no longer available but if you have a look at the Internet Archive you will find the website.

Screenshot 2020-01-23 14.44.53

#GE2020 Comparing Party Manifestos to 2016

Posted on

A few days ago I wrote a blog post about using Python to analyze the 2016 general (government) elections manifestos of the four main political parties in Ireland.

Today the two (traditional) largest parties released their #GE2020 manifestos. You can get them by following these links

The following images show the WordClouds generated for the #GE2020 Manifestos. I used the same Python code used in my previous post. If you want to try this out yourself, all the Python code is there.

First let us look at the WordClouds from Fine Gael.

FG2020
2020 Manifesto
FG_2016
2016 Manifesto

Now for the Fianna Fail WordClouds.

FF2020
2020 Manifesto
FF_2016
2016 Manifesto

When you look closely at the differences between the manifestos you will notice there are some common themes across the manifestos from 2016 to those in the 2020 manifestos. It is also interesting to see some new words appearing/disappearing for the 2020 manifestos. Some of these are a little surprising and interesting.

#GE2020 Analysing Party Manifestos using Python

Posted on

The general election is underway here in Ireland with polling day set for Saturday 8th February. All the politicians are out campaigning and every day the various parties are looking for publicity on whatever the popular topic is for that day. Each day is it a different topic.

Most of the political parties have not released their manifestos for the #GE2020 election (as of date of this post). I want to use some simple Python code to perform some analyse of their manifestos. As their new manifestos weren’t available (yet) I went looking for their manifestos from the previous general election. Michael Pidgeon has a website with party manifestos dating back to the early 1970s, and also has some from earlier elections. Check out his website.

I decided to look at manifestos from the 4 main political parties from the 2016 general election. Yes there are other manifestos available, and you can use the Python code, given below to analyse those, with only some minor edits required.

The end result of this simple analyse is a WordCloud showing the most commonly used words in their manifestos. This is graphical way to see what some of the main themes and emphasis are for each party, and also allows us to see some commonality between the parties.

Let’s begin with the Python code.

1 – Initial Setup

There are a number of Python Libraries available for processing PDF files. Not all of them worked on all of the Part Manifestos PDFs! It kind of depends on how these files were generated. In my case I used the pdfminer library, as it worked with all four manifestos. The common library PyPDF2 didn’t work with the Fine Gael manifesto document.

import io
import pdfminer
from pprint import pprint
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#directory were manifestos are located
wkDir = '.../General_Election_Ire/'

#define the names of the Manifesto PDF files & setup party flag
pdfFile = wkDir+'FGManifesto16_2.pdf'
party = 'FG'
#pdfFile = wkDir+'Fianna_Fail_GE_2016.pdf'
#party = 'FF'
#pdfFile = wkDir+'Labour_GE_2016.pdf'
#party = 'LB'
#pdfFile = wkDir+'Sinn_Fein_GE_2016.pdf'
#party = 'SF'

All of the following code will run for a given manifesto. Just comment in or out the manifesto you are interested in. The WordClouds for each are given below.

2 – Load the PDF File into Python

The following code loops through each page in the PDF file and extracts the text from that page.

I added some addition code to ignore pages containing the Irish Language. The Sinn Fein Manifesto contained a number of pages which were the Irish equivalent of the preceding pages in English. I didn’t want to have a mixture of languages in the final output.

SF_IrishPages = [14,15,16,17,18,19,20,21,22,23,24]
text = ""

pageCounter = 0
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

for page in PDFPage.get_pages(open(pdfFile,'rb'), caching=True, check_extractable=True):
    if (party == 'SF') and (pageCounter in SF_IrishPages):
        print(party+' - Not extracting page - Irish page', pageCounter)
    else:
        print(party+' - Extracting Page text', pageCounter)
        page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    pageCounter += 1

print('Finished processing PDF document')
converter.close()
fake_file_handle.close()
FG - Extracting Page text 0
FG - Extracting Page text 1
FG - Extracting Page text 2
FG - Extracting Page text 3
FG - Extracting Page text 4
FG - Extracting Page text 5
...

3 – Tokenize the Words

The next step is to Tokenize the text. This breaks the text into individual words.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = []

tokens = word_tokenize(text)

print('Number of Pages =', pageCounter)
print('Number of Tokens =',len(tokens))
Number of Pages = 140
Number of Tokens = 66975

4 – Filter words, Remove Numbers & Punctuation

There will be a lot of things in the text that we don’t want included in the analyse. We want the text to only contain words. The following extracts the words and ignores numbers, punctuation, etc.

#converts to lower case, and removes punctuation and numbers
wordsFiltered = [tokens.lower() for tokens in tokens if tokens.isalpha()]
print(len(wordsFiltered))
print(wordsFiltered)
58198
['fine', 'gael', 'general', 'election', 'manifesto', 's', 'keep', 'the', 'recovery', 'going', 'gaelgeneral', 'election', 'manifesto', 'foreward', 'from', 'an', 'taoiseach', 'the', 'long', 'term', 'economic', 'three', 'steps', 'to', 'keep', 'the', 'recovery', 'going', 'agriculture', 'and', 'food', 'generational',
...

As you can see the number of tokens has reduced from 66,975 to 58,198.

5 – Setup Stop Words

Stop words are general words in a language that doesn’t contain any meanings and these can be removed from the data set. Python NLTK comes with a set of stop words defined for most languages.

#We initialize the stopwords variable which is a list of words like 
#"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
....

Additional stop words can be added to this list. I added the words listed below. Some of these you might expect to be in the stop word list, others are to remove certain words that appeared in the various manifestos that don’t have a lot of meaning. I also added the name of the parties  and some Irish words to the stop words list.

#some extra stop words are needed after examining the data and word cloud
#these are added
extra_stop_words = ['ireland','irish','ł','need', 'also', 'set', 'within', 'use', 'order', 'would', 'year', 'per', 'time', 'place', 'must', 'years', 'much', 'take','make','making','manifesto','ð','u','part','needs','next','keep','election', 'fine','gael', 'gaelgeneral', 'fianna', 'fáil','fail','labour', 'sinn', 'fein','féin','atá','go','le','ar','agus','na','ár','ag','haghaidh','téarnamh','bplean','page','two','number','cothromfor']
stop_words.extend(extra_stop_words)
print(stop_words)

Now remove these stop words from the list of tokens.

# remove stop words from tokenised data set
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(len(filtered_words))
print(filtered_words)
31038
['general', 'recovery', 'going', 'foreward', 'taoiseach', 'long', 'term', 'economic', 'three', 'steps', 'recovery', 'going', 'agriculture', 'food',

The number of tokens is reduced to 31,038

6 – Word Frequency Counts

Now calculate how frequently these words occur in the list of tokens.

#get the frequency of each word
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
cnt[word] += 1

print(cnt)
Counter({'new': 340, 'support': 249, 'work': 190, 'public': 186, 'government': 177, 'ensure': 177, 'plan': 176, 'continue': 168, 'local': 150, 
...

7 – WordCloud

We can use the word frequency counts to add emphasis to the WordCloud. The more frequently it occurs the larger it will appear in the WordCloud.

#create a word cloud using frequencies for emphasis 
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=100, margin=9, background_color='white',
scale=3, relative_scaling = 0.5, width=500, height=400,
random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20,10))
plt.imshow(wc)
#plt.axis("off")
plt.show()

#Save the image in the img folder:
wc.to_file(wkDir+party+"_2016.png")

The last line of code saves the WordCloud image as a file in the directory where the manifestos are located.

8 – WordClouds for Each Party

Screenshot 2020-01-21 11.10.25

Remember these WordClouds are for the manifestos from the 2016 general election.

When the parties have released their manifestos for the 2020 general election, I’ll run them through this code and produce the WordClouds for 2020. It will be interesting to see the differences between the 2016 and 2020 manifesto WordClouds.

Data Profiling in Python

Posted on Updated on

With every data analytics and data science project, one of the first tasks to that everyone needs to do is to profile the data sets. Data profiling allows you to get an initial picture of the data set, see data distributions and relationships. Additionally it allows us to see what kind of data cleaning and data transformations are necessary.

Most data analytics tools and languages have some functionality available to help you. Particular the various data science/machine learning products have this functionality built-in them and can do a lot of the data profiling automatically for you. But if you don’t use these tools/products, then you are probably using R and/or Python to profile your data.

With Python you will be working with the data set loaded into a Pandas data frame. From there you will be using various statistical functions and graphing functions (and libraries) to create a data profile. From there you will probably create a data profile report.

But one of the challenges with doing this in Python is having different coding for handling numeric and character based attributes/features. The describe function in Python (similar to the summary function in R) gives some statistical summaries for numeric attributes/features. A different set of functions are needed for character based attributes. The Python Library repository (https://pypi.org/) contains over 200K projects. But which ones are really useful and will help with your data science projects. Especially with new projects and libraries being released on a continual basis? This is a major challenge to know what is new and useful.

For example the followings shows loading the titanic data set into a Pandas data frame, creating a subset and using the describe function in Python.

import pandas as pd

df = pd.read_csv("/Users/brendan.tierney/Dropbox/4-Datasets/titanic/train.csv")

df.head(5)

Screenshot 2019-11-22 16.58.39

df2 = df.iloc[:,[1,2,4,5,6,7,8,10,11]]
df2.head(5)

Screenshot 2019-11-22 16.59.30

df2.describe()

Screenshot 2019-11-22 17.00.17

You will notice the describe function has only looked at the numeric attributes.

One of those 200+k Python libraries is one called pandas_profiling. This will create a data audit report for both numeric and character based attributes. This most be good, Right?  Let’s take a look at what it does.

For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values matrix, count, heatmap and dendrogram of missing values

The first step is to install the pandas_profiling library.

pip3 install pandas_profiling

Now run the pandas_profiling report for same data frame created and used, see above.

import pandas_profiling as pp

df2.profile_report()

The following images show screen shots of each part of the report. Click and zoom into these to see more details.

Screenshot 2019-11-22 17.29.00Screenshot 2019-11-22 17.29.46

Screenshot 2019-11-22 17.30.57Screenshot 2019-11-22 17.31.32

Screenshot 2019-11-22 17.31.57Screenshot 2019-11-22 17.32.31

Screenshot 2019-11-22 17.33.02

 

Demographics vs Psychographics for Machine Learning

Posted on Updated on

When preparing data for data science, data mining or machine learning projects you will create a data set that describes the various characteristics of the subject or case record. Each attribute will contain some descriptive information about the subject and is related to the target variable in some way.

In addition to these attributes, the data set will be enriched with various other internal/external data to complete the data set.

Some of the attributes in the data set can be grouped under the heading of Demographics. Demographic data contains attributes that explain or describe the person or event each case record is focused on. For example, if the subject of the case record is based on Customer data, this is the “Who” the demographic data (and features/attributes) will be about. Examples of demographic data include:

  • Age range
  • Marital status
  • Number of children
  • Household income
  • Occupation
  • Educational level

These features/attributes are typically readily available within your data sources and if they aren’t then these name be available from a purchased data set.

Additional feature engineering methods are used to generate new features/attributes that express meaning is different ways. This can be done by combining features in different ways, binning, dimensionality reduction, discretization, various data transformations, etc. The list can go on.

The aim of all of this is to enrich the data set to include more descriptive data about the subject. This enriched data set will then be used by the machine learning algorithms to find the hidden patterns in the data. The richer and descriptive the data set is the greater the likelihood of the algorithms in detecting the various relationships between the features and their values. These relationships will then be included in the created/generated model.

Another approach to consider when creating and enriching your data set is move beyond the descriptive features typically associated with Demographic data, to include Pyschographic data.

Psychographic data is a variation on demographic data where the feature are about describing the habits of the subject or customer.  Demographics focus on the “who” while psycographics focus on the “why”. For example, a common problem with data sets is that they describe subjects/people who have things in common. In such scenarios we want to understand them at a deeper level. Psycographics allows us to do this. Examples of Psycographics include:

  • Lifestyle activities
  • Evening activities
  • Purchasing interests – quality over economy,  how environmentally concerned are you
  • How happy are you with work, family, etc
  • Social activities and changes in these
  • What attitudes you have for certain topic areas
  • What are your principles and beliefs

The above gives a far deeper insight into the subject/person and helps to differentiate each subject/person from each other, when there is a high similarity between all subjects in the data set. For example, demographic information might tell you something about a person’s age, but psychographic information will tell you that the person is just starting a family and is in the market for baby products.

I’ll close with this. Consider the various types of data gathering that companies like Google, Facebook, etc perform. They gather lots of different types of data about individuals. This allows them to build up a complete and extensive profile of all activities for individuals. They can use this to deliver more accurate marketing and advertising. For example, Google gathers data about what places to visit throughout a data, they gather all your search results, and lots of other activities. They can do a lot with this data. but now they own Fitbit. Think about what they can do with that data and particularly when combined with all the other data they have about you. What if they had access to your medical records too!  Go Google this ! You will find articles about them now having access to your health records. Again combine all of the data from these different data sources. How valuable is that data?