Data Science

Principal Component Analysis (PCA) in Oracle

Posted on Updated on

Principal Component Analysis (PCA) is a statistical process used for feature or dimensionality reduction in data science and machine learning projects. It summarizes the features of a large data set into a smaller set of features by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. There are lots of resources that go into the mathematics behind this approach. I’m not going to go into that detail here; a quick internet search will get you what you need.

PCA can be used to discover important features from large data sets (large as in having a large number of features), while preserving as much information as possible.

Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that approximate the data as well as possible in the least squares sense. A line or plane that is the least squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible.
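To make the "largest variance" idea concrete, here is a minimal pure-Python sketch for the two-dimensional case only. The function name first_principal_axis and the sample points are made up for illustration, and this is not how Oracle computes PCA; it just finds the direction along which a small set of 2-D points varies the most.

```python
import math

def first_principal_axis(points):
    """Return (unit direction, variance along it, total variance) for 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # entries of the 2x2 sample covariance matrix
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # angle of the leading eigenvector of the covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # project the centred points onto that direction (the 1-D coordinates)
    proj = [(x - mx) * direction[0] + (y - my) * direction[1] for x, y in points]
    var_along = sum(p * p for p in proj) / (n - 1)
    return direction, var_along, sxx + syy

# illustrative points lying close to a straight line
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
direction, var_along, total_var = first_principal_axis(points)
print('direction:', direction)
print('share of total variance kept:', var_along / total_var)
```

Because the points lie close to a line, a single component keeps nearly all of the variance, which is the situation where reducing two features to one loses very little.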

Oracle has implemented PCA using Singular Value Decomposition (SVD) on the covariance and correlations between variables, for feature extraction/reduction. PCA is closely related to SVD. PCA computes a set of orthonormal bases (principal components) that are ranked by their corresponding explained variance. The main difference between SVD and PCA is that the PCA projection is not scaled by the singular values. The extracted features are transformed features consisting of linear combinations of the original features.

When machine learning is performed on this reduced set of transformed features, it can be completed with fewer resources and in less time, while still maintaining accuracy.

Algorithm Name in Oracle

Mining Model Function = FEATURE_EXTRACTION

Algorithm = ALGO_SINGULAR_VALUE_DECOMP

(Hyper)-Parameters for algorithms

  • SVDS_U_MATRIX_OUTPUT : SVDS_U_MATRIX_ENABLE or SVDS_U_MATRIX_DISABLE
  • SVDS_SCORING_MODE : SVDS_SCORING_SVD or SVDS_SCORING_PCA
  • SVDS_SOLVER : possible values include SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN, SVDS_SOLVER_SSVD, SVDS_SOLVER_STEIGEN
  • SVDS_TOLERANCE : range of 0…1
  • SVDS_RANDOM_SEED : range of 0…4294967296 (!)
  • SVDS_OVER_SAMPLING : range of 1…5000
  • SVDS_POWER_ITERATIONS : Default value 2, with possible range of 0…20

Let’s work through an example using the MINING_DATA_BUILD_V data set that comes with Oracle Data Miner.

First step is to define the parameter settings for the algorithm. No data preparation is needed as the algorithm takes care of this. This means you can disable the Automatic Data Preparation (ADP).

-- create the parameter table
CREATE TABLE svd_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(4000));

-- define the settings for SVD algorithm
BEGIN 
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_singular_value_decomp);

   -- turn OFF ADP
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off); 

   -- set PCA scoring mode
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.svds_scoring_mode, dbms_data_mining.svds_scoring_pca);

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_shift_2dnum, dbms_data_mining.prep_shift_mean); 

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_scale_2dnum, dbms_data_mining.prep_scale_stddev); 
END;
/
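As I read the prep_shift_mean and prep_scale_stddev settings above, they ask for each numeric attribute to be shifted by its mean and scaled by its standard deviation. Here is a small Python sketch of that kind of standardization. The function name, the sample values and the use of the sample standard deviation are my assumptions for illustration; check the Oracle documentation for the exact calculation.

```python
from statistics import mean, stdev

def standardize(values):
    """Shift by the mean and scale by the standard deviation (z-scores)."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

# illustrative values for one numeric attribute
ages = [23.0, 35.0, 41.0, 52.0, 64.0]
z_scores = standardize(ages)
print([round(z, 3) for z in z_scores])
```

After this step every attribute is on a comparable scale, so an attribute measured in large units does not dominate the components just because of its units.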

You are now ready to create the model.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'SVD_MODEL',
      mining_function     => dbms_data_mining.feature_extraction,
      data_table_name     => 'mining_data_build_v',
      case_id_column_name => 'CUST_ID',
      settings_table_name => 'svd_settings');
END;
/

Once the model is created, you can use the mining model data dictionary views to explore it, and the model-specific views to examine the specifics of the model and the various MxN matrices it produced. These include:

  • DM$VESVD_Model : Singular Value Decomposition S Matrix
  • DM$VGSVD_Model : Global Name-Value Pairs
  • DM$VNSVD_Model : Normalization and Missing Value Handling
  • DM$VSSVD_Model : Computed Settings
  • DM$VUSVD_Model : Singular Value Decomposition U Matrix
  • DM$VVSVD_Model : Singular Value Decomposition V Matrix
  • DM$VWSVD_Model : Model Build Alerts

The S, V and U matrices contain:

  • U matrix : consists of a set of ‘left’ orthonormal bases
  • S matrix : is a diagonal matrix
  • V matrix : consists of a set of ‘right’ orthonormal bases

These can be explored using the following

-- S matrix
select feature_id, VALUE, variance, pct_cum_variance 
from DM$VESVD_MODEL;

-- V matrix
select feature_id, attribute_name, value
from DM$VVSVD_MODEL
order by feature_id, attribute_name;

-- U matrix (only stored when SVDS_U_MATRIX_OUTPUT is enabled)
select case_id, feature_id, value
from DM$VUSVD_MODEL
order by case_id, feature_id;

To determine the projections to be used for visualizations we can use the FEATURE_VALUE function.

select FEATURE_VALUE(svd_model, 1 USING *) proj1, 
       FEATURE_VALUE(svd_model, 2 USING *) proj2
from   mining_data_build_v 
where  cust_id <= 101510
order by 1, 2;

 

Other algorithms available in Oracle for feature extraction and reduction include:

  • Non-Negative Matrix Factorization (NMF)
  • Explicit Semantic Analysis (ESA)
  • Minimum Description Length (MDL) – this is really feature selection rather than feature extraction

k-Fold and Repeated k-Fold Cross Validation in Python


When it comes to evaluating the performance of a machine learning model there are a number of different approaches. There are just as many viewpoints on which evaluation metric is the best one to use.

One of the common approaches is to use k-Fold cross validation. This divides the data into ‘k‘ non-overlapping parts (or Folds). One of these parts/Folds is held out for testing and the remaining (k-1) parts/Folds are used to train a model. This model is then applied to the held-out part/Fold. This process is repeated across all ‘k‘ parts/Folds until every part has been used once for testing. The results from applying the model are aggregated and the mean performance is reported.
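The splitting described above can be sketched in a few lines of plain Python. This is only an illustration of the idea (the function name k_fold_indices is made up, and there is no shuffling); scikit-learn’s KFold is what is used in the rest of this post.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k non-overlapping folds."""
    indices = list(range(n_samples))
    # spread any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                  # hold-out fold
        train_idx = indices[:start] + indices[start + size:]    # remaining k-1 folds
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print('train:', train_idx, 'test:', test_idx)
```

Every record appears in exactly one hold-out fold, so each record is used for testing once and for training k-1 times.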

Traditionally, ‘k‘ is set to 10, although the default varies between languages, libraries, packages and applications (for example, scikit-learn’s KFold defaults to 5 splits). This number can be changed to anything you want. Most reports indicate a value of between 5 and 10, as these seem to give results that don’t suffer from excessive bias or variance.

Let’s take a look at an example of k-Fold Cross Validation using the scikit-learn library. The first step is to prepare the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

bank_file = "/.../4-Datasets/bank-additional-full.csv"

# import dataset
df = pd.read_csv(bank_file, sep=';',)

# get basic details of df (num records, num features)
df.shape

print('Percentage per target class ')
df['y'].value_counts()/len(df) #calculate percentages

#Data Clean up
df = df.drop('duration', axis=1) #this is highly correlated to target variable
#binary encoding of class label; done before get_dummies so 'y' stays a single numeric column
df['y'] = df['y'].map({'no':0, 'yes':1})
df_new = pd.get_dummies(df) #simple and easy approach for categorical variables
df_new.describe()

#split data set into input variables and target variables
## create separate dataframes for Input features (X) and for Target feature (Y)
X_train = df_new.drop('y', axis=1)
Y_train = df_new['y']

Now we can perform k-fold cross validation.

#load scikit-learn k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#setup for k-Fold Cross Validation
cv = KFold(n_splits=10, shuffle=True, random_state=1)
#n_splits = number of k-folds
#shuffle = shuffles data set prior to split
#random_state = seed for (pseudo)random number generator

#define model
model = LogisticRegression()
#create model, perform cross validation and evaluate model
scores = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=cv, n_jobs=-1)
#performance result
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

 

We can see from the above example the model is evaluated across 10 folds, giving the accuracy score for each of these. The mean of these 10 accuracy scores is calculated along with the standard deviation, which in this example is very small. You may have slightly different results and this will vary from data set to data set.

The results from k-fold can be noisy, as in each time the code is run a slightly different result may be achieved. This is due to having differing splits of the data set into the k-folds (for example, when a different random seed is used). The model accuracy can vary between each execution and it can be difficult to determine which iteration of the model should be used.

One way to address this possible noise is to estimate the model accuracy/performance based on running k-fold a number of times and calculating the performance across all the repeats. This approach is called Repeated k-Fold Cross-Validation. Yes, there is a computation cost for performing this approach, and it is therefore suited to datasets of smaller scale. In most scenarios having data sets up to 1M records/cases is possible, and depending on the hardware and memory, it can scale to many times that and still be relatively quick to run.

[a small data set for one person could be another person’s Big Data set!]

How many repeats should be performed? It kind of depends on how noisy the data is, but in a similar way to having ten as a typical value for k, a typical number of repeats is ten. This can be adjusted to, say, 5, but some testing/experimentation is needed to determine a suitable value.
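Before using the library version, here is a rough plain-Python sketch of what repeating k-fold means: reshuffle the data for each repeat, split it into k folds, and collect one score per fold per repeat. The function name, the toy labels and the majority-class "model" are all made up for illustration; this is not scikit-learn’s implementation.

```python
import random
from statistics import mean, stdev

def repeated_k_fold(n_samples, k, n_repeats, seed):
    """Yield (train_idx, test_idx) for k folds, repeated with a fresh shuffle each time."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                 # a different split for every repeat
        for i in range(k):
            test_idx = indices[i::k]         # every k-th shuffled index forms fold i
            held_out = set(test_idx)
            train_idx = [j for j in indices if j not in held_out]
            yield train_idx, test_idx

# toy data: 80 records of class 0 and 20 records of class 1
labels = [0] * 80 + [1] * 20

scores = []
for train_idx, test_idx in repeated_k_fold(len(labels), k=10, n_repeats=3, seed=1):
    train_labels = [labels[j] for j in train_idx]
    # stand-in "model": always predict the most common class seen in training
    majority = max(set(train_labels), key=train_labels.count)
    accuracy = sum(labels[j] == majority for j in test_idx) / len(test_idx)
    scores.append(accuracy)

print('Number of scores:', len(scores))
print('Accuracy: %.3f (%.3f)' % (mean(scores), stdev(scores)))
```

With k=10 and 3 repeats the model is evaluated 30 times, which is where the extra computation cost comes from.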

Building upon the k-fold example code given previously, the following shows an example of using Repeated k-Fold Cross Validation.

#Repeated k-Fold Cross Validation
#load the necessary libraries
from numpy import mean
from numpy import std
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#using the same data set created for k-Fold => X_train, Y_train

#Setup and configure settings for Repeated k-Fold CV (k-folds=10, repeats=10)
rcv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)

#define model
model = LogisticRegression()

#create model, perform Repeated CV and evaluate model
scores = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=rcv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

 

[New Book] 97 Things about Data Ethics in Data Science – Collective Wisdom from the Experts


Some months ago I was approached about being part of and contributing to a new book on Data Ethics for Data Science. It is now available to purchase on Amazon (and elsewhere), and this book now becomes the sixth book that I’ve either solely or co-written. Check out all my books here.

This has been an area I’ve been working in for some time now, in both research and assisting companies. I was able to make a couple of contributions to this book, and there have been great contributions from (other) global experts in Data Science and Data Ethics. The book has been edited by Bill Franks.

Most of the high-profile cases of real or perceived unethical activity in data science aren’t matters of bad intent. Rather, they occur because the ethics simply aren’t thought through well enough. Being ethical takes constant diligence, and in many situations identifying the right choice can be difficult.

In this in-depth book, contributors from top companies in technology, finance, and other industries share experiences and lessons learned from collecting, managing, and analyzing data ethically. Data science professionals, managers, and tech leaders will gain a better understanding of ethics through powerful, real-world best practices.

The book is available in paperback and Kindle formats and is published by O’Reilly Press.

You might be interested in my previous book on Data Science, part of the MIT Press Essential Knowledge series. This book was a Best Seller in 2018 and 2019 on Amazon.

 

 

Irish Whiskey Distilleries Data Set


I’ve been building some Irish Whiskey data sets, and the first of these data sets contains details of all the Whiskey Distilleries in Ireland. This page contains the following:

  • Table describing the attributes/features of the data set
  • Data set, in a scrollable region
  • Download data set in different formats
  • Map of the Distilleries
  • Subscribe to Twitter List containing these Distilleries, and some Twitter Hash Tags
  • How to send me updates, corrections and details of Distilleries I’ve missed

If you use this data set (and my other data sets) make sure to add a reference back to the data set webpage. Let me know if you use the data set in an interesting way; share the details with me and I’ll share it on my blog and social media for you.

This data set will have its own Irish Distilleries webpage and all updates to the data set and other information will be made there. Check out that webpage for the latest version of things.

Data Set Description

Data set contains 45 Distilleries.

ATTRIBUTE NAME : DESCRIPTION

  • Distillery : Name of the Distillery
  • County : County / Area where distillery is located
  • Address : Full address of the distillery
  • EIRCODE : EirCode for distillery in Ireland. Distilleries in Northern Ireland will not have an EIRCODE
  • NI_Postcode : Post code of distilleries located in Northern Ireland
  • Tours : Does the distillery offer tours to visitors (Yes/No)
  • Web_Site : Web site address
  • Twitter : The Twitter name of the distillery
  • Lat : Latitude for the distillery
  • Long : Longitude for the distillery
  • Notes_Parent_Company : Contains other descriptive information about the distillery, founded by, parent company, etc.

Data Set (scrollable region)

Data set contains 45 Distilleries.

DISTILLERY COUNTY ADDRESS EIRCODE NI_POSTCODE TOURS WEB_SITE TWITTER LAT LONG NOTES_PARENT_COMPANY
Ballykeefe Distillery Kilkenny Kyle, Ballykeefe, Cuffsgrange, County Kilkenny, R95 NR50, Ireland R95 NR50 Yes https://ballykeefedistillery.ie  @BallykeefeD 52.602034 -7.375774 Ging Family
Belfast Distillery Antrim Crumlin Road Gaol, Crumlin Road, Belfast, BT14 6ST, United Kingdom BT14 6ST No http://www.belfastdistillery.com  @BDCIreland 54.609718 -5.941994 J&J McConnell
Blacks Distillery Cork Farm Lane, Kinsale, Co. Cork P17 XW70 No https://www.blacksbrewery.com  @BlacksBrewery 51.710969 -8.515579
Blackwater Waterford Church Road, Ballinlevane East, Ballyduff, Co. Waterford, P51 C5C6 P51 C5C6 No https://blackwaterdistillery.ie/  @BlackDistillery 52.147581 -8.052973
Boann Louth Lagavooren, Platin Rd., Drogheda, Co. Louth, A92 X593 A92 X593 Yes http://boanndistillery.ie/  @Boanndistillery 53.69459 -6.366558 Cooney Family
Bow Street Dublin Bow St, Smithfield Village, Dublin 7 D07 N9VH Yes https://www.jamesonwhiskey.com/en-IE/visit-us/jameson-distillery-bow-st  @jamesonireland 53.348415 -6.277266 Pernod Ricard
Bushmills Distillery Antrim 2 Distillery Rd, Bushmills BT57 8XH, United Kingdom BT57 8XH Yes https://bushmills.com  @BushmillsGlobal 55.202936 -6.517221
Cape Clear Cork Cape Clear Island, Knockannamaurnagh, Skibbereen, Co. Cork P81 RX70 No https://www.capecleardistillery.com/  @capedistillery 51.4509 -9.483047
Clonakilty Cork The Waterfront, Clonakilty, Co. Cork P85 EW82 Yes https://www.clonakiltydistillery.ie/  @clondistillery 51.62165 -8.8855 Scully Family
Connacht Whiskey Distillery Mayo Belleek, Ballina, Co Mayo, F26 P932 F26 P932 Yes https://connachtwhiskey.com  @connachtwhiskey 54.122131 -9.143779
Cooley Distillery Louth Dundalk Rd, Maddox Garden, Carlingford, Dundalk, Co. Louth A91 FX98 Yes 53.996544 -6.221563 Beam Suntory
Copeland Distillery Down 43 Manor Street, Donaghadee, Co Down, Northern Ireland, BT21 0HG BT21 0HG Yes https://copelanddistillery.com @CopelandDistill 54.642699 -5.532739
Dingle Distillery Kerry Farranredmond, Dingle, Co. Kerry V92 E7YD Yes https://dingledistillery.ie/  @DingleWhiskey 52.141928 -10.289287
Dublin Liberties Dublin 33 Mill Street, Dublin 8, D08 V221 D08 V221 Yes https://thedld.com  @WeAreTheDLD 53.337343 -6.276367
Echlinville Distillery Down 62 Gransha Rd, Kircubbin, Newtownards BT22 1AJ, United Kingdom BT22 1AJ Yes https://echlinville.com/  @Echlinville 54.46909 -5.509397
Glendalough Wicklow Unit 9 Newtown Business And Enterprise Centre, Newtown Mount Kennedy, Co. Wicklow, A63 A439 A63 A439 No https://www.glendaloughdistillery.com/  @GlendaloughDist 53.085011 -6.1074016 Mark Anthony Brands International
Great Northern Distillery Louth Carrickmacross Road, Dundalk, Co. Louth, Ireland, A91 P8W9 A91 P8W9 No https://gndireland.com/  @GNDistillery 54.001574 -6.40964 Teeling Family, formerly of Cooley Distillery
Hinch Distillery Down 19 Carryduff Road, Boardmills, Ballynahinch, Down, United Kingdom BT27 6TZ No https://hinchdistillery.com/  @hinchdistillery 54.461021 -5.903713
Kilbeggan Distillery Westmeath Lower Main St, Aghamore, Kilbeggan, Co. Westmeath, Ireland N91 W67N Yes https://www.kilbegganwhiskey.com  @Kilbeggan 53.369369 -7.502809 Beam Suntory
Kinahan’s Distillery Dublin 44 Fitzwilliam Place, Dublin D02 P027 No https://kinahanswhiskey.com @KinahansLL Sources Whiskey from around Ireland
Lough Gill Sligo Hazelwood Avenue, Cams, Co. Sligo F91 Y820 F91 Y820 Yes https://www.athru.com/  @athruwhiskey 54.255318 -8.433156
Lough Mask Mayo Drioglann Loch Measc Teo, Killateeaun, Tourmakeady, Co. Mayo F12 PK75 Yes https://www.loughmaskdistillery.com/  @lough_mask 53.611819 -9.444077 David Raethorne
Lough Ree Longford Main Street, Lanesborough, Co. Longford N39 P229 No https://www.lrd.ie  @LoughReeDistill 53.673328 -7.99043
Matt D’Arcy Down 27 St Marys St, Newry BT34 2AA, United Kingdom BT34 2AA No http://www.mattdarcys.com  @mattdarcys 54.172817 -6.339367
Midleton Distillery Cork Old Midleton Distillery, Distillery Walk, Midleton, Co. Cork.  P25 Y394 P25 Y394 Yes https://www.jamesonwhiskey.com/en-IE/visit-us/jameson-distillery-midleton  @jamesonireland 51.916344 -8.165174 Pernod Ricard
Nephin Mayo Nephin Whiskey Company, Nephin Square, Lahardane, Co. Mayo F26 W2H9 No http://nephinwhiskey.com/  @NephinWhiskey 54.029011 -9.32211
Pearse Lyons Distillery Dublin 121-122 James’s Street Dublin 8, D08 ET27 D08 ET27 Yes https://www.pearselyonsdistillery.com  @PLDistillery 53.343708 -6.289351
Powerscourt Wicklow Powerscourt Estate, Enniskerry, Co. Wicklow, A98 A9T7 A98 A9T7 Yes https://powerscourtdistillery.com/  @PowerscourtDist 53.184167 -6.190794
Rademon Estate Distillery Down Rademon Estate Distillery, Downpatrick, County Down, United Kingdom BT30 9HR Yes https://rademonestatedistillery.com  @RademonEstate 54.396039 -5.790968
Roe & Co Dublin 92 James’s Street, Dublin 8 D08 YYW9 Yes https://www.roeandcowhiskey.com 53.343731 -6.285673
Royal Oak Distillery Carlow Clorusk Lower, Royaloak, Co. Carlow R21 KR23 Yes https://royaloakdistillery.com/  @royaloakwhiskey 52.703341 -6.978711 Saronno
Scotts Irish Distillery Fermanagh Main Street, Garrison, Co Fermanagh, BT93 4ER, United Kingdom BT93 4ER No http://scottsirish.com 54.417726 -8.083534
Skellig Six 18 Distillery Kerry Valentia Rd, Garranearagh, Cahersiveen, Co. Kerry, V23 YD89 V23 YD89 Yes https://skelligsix18distillery.ie  @SkelligSix18 51.935701 -10.239549
Slane Castle Distillery Meath Slane Castle, Slane, Co. Meath C15 F224 Yes https://www.slaneirishwhiskey.com/  @slanewhiskey 53.711065 -6.562735 Brown-Forman & Conyngham Family
Sliabh Liag Donegal Line Road, Carrick, Co Donegal, F94 X9DX F94 X9DX Yes https://www.sliabhliagdistillers.com/  @sliabhliagdistl 54.6545 -8.633847
Teeling Whiskey Distillery Dublin 13-17 Newmarket, The Liberties, Dublin 8, D08 KD91 D08 KD91 Yes https://teelingwhiskey.com/  @TeelingWhiskey 53.337862 -6.277123 Teeling Family
The Quiet Man Derry 10 Rossdowney Rd, Londonderry BT47 6NS, United Kingdom BT47 6NS No http://www.thequietmanirishwhiskey.com/  @quietmanwhiskey 54.995344 -7.301312 Niche Drinks
The Shed Distillery Leitrim Carrick on shannon Road, Drumshanbo, Co. Leitrim N41 R6D7 No http://thesheddistillery.com/  @SHEDDISTILLERY 54.047145 -8.04358
Thomond Gate Distillery Limerick No https://thomondgatewhiskey.com/ @ThomondW Nicholas Ryan
Tipperary Tipperary Newtownadam, Cahir, Co. Tipperary No http://tipperarydistillery.ie/  @TippDistillery 52.358622 -7.881875
Tullamore Distillery Offaly Bury Quay, Tullamore, Co. Offaly R35 XW13 Yes https://www.tullamoredew.com  @TullamoreDEW 53.377774 -7.492944
Walsh Whiskey Distillery Carlow Equity House, Deerpark Business Park, Dublin Rd, Carlow R93 K7W4 No http://walshwhiskey.com/  @walshwhiskey 52.853417 -6.883916 Walsh Family
Waterford Distillery Waterford 9 Mary Street, Grattan Quay, Waterford City, Co. Waterford X91 KF51 No https://waterfordwhisky.com/  @waterforddram 52.264308 -7.120997 Renegade Spirits Ireland Ltd
Wayward Irish Distillery Kerry Lakeview House & Estate, Fossa Road, Maulagh, Killarney, Co. Kerry, V93 F7Y5 V93 F7Y5 No https://www.waywardirish.com  @wayward_irish 52.071045 -9.590709 O’Connell Family
West Cork Distillers Cork Marsh Rd, Marsh, Skibbereen, Co. Cork P81 YY31 No http://www.westcorkdistillers.com/  @WestCorkDistill 51.557804 -9.268941

Download Data Set

Irish_Whiskey_Distilleries – Excel Spreadsheet

Irish_Whiskey_Distilleries.csv – Zipped CSV file

I’ll be adding some additional formats soon.
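If you want a quick starting point for working with the CSV version in Python, the following is a small sketch using only the standard library. To keep it self-contained it reads three rows copied from the table above, limited to a few columns; with the downloaded file you would open Irish_Whiskey_Distilleries.csv instead. I’m assuming here that the column headers match the attribute names listed in the description, so check the header row of your download.

```python
import csv
import io
from collections import Counter

# three rows copied from the data set above (subset of columns only)
sample_csv = """Distillery,County,Tours,Lat,Long
Ballykeefe Distillery,Kilkenny,Yes,52.602034,-7.375774
Blacks Distillery,Cork,No,51.710969,-8.515579
Clonakilty,Cork,Yes,51.62165,-8.8855
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# number of distilleries per county
per_county = Counter(row['County'] for row in rows)

# distilleries that offer tours
with_tours = [row['Distillery'] for row in rows if row['Tours'] == 'Yes']

# coordinates as floats, ready for plotting on a map
coords = {row['Distillery']: (float(row['Lat']), float(row['Long'])) for row in rows}

print(per_county)
print(with_tours)
```

Some distilleries in the data set have no Lat/Long values, so the float conversion would need a check for empty strings when run against the full file.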

Map of Distilleries

Here is a map with the Distilleries plotted using Google Maps.

[Image: map of the distilleries plotted using Google Maps]

Twitter Lists & Twitter Hash Tags

I’ve created a Twitter list containing the Twitter accounts for all of these distilleries. You can subscribe to the list to get all the latest posts from these distilleries.

Irish Whiskey Distillery Twitter List

Have a look out for these Twitter hash tags on a Friday, Saturday or Sunday night, as people from around the world share what whiskeys they are tasting that evening. Irish and Scottish Whiskies are the most common.

#FridayNightDram
#FridayNightSip
#SaturdayNightSip
#SaturdayNightDram
#SundayNightSip
#SundayNightDram

How to send me updates, corrections and details of Distilleries I’ve missed

Let me know, via my Contact page, if you see any errors in the data set, especially if I’m missing any distilleries.

Data Science (The MIT Press Essential Knowledge series) – available in English, Korean and Chinese


Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. The book was published as part of their Essential Knowledge series.

During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!

2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.

The Japanese and Turkish translations will be available in a few months!

Go get the English version of the book on Amazon in print, Kindle and Audio formats.

https://amzn.to/2qC84KN

This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenges. The goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.

Go check it out.

Amazon.com, Amazon.co.uk


Scottish Whisky Data Set – Updated


The Scottish Whisky data set consists of tasting notes and evaluations from 86 distilleries around Scotland. This data set has been around a long time and was part of a promotional site for a book, Whisky Classified: Choosing Single Malts by Flavour. Written by David Wishart of the University of Saint Andrews, the book had its most recent printing in February 2012.

I’ve been using this data set in one of my conference presentations (Planning my Summer Vacation), but to use it I needed to add 2 new attributes/features to the data set. Each of the attributes is listed below and the last 2 are the attributes I added. These were added to include the converted LAT and LONG compatible with Google Maps and other similar mapping technology.

Attributes include:

  • RowID
  • Distillery
  • Body
  • Sweetness
  • Smoky
  • Medicinal
  • Tobacco
  • Honey
  • Spicy
  • Winey
  • Nutty
  • Malty
  • Fruity
  • Floral
  • Postcode
  • Latitude
  • Longitude
  • lat  — newly added
  • long  — newly added

Here is the link to download and use this updated Scottish Whisky data set.

The original website is no longer available but if you have a look at the Internet Archive you will find the website.


#GE2020 Comparing Party Manifestos to 2016


A few days ago I wrote a blog post about using Python to analyze the 2016 general (government) elections manifestos of the four main political parties in Ireland.

Today the two (traditional) largest parties released their #GE2020 manifestos. You can get them by following these links

The following images show the WordClouds generated for the #GE2020 Manifestos. I used the same Python code used in my previous post. If you want to try this out yourself, all the Python code is there.

First let us look at the WordClouds from Fine Gael.

FG2020
2020 Manifesto
FG_2016
2016 Manifesto

Now for the Fianna Fail WordClouds.

FF2020
2020 Manifesto
FF_2016
2016 Manifesto

When you look closely at the differences between the manifestos you will notice there are some common themes across the manifestos from 2016 to those in the 2020 manifestos. It is also interesting to see some new words appearing/disappearing for the 2020 manifestos. Some of these are a little surprising and interesting.

#GE2020 Analysing Party Manifestos using Python

Posted on

The general election is underway here in Ireland with polling day set for Saturday 8th February. All the politicians are out campaigning and every day the various parties are looking for publicity on whatever the popular topic is for that day. Each day it is a different topic.

Most of the political parties have not released their manifestos for the #GE2020 election (as of the date of this post). I want to use some simple Python code to perform some analysis of their manifestos. As their new manifestos weren’t available (yet) I went looking for their manifestos from the previous general election. Michael Pidgeon has a website with party manifestos dating back to the early 1970s, and also has some from earlier elections. Check out his website.

I decided to look at manifestos from the 4 main political parties from the 2016 general election. Yes, there are other manifestos available, and you can use the Python code given below to analyse those, with only some minor edits required.

The end result of this simple analysis is a WordCloud showing the most commonly used words in their manifestos. This is a graphical way to see what some of the main themes and emphasis are for each party, and also allows us to see some commonality between the parties.

Let’s begin with the Python code.

1 – Initial Setup

There are a number of Python Libraries available for processing PDF files. Not all of them worked on all of the Party Manifesto PDFs! It kind of depends on how these files were generated. In my case I used the pdfminer library, as it worked with all four manifestos. The common library PyPDF2 didn’t work with the Fine Gael manifesto document.

import io
import pdfminer
from pprint import pprint
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#directory where manifestos are located
wkDir = '.../General_Election_Ire/'

#define the names of the Manifesto PDF files & setup party flag
pdfFile = wkDir+'FGManifesto16_2.pdf'
party = 'FG'
#pdfFile = wkDir+'Fianna_Fail_GE_2016.pdf'
#party = 'FF'
#pdfFile = wkDir+'Labour_GE_2016.pdf'
#party = 'LB'
#pdfFile = wkDir+'Sinn_Fein_GE_2016.pdf'
#party = 'SF'

All of the following code will run for a given manifesto. Just comment in or out the manifesto you are interested in. The WordClouds for each are given below.

2 – Load the PDF File into Python

The following code loops through each page in the PDF file and extracts the text from that page.

I added some additional code to ignore pages containing the Irish Language. The Sinn Fein Manifesto contained a number of pages which were the Irish equivalent of the preceding pages in English. I didn’t want to have a mixture of languages in the final output.

SF_IrishPages = [14,15,16,17,18,19,20,21,22,23,24]
text = ""

pageCounter = 0
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

#open the PDF with a context manager so the file is closed afterwards
with open(pdfFile, 'rb') as pdf_handle:
    for page in PDFPage.get_pages(pdf_handle, caching=True, check_extractable=True):
        if (party == 'SF') and (pageCounter in SF_IrishPages):
            print(party+' - Not extracting page - Irish page', pageCounter)
        else:
            print(party+' - Extracting Page text', pageCounter)
            page_interpreter.process_page(page)

        pageCounter += 1

#the in-memory handle accumulates the text of all processed pages, so read it once at the end
text = fake_file_handle.getvalue()

print('Finished processing PDF document')
converter.close()
fake_file_handle.close()
FG - Extracting Page text 0
FG - Extracting Page text 1
FG - Extracting Page text 2
FG - Extracting Page text 3
FG - Extracting Page text 4
FG - Extracting Page text 5
...

3 – Tokenize the Words

The next step is to Tokenize the text. This breaks the text into individual words.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = []

tokens = word_tokenize(text)

print('Number of Pages =', pageCounter)
print('Number of Tokens =',len(tokens))
Number of Pages = 140
Number of Tokens = 66975

4 – Filter words, Remove Numbers & Punctuation

There will be a lot of things in the text that we don’t want included in the analysis. We want the text to only contain words. The following extracts the words and ignores numbers, punctuation, etc.

#converts to lower case, and removes punctuation and numbers
wordsFiltered = [token.lower() for token in tokens if token.isalpha()]
print(len(wordsFiltered))
print(wordsFiltered)
58198
['fine', 'gael', 'general', 'election', 'manifesto', 's', 'keep', 'the', 'recovery', 'going', 'gaelgeneral', 'election', 'manifesto', 'foreward', 'from', 'an', 'taoiseach', 'the', 'long', 'term', 'economic', 'three', 'steps', 'to', 'keep', 'the', 'recovery', 'going', 'agriculture', 'and', 'food', 'generational',
...

As you can see the number of tokens has reduced from 66,975 to 58,198.
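To see what this filter keeps and drops, here is a tiny stand-alone sketch on a made-up list of tokens (these are not taken from the manifesto output). It uses the same str.isalpha() test as the code above.

```python
# made-up tokens: words, a number, punctuation, a mixed token and a hyphenated word
sample_tokens = ['Fine', 'Gael', 'Fáil', 'Manifesto', '2016', ',', 'recovery', '€4bn', 'long-term', '.']

# keep only purely alphabetic tokens, converted to lower case
words_only = [token.lower() for token in sample_tokens if token.isalpha()]
print(words_only)
# prints ['fine', 'gael', 'fáil', 'manifesto', 'recovery']
```

Two things are worth noticing: accented letters count as alphabetic, so Irish words survive the filter, and any token containing a hyphen, digit or symbol is dropped completely rather than cleaned up.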

5 – Setup Stop Words

Stop words are general words in a language that don’t carry much meaning on their own, and these can be removed from the data set. Python NLTK comes with a set of stop words defined for a number of languages.

#We initialize the stopwords variable which is a list of words like 
#"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
....

Additional stop words can be added to this list. I added the words listed below. Some of these you might expect to be in the stop word list, others are to remove certain words that appeared in the various manifestos that don’t have a lot of meaning. I also added the names of the parties and some Irish words to the stop words list.

#some extra stop words are needed after examining the data and word cloud
#these are added
extra_stop_words = ['ireland','irish','ł','need', 'also', 'set', 'within', 'use', 'order', 'would', 'year', 'per', 'time', 'place', 'must', 'years', 'much', 'take','make','making','manifesto','ð','u','part','needs','next','keep','election', 'fine','gael', 'gaelgeneral', 'fianna', 'fáil','fail','labour', 'sinn', 'fein','féin','atá','go','le','ar','agus','na','ár','ag','haghaidh','téarnamh','bplean','page','two','number','cothromfor']
stop_words.extend(extra_stop_words)
print(stop_words)

Now remove these stop words from the list of tokens.

# remove stop words from tokenised data set
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(len(filtered_words))
print(filtered_words)
31038
['general', 'recovery', 'going', 'foreward', 'taoiseach', 'long', 'term', 'economic', 'three', 'steps', 'recovery', 'going', 'agriculture', 'food',

The number of tokens is reduced to 31,038
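A small aside: stop_words is a list, so each "not in" check scans the whole list. One possible variation, sketched below with a short stand-in list rather than the full NLTK list, is to convert it to a set first. The variable names and sample words here are assumptions for illustration only.

```python
# Stand-ins for stopwords.words('english') and the extra stop words (hypothetical, shortened)
base_stop_words = ["the", "and", "to", "from", "an"]
extra_words = ["manifesto", "election"]

# A set gives constant-time membership checks
stop_set = set(base_stop_words) | set(extra_words)

words = ["general", "election", "manifesto", "keep", "the", "recovery", "going"]
kept = [word for word in words if word not in stop_set]

print(len(kept))
print(kept)
```

The result is the same as with a list. The difference is only in how quickly the filter runs on a large set of tokens.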

6 – Word Frequency Counts

Now calculate how frequently these words occur in the list of tokens.

#get the frequency of each word
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
    cnt[word] += 1

print(cnt)
Counter({'new': 340, 'support': 249, 'work': 190, 'public': 186, 'government': 177, 'ensure': 177, 'plan': 176, 'continue': 168, 'local': 150, 
...
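The loop above can also be written more compactly, as Counter accepts an iterable directly and most_common() returns the top entries. A minimal sketch with a made-up word list:

```python
from collections import Counter

# Hypothetical word list, standing in for filtered_words
words = ["new", "support", "new", "plan", "support", "new"]

# Counter counts every item of the iterable in one step
cnt_demo = Counter(words)

# The two most frequent words with their counts
top_two = cnt_demo.most_common(2)
print(top_two)
```

This is handy for a quick look at the top words before deciding which extra stop words to add.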

7 – WordCloud

We can use the word frequency counts to add emphasis to the WordCloud. The more frequently a word occurs, the larger it will appear in the WordCloud.

#create a word cloud using frequencies for emphasis 
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=100, margin=9, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20,10))
plt.imshow(wc)
#plt.axis("off")
plt.show()

#Save the image in the img folder:
wc.to_file(wkDir+party+"_2016.png")

The last line of code saves the WordCloud image as a file in the directory where the manifestos are located.

8 – WordClouds for Each Party

[Image: WordClouds for each party]

Remember these WordClouds are for the manifestos from the 2016 general election.

When the parties have released their manifestos for the 2020 general election, I’ll run them through this code and produce the WordClouds for 2020. It will be interesting to see the differences between the 2016 and 2020 manifesto WordClouds.

Data Profiling in Python


With every data analytics and data science project, one of the first tasks that everyone needs to do is to profile the data sets. Data profiling allows you to get an initial picture of the data set and to see data distributions and relationships. Additionally, it allows us to see what kind of data cleaning and data transformations are necessary.

Most data analytics tools and languages have some functionality available to help you. In particular, the various data science/machine learning products have this functionality built in and can do a lot of the data profiling automatically for you. But if you don’t use these tools/products, then you are probably using R and/or Python to profile your data.

With Python you will be working with the data set loaded into a Pandas data frame. From there you will be using various statistical functions and graphing functions (and libraries) to create a data profile, and you will probably go on to create a data profile report.

But one of the challenges with doing this in Python is having different coding for handling numeric and character based attributes/features. The describe function in Pandas (similar to the summary function in R) gives some statistical summaries for numeric attributes/features. A different set of functions is needed for character based attributes. The Python Library repository (https://pypi.org/) contains over 200K projects. But which ones are really useful and will help with your data science projects, especially with new projects and libraries being released on a continual basis? Knowing what is new and useful is a major challenge.

For example, the following shows loading the titanic data set into a Pandas data frame, creating a subset and using the describe function.

import pandas as pd

df = pd.read_csv("/Users/brendan.tierney/Dropbox/4-Datasets/titanic/train.csv")

df.head(5)

Screenshot 2019-11-22 16.58.39

df2 = df.iloc[:,[1,2,4,5,6,7,8,10,11]]
df2.head(5)

Screenshot 2019-11-22 16.59.30

df2.describe()

Screenshot 2019-11-22 17.00.17

You will notice the describe function has only looked at the numeric attributes.
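If you want describe to look at the character based attributes too, one option is its include parameter. The sketch below uses a tiny made-up data frame (not the titanic file, and the column values are hypothetical) to show the idea.

```python
import pandas as pd

# Tiny hypothetical data frame with one character and one numeric column
df_demo = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# include="all" adds count/unique/top/freq for character columns
# alongside the usual numeric statistics
summary = df_demo.describe(include="all")
print(summary)
```

Statistics that do not apply to a column type are shown as NaN, so the output is wider but sparser than the default. It is still a long way short of a full profile report.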

One of those 200K+ Python libraries is called pandas_profiling. This will create a data audit report for both numeric and character based attributes. This must be good, right? Let’s take a look at what it does.

For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values matrix, count, heatmap and dendrogram of missing values

The first step is to install the pandas_profiling library.

pip3 install pandas_profiling

Now run the pandas_profiling report for the same data frame created and used above.

import pandas_profiling as pp

df2.profile_report()

The following images show screen shots of each part of the report.

[Images: screen shots of each section of the pandas_profiling report]

 

Demographics vs Psychographics for Machine Learning


When preparing data for data science, data mining or machine learning projects you will create a data set that describes the various characteristics of the subject or case record. Each attribute will contain some descriptive information about the subject and is related to the target variable in some way.

In addition to these attributes, the data set will be enriched with various other internal/external data to complete the data set.

Some of the attributes in the data set can be grouped under the heading of Demographics. Demographic data contains attributes that explain or describe the person or event each case record is focused on. For example, if the subject of the case record is based on Customer data, this is the “Who” the demographic data (and features/attributes) will be about. Examples of demographic data include:

  • Age range
  • Marital status
  • Number of children
  • Household income
  • Occupation
  • Educational level

These features/attributes are typically readily available within your data sources, and if they aren’t then they may be available from a purchased data set.

Additional feature engineering methods are used to generate new features/attributes that express meaning in different ways. This can be done by combining features in different ways, binning, dimensionality reduction, discretization, various data transformations, etc. The list can go on.

The aim of all of this is to enrich the data set to include more descriptive data about the subject. This enriched data set will then be used by the machine learning algorithms to find the hidden patterns in the data. The richer and more descriptive the data set is, the greater the likelihood of the algorithms detecting the various relationships between the features and their values. These relationships will then be included in the created/generated model.

Another approach to consider when creating and enriching your data set is to move beyond the descriptive features typically associated with Demographic data, to include Psychographic data.

Psychographic data is a variation on demographic data where the features describe the habits of the subject or customer. Demographics focus on the “who” while psychographics focus on the “why”. For example, a common problem with data sets is that they describe subjects/people who have things in common. In such scenarios we want to understand them at a deeper level. Psychographics allow us to do this. Examples of psychographics include:

  • Lifestyle activities
  • Evening activities
  • Purchasing interests – quality over economy, how environmentally concerned are you
  • How happy are you with work, family, etc.
  • Social activities and changes in these
  • What attitudes you have for certain topic areas
  • What are your principles and beliefs

The above gives a far deeper insight into the subject/person and helps to differentiate each subject/person from each other, when there is a high similarity between all subjects in the data set. For example, demographic information might tell you something about a person’s age, but psychographic information will tell you that the person is just starting a family and is in the market for baby products.

I’ll close with this. Consider the various types of data gathering that companies like Google, Facebook, etc. perform. They gather lots of different types of data about individuals. This allows them to build up a complete and extensive profile of all activities for individuals. They can use this to deliver more accurate marketing and advertising. For example, Google gathers data about what places you visit throughout a day, all your search results, and lots of other activities. They can do a lot with this data. But now they own Fitbit. Think about what they can do with that data, particularly when combined with all the other data they have about you. What if they had access to your medical records too? Go Google this! You will find articles about them now having access to your health records. Again, combine all of the data from these different data sources. How valuable is that data?

 

Managing imbalanced Data Sets with SMOTE in Python


When working with data sets for machine learning, lots of the data sets and examples we see have approximately the same number of case records for each of the possible predicted values. In this kind of scenario we are trying to perform some kind of classification, where the machine learning algorithm builds a model based on the input data set against a target variable. It is this target variable that contains the value to be predicted. In most cases this target variable (or feature) will contain binary values or an equivalent in categorical form such as Yes and No, or A and B, etc., or may contain a small number of other possible values (e.g. A, B, C, D).

For the classification algorithm to perform optimally and be able to predict the possible value for a new case record, it will need to see enough case records for each of the possible values. This means it would be good to have approximately the same number of records for each value (there are many ways to overcome this and these are outside the scope of this post). But most data sets, and those that you will encounter in real life work scenarios, are never balanced, as in having a 50-50 split. What we typically encounter might be a 90-10, 98-2, etc. type of split. These data sets are said to be imbalanced.

[Image: two approaches for creating a balanced data set, under-sampling and over-sampling]

The image above gives examples of two approaches for creating a balanced data set. The first is under-sampling. This involves reducing the number of case records in the majority class to match the number of case records in the minority class. The problems with this include that the resulting data set may be too small to be meaningful, and that the case records removed could contain important records and scenarios that the model will need to know about.

The second example is creating a balanced data set by increasing the number of records in the minority class. There are a few approaches to creating this. The first approach is to create duplicate records, from the minority class, until such time as the number of case records is approximately the same for each class. This is the simplest approach. The second approach is to create synthetic records that are statistically equivalent to the original data set. A commonly used technique for this is called SMOTE, Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors algorithm to generate new and synthetic data we can use for training our model. But one of the issues with SMOTE is that it will not create sample records outside the bounds of the original data set. As you can imagine, this would be very difficult to do.
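To give a feel for why SMOTE stays within the bounds of the original data, here is a simplified sketch of its core idea in plain Python: pick a minority record, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is an illustration only and not the Imbalanced-Learn implementation. The function name smote_like_sample and the sample points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a minority
    record and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority records by squared Euclidean distance to the base record
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the line segment between base and neighbour
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority class records with two numeric features
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2]]

rng = random.Random(42)  # fixed seed so the run is repeatable
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(value, 3) for value in point])
```

Because every synthetic point is a blend of two existing records, each feature value always falls between values already seen in the minority class.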

The following examples will illustrate how to perform Under-Sampling and Over-Sampling (duplication and using SMOTE) in Python using functions from Pandas, Imbalanced-Learn and Sci-Kit Learn libraries.

NOTE: The Imbalanced-Learn library (e.g. SMOTE) requires the data to be in numeric format, as its statistical calculations are performed on these values. The Pandas function get_dummies was used as a quick and simple way to generate the numeric values, although this is perhaps not the best method to use in a real project. The other sampling functions can process data sets with both string and numeric values.

Data Set: The Portuguese Banking data set, which is available on the UCI Data Set Repository and many other sites. Here are some basics with that data set.

import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
get_ipython().magic('matplotlib inline')

bank_file = ".../bank-additional-full.csv"

# import dataset
df = pd.read_csv(bank_file, sep=';',)

# get basic details of df (num records, num features)
df.shape
df['y'].value_counts() # dataset is imbalanced with majority of class label as "no".
no     36548
yes     4640
Name: y, dtype: int64
#print bar chart
df.y.value_counts().plot(kind='bar', title='Count (target)');

Screenshot 2019-05-27 09.38.36
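The examples that follow compare y with 0 and 1 and use a data frame called df_new, but the step that creates these is not shown. The sketch below is an assumption of what that preparation might look like, using map for the target and get_dummies (mentioned in the note above) for the other string columns. It uses a tiny stand-in data frame with hypothetical rows rather than the real file.

```python
import pandas as pd

# Tiny stand-in for the bank data set (hypothetical rows, for illustration only)
df_demo = pd.DataFrame({
    "age": [30, 45, 52, 23],
    "marital": ["single", "married", "married", "single"],
    "y": ["no", "no", "yes", "no"],
})

# Map the target to 0/1 so that comparisons such as df['y'] == 0 work
df_demo["y"] = df_demo["y"].map({"no": 0, "yes": 1})

# One-hot encode the remaining string columns; numeric columns are left as they are
df_new_demo = pd.get_dummies(df_demo)

print(df_new_demo.columns.tolist())
print(df_new_demo["y"].value_counts().to_dict())
```

With the real data the same two steps would be applied to df, giving a numeric y in df and a fully numeric df_new for the Imbalanced-Learn examples.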

Example 1a – Down/Under sampling the majority class y=0 (using random sampling)

count_class_0, count_class_1 = df.y.value_counts()

# Divide by class
df_class_0 = df[df['y'] == 0] #majority class
df_class_1 = df[df['y'] == 1] #minority class

# Sample majority class (y=0) to have the same number of records as the minority class (y=1)
df_class_0_under = df_class_0.sample(count_class_1)
# join the dataframes containing y=1 and y=0
df_test_under = pd.concat([df_class_0_under, df_class_1])

print('Random under-sampling:')
print(df_test_under.y.value_counts())
print("Num records = ", df_test_under.shape[0])

df_test_under.y.value_counts().plot(kind='bar', title='Count (target)');
Random under-sampling:
1    4640
0    4640
Name: y, dtype: int64
Num records =  9280

Screenshot 2019-05-27 09.41.06

Example 1b – Down/Under sampling the majority class y=0 using imblearn

from imblearn.under_sampling import RandomUnderSampler

X = df_new.drop('y', axis=1)
Y = df_new['y']

rus = RandomUnderSampler(random_state=42, replacement=True)
X_rus, Y_rus = rus.fit_resample(X, Y)

df_rus = pd.concat([pd.DataFrame(X_rus), pd.DataFrame(Y_rus, columns=['y'])], axis=1)

print('imblearn under-sampling:')
print(df_rus.y.value_counts())
print("Num records = ", df_rus.shape[0])

df_rus.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 1a]

Example 1c – Down/Under sampling the majority class y=0 using Sci-Kit Learn

from sklearn.utils import resample

print("Original Data distribution")
print(df['y'].value_counts())

# Down Sample Majority class
down_sample = resample(df[df['y']==0],
replace = True, # sample with replacement
n_samples = df[df['y']==1].shape[0], # to match minority class
random_state=42) # reproducible results

# Combine minority class with downsampled majority class
train_downsample = pd.concat([df[df['y']==1], down_sample])

# Display new class counts
print('Sci-Kit Learn : resample : Down Sampled data set')
print(train_downsample['y'].value_counts())
print("Num records = ", train_downsample.shape[0])
train_downsample.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 1a]

Example 2a – Over sampling the minority class y=1 (using random sampling)

df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_test_over.y.value_counts())

df_test_over.y.value_counts().plot(kind='bar', title='Count (target)');
Random over-sampling:
1    36548
0    36548
Name: y, dtype: int64

Screenshot 2019-05-27 09.46.08

Example 2b – Over sampling the minority class y=1 using SMOTE

from imblearn.over_sampling import SMOTE

print(df_new.y.value_counts())
X = df_new.drop('y', axis=1)
Y = df_new['y']

sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X, Y)

df_smote_over = pd.concat([pd.DataFrame(X_res), pd.DataFrame(Y_res, columns=['y'])], axis=1)

print('SMOTE over-sampling:')
print(df_smote_over.y.value_counts())

df_smote_over.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 2a]

Example 2c – Over sampling the minority class y=1 using Sci-Kit Learn

from sklearn.utils import resample

print("Original Data distribution")
print(df['y'].value_counts())

# Upsample minority class
train_positive_upsample = resample(df[df['y']==1],
    replace = True, # sample with replacement
    n_samples = df[df['y']==0].shape[0], # to match majority class
    random_state=42) # reproducible results

# Combine majority class with upsampled minority class
train_upsample = pd.concat([df[df['y']==0], train_positive_upsample])

# Display new class counts
print('Sci-Kit Learn : resample : Up Sampled data set')
print(train_upsample['y'].value_counts())
train_upsample.y.value_counts().plot(kind='bar', title='Count (target)');

[same results as Example 2a]

Examples of using Machine Learning on Video and Photo in Public


Over the past 18 months or so most of the examples of using machine learning have been on looking at images and identifying objects in them. There are the typical examples of examining pictures looking for a Cat or a Dog, or some famous person, etc. Most of these examples are fairly simplistic, although they do illustrate important ideas.

But what if this same technology was used to monitor people going about their daily lives? What if pictures and/or video were captured of you as you walked down the street or on your way to work or to a meeting? These pictures and videos are being taken of you without you knowing.

And this raises a wide range of Ethical concerns. There are the ethics of deploying such solutions in the public domain, but there are also ethical concerns for the data scientists, machine learning engineers, and other people working on these projects. “Just because we can, doesn’t mean we should”. People working on one of these projects need to decide if they should be working on it, and if not, what they can do.

Ethics are the principles of behavior based on ideas of right and wrong. Ethical principles often focus on ideas such as fairness, respect, responsibility, integrity, quality, transparency and trust. There is a lot in that statement on Ethics, but we all need to consider what is right and what is wrong. And beyond wrong, what about the grey-ish, borderline scenarios?

Here are some examples that might fall into the grey-ish space between right and wrong. Why they might fall more towards the wrong is because most people are not aware their image is being captured and used, not just for a particular purpose at capture time, but longer term to allow for better machine learning models to be built.

Can you imagine walking down the street with a digital display in front of you? That display is monitoring you, and others, and then presents personalized adverts on the digital display aimed specifically at you. A classic example of this is in the film Minority Report. This is no longer science fiction.

Screenshot 2019-05-10 14.12.55

This is happening at the Westfield shopping center in London and in other cities across the UK and Europe. These digital advertisement screens are monitoring people, identifying their personal characteristics and then customizing the adverts to match the profile of the people walking past. This solution has been developed and rolled out by Ocean Outdoor. They are using machine learning to profile the individual people based on gender, age, facial hair, eye wear, mood, engagement, attention time, group size, etc. They then use this information for:

  1. Optimisation – delivering the appropriate creative to the right audience at the right time.
  2. Visualise – Gaze recognition to trigger creative or an interactive experience
  3. AR Enabled – Using the HD cameras to create an augmented reality mirror or window effect, creating deep consumer engagement via the latest technology
  4. Analytics – Understanding your brand’s audience, post campaign analysis and creative testing

Screenshot 2019-05-10 14.19.35.png

Face Plus Plus can monitor people walking down the street and do similar profiling, and can bring it to another level whereby they can identify what clothing you are wearing and what the brand is. Imagine if you combine this with location based services. As an example of this, imagine you are walking down the high street or a major retail district. People approach you trying to entice you into going into a particular store, and they offer certain discounts. But you are with a friend and the store is not interested in them.

Screenshot 2019-05-10 14.28.23

The store is using video monitoring, capturing details of every person walking down the street who is about to pass the store. The video is analyzed using machine/deep learning to work out your profile and what brands you are wearing. The store has a team of people who are deployed to stop and engage with certain individuals, just because they match the brands or interests of the store, and depending on what brands you are wearing they can offer customized discounts and offers to you.

How comfortable would you be with this? How comfortable would you be about going shopping now?

For me, I would not like this at all, but I can understand why stores and retail outlets are interested, as they are all working in a very competitive market trying to maximize every dollar or euro they can get.

Alongside the ethical concerns, we also have some legal aspects to consider. Some of these are a bit in the grey-ish area, as some aspects of these kinds of scenarios are only partly addressed by EU GDPR and the EU Artificial Intelligence guidelines. But what about other countries around the World? When it comes to training and deploying these facial models, they are dependent on having a good training data set. This means they need lots and lots of pictures of people, and these pictures need to be labelled with descriptive information about the person. For these public deployments of facial recognition systems, they will need more and more training samples/pictures. This will allow the models to improve and evolve over time. But how will these applications get these new pictures? They claim they don’t keep any of the images of people. They only take the picture, use the model on it, and then perform some action. They claim they do not keep the images! But how can they improve and evolve their solution?

I’ll have another blog post giving more examples of how machine/deep learning, video and image captures are being used to monitor people going about their daily lives.