The Scottish Whiskey data set consist of tasting notes and evaluations from 86 distilleries around Scotland. This data set has been around a long time andwas a promotional site for a book, Whisky Classified: Choosing Single Malts by Flavour. Written by David Wishart of the University of Saint Andrews, the book had its most recent printing in February 2012.
I’ve been using this data set in one of my conference presentations (Planning my Summer Vacation), but to use this data set I need to add 2 new attributes/features to the data set. Each of the attributes are listed below and the last 2 are the attributes I added. These were added to include the converted LAT and LONG comparable with Google Maps and other similar mapping technology.
- lat — newly added
- long — newly added
Here is the link to download and use this updated Scottish Whisky data set.
The original website is no longer available but if you have a look at the Internet Archive you will find the website.
When preparing data for data science, data mining or machine learning projects you will create a data set that describes the various characteristics of the subject or case record. Each attribute will contain some descriptive information about the subject and is related to the target variable in some way.
In addition to these attributes, the data set will be enriched with various other internal/external data to complete the data set.
Some of the attributes in the data set can be grouped under the heading of Demographics. Demographic data contains attributes that explain or describe the person or event each case record is focused on. For example, if the subject of the case record is based on Customer data, this is the “Who” the demographic data (and features/attributes) will be about. Examples of demographic data include:
- Age range
- Marital status
- Number of children
- Household income
- Educational level
These features/attributes are typically readily available within your data sources and if they aren’t then these name be available from a purchased data set.
Additional feature engineering methods are used to generate new features/attributes that express meaning is different ways. This can be done by combining features in different ways, binning, dimensionality reduction, discretization, various data transformations, etc. The list can go on.
The aim of all of this is to enrich the data set to include more descriptive data about the subject. This enriched data set will then be used by the machine learning algorithms to find the hidden patterns in the data. The richer and descriptive the data set is the greater the likelihood of the algorithms in detecting the various relationships between the features and their values. These relationships will then be included in the created/generated model.
Another approach to consider when creating and enriching your data set is move beyond the descriptive features typically associated with Demographic data, to include Pyschographic data.
Psychographic data is a variation on demographic data where the feature are about describing the habits of the subject or customer. Demographics focus on the “who” while psycographics focus on the “why”. For example, a common problem with data sets is that they describe subjects/people who have things in common. In such scenarios we want to understand them at a deeper level. Psycographics allows us to do this. Examples of Psycographics include:
- Lifestyle activities
- Evening activities
- Purchasing interests – quality over economy, how environmentally concerned are you
- How happy are you with work, family, etc
- Social activities and changes in these
- What attitudes you have for certain topic areas
- What are your principles and beliefs
The above gives a far deeper insight into the subject/person and helps to differentiate each subject/person from each other, when there is a high similarity between all subjects in the data set. For example, demographic information might tell you something about a person’s age, but psychographic information will tell you that the person is just starting a family and is in the market for baby products.
I’ll close with this. Consider the various types of data gathering that companies like Google, Facebook, etc perform. They gather lots of different types of data about individuals. This allows them to build up a complete and extensive profile of all activities for individuals. They can use this to deliver more accurate marketing and advertising. For example, Google gathers data about what places to visit throughout a data, they gather all your search results, and lots of other activities. They can do a lot with this data. but now they own Fitbit. Think about what they can do with that data and particularly when combined with all the other data they have about you. What if they had access to your medical records too! Go Google this ! You will find articles about them now having access to your health records. Again combine all of the data from these different data sources. How valuable is that data?
When working with data sets for machine learning, lots of these data sets and examples we see have approximately the same number of case records for each of the possible predicted values. In this kind of scenario we are trying to perform some kind of classification, where the machine learning model looks to build a model based on the input data set against a target variable. It is this target variable that contains the value to be predicted. In most cases this target variable (or feature) will contain binary values or equivalent in categorical form such as Yes and No, or A and B, etc or may contain a small number of other possible values (e.g. A, B, C, D).
For the classification algorithm to perform optimally and be able to predict the possible value for a new case record, it will need to see enough case records for each of the possible values. What this means, it would be good to have approximately the same number of records for each value (there are many ways to overcome this and these are outside the score of this post). But most data sets, and those that you will encounter in real life work scenarios, are never balanced, as in having a 50-50 split. What we typically encounter might be a 90-10, 98-2, etc type of split. These data sets are said to be imbalanced.
The image above gives examples of two approaches for creating a balanced data set. The first is under-sampling. This involves reducing the class that contains the majority of the case records and reducing it to match the number of case records in the minor class. The problems with this include, the resulting data set is too small to be meaningful, the case records removed could contain important records and scenarios that the model will need to know about.
The second example is creating a balanced data set by increasing the number of records in the minority class. There are a few approaches to creating this. The first approach is to create duplicate records, from the minor class, until such time as the number of case records are approximately the same for each class. This is the simplest approach. The second approach is to create synthetic records that are statistically equivalent of the original data set. A commonly technique used for this is called SMOTE, Synthetic Minority Oversampling Technique. SMOTE uses a nearest neighbors algorithm to generate new and synthetic data we can use for training our model. But one of the issues with SMOTE is that it will not create sample records outside the bounds of the original data set. As you can image this would be very difficult to do.
The following examples will illustrate how to perform Under-Sampling and Over-Sampling (duplication and using SMOTE) in Python using functions from Pandas, Imbalanced-Learn and Sci-Kit Learn libraries.
NOTE: The Imbalanced-Learn library (e.g. SMOTE)requires the data to be in numeric format, as it statistical calculations are performed on these. The python function get_dummies was used as a quick and simple to generate the numeric values. Although this is perhaps not the best method to use in a real project. With the other sampling functions can process data sets with a sting and numeric.
Data Set: Is the Portuaguese Banking data set and is available on the UCI Data Set Repository, and many other sites. Here are some basics with that data set.
import warnings import pandas as pd import numpy as np import matplotlib.pyplot as plt get_ipython().magic('matplotlib inline') bank_file = ".../bank-additional-full.csv" # import dataset df = pd.read_csv(bank_file, sep=';',) # get basic details of df (num records, num features) df.shape
df['y'].value_counts() # dataset is imbalanced with majority of class label as "no".
no 36548 yes 4640 Name: y, dtype: int64
#print bar chart df.y.value_counts().plot(kind='bar', title='Count (target)');
Example 1a – Down/Under sampling the majority class y=1 (using random sampling)
count_class_0, count_class_1 = df.y.value_counts() # Divide by class df_class_0 = df[df['y'] == 0] #majority class df_class_1 = df[df['y'] == 1] #minority class # Sample Majority class (y=0, to have same number of records as minority calls (y=1) df_class_0_under = df_class_0.sample(count_class_1) # join the dataframes containing y=1 and y=0 df_test_under = pd.concat([df_class_0_under, df_class_1]) print('Random under-sampling:') print(df_test_under.y.value_counts()) print("Num records = ", df_test_under.shape) df_test_under.y.value_counts().plot(kind='bar', title='Count (target)');
Random under-sampling: 1 4640 0 4640 Name: y, dtype: int64 Num records = 9280
Example 1b – Down/Under sampling the majority class y=1 using imblearn
from imblearn.under_sampling import RandomUnderSampler X = df_new.drop('y', axis=1) Y = df_new['y'] rus = RandomUnderSampler(random_state=42, replacement=True) X_rus, Y_rus = rus.fit_resample(X, Y) df_rus = pd.concat([pd.DataFrame(X_rus), pd.DataFrame(Y_rus, columns=['y'])], axis=1) print('imblearn over-sampling:') print(df_rus.y.value_counts()) print("Num records = ", df_rus.shape) df_rus.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 1a]
Example 1c – Down/Under sampling the majority class y=1 using Sci-Kit Learn
from sklearn.utils import resample print("Original Data distribution") print(df['y'].value_counts()) # Down Sample Majority class down_sample = resample(df[df['y']==0], replace = True, # sample with replacement n_samples = df[df['y']==1].shape, # to match minority class random_state=42) # reproducible results # Combine majority class with upsampled minority class train_downsample = pd.concat([df[df['y']==1], down_sample]) # Display new class counts print('Sci-Kit Learn : resample : Down Sampled data set') print(train_downsample['y'].value_counts()) print("Num records = ", train_downsample.shape) train_downsample.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 1a]
Example 2 a – Over sampling the minority call y=0 (using random sampling)
df_class_1_over = df_class_1.sample(count_class_0, replace=True) df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0) print('Random over-sampling:') print(df_test_over.y.value_counts()) df_test_over.y.value_counts().plot(kind='bar', title='Count (target)');
Random over-sampling: 1 36548 0 36548 Name: y, dtype: int64
Example 2b – Over sampling the minority call y=0 using SMOTE
from imblearn.over_sampling import SMOTE print(df_new.y.value_counts()) X = df_new.drop('y', axis=1) Y = df_new['y'] sm = SMOTE(random_state=42) X_res, Y_res = sm.fit_resample(X, Y) df_smote_over = pd.concat([pd.DataFrame(X_res), pd.DataFrame(Y_res, columns=['y'])], axis=1) print('SMOTE over-sampling:') print(df_smote_over.y.value_counts()) df_smote_over.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 2a]
Example 2c – Over sampling the minority call y=0 using Sci-Kit Learn
from sklearn.utils import resample print("Original Data distribution") print(df['y'].value_counts()) # Upsample minority class train_positive_upsample = resample(df[df['y']==1], replace = True, # sample with replacement n_samples = train_zero.shape, # to match majority class random_state=42) # reproducible results # Combine majority class with upsampled minority class train_upsample = pd.concat([train_negative, train_positive_upsample]) # Display new class counts print('Sci-Kit Learn : resample : Up Sampled data set') print(train_upsample['y'].value_counts()) train_upsample.y.value_counts().plot(kind='bar', title='Count (target)');
[same results as Example 2a]
This is the fourth blog post of a series on using Oracle Text, Oracle R Enterprise and Oracle Data Mining. Make sure to check out the previous blog posts as each one builds upon each other.
In this blog post, I will have an initial look at how you can use Oracle Text to perform document classification. In my next blog post, in the series, I will look at how you can use Oracle Data Mining with Oracle Text to perform classification.
The area of document classification using Oracle Text is a well trodden field and there are lots and lots of material out there to assist you. This blog post will look at the core steps you need to follow and how Oracle Text can help you with classifying your documents or text objects in a table.
When you use Oracle Text for documentation classification the simplest approach is to use ‘Rule-based Classification’. With this approach you will defined a set of rules, when applied to the document will determine classification that will be assigned to the document.
There is a little bit of setup and configuration needed to make this happen. This includes the following.
- Create a table that will store you document. See my previous blog posts in the series to see an example of one that is used to store the text from webpages.
- Create a rules table. This will contain the classification label and then a set of rules that will be used by Oracle Text to determine that classification to assign to the document. These are in the format similar to what you might see in the WHERE clause of a SELECT statement. You will need follow the rules and syntax of CTXRULES to make sure your rules fire correctly.
- Create a CTXRULE index on the rules table you created in the previous step.
- Create a table that will be a link table between the table that contains your documents and the table that contains your categories.
When you have these steps completed you can now start classifying your documents. The following example illustrates using these steps using the text documents I setup in my previous blog posts.
Here is the structure of my documents table. I had also created an Oracle Text
CTXSYS.CONTEXT index on the DOC_TEXT attribute.
create table MY_DOCUMENTS ( doc_pk NUMBER(10) PRIMARY KEY, doc_title VARCHAR2(100), doc_extracted DATE, data_source VARCHAR2(200), doc_text CLOB );
The next step is to create a table that contains our categories and rules. The structure of this table is very simple, and the following is an example.
create table DOCUMENT_CATEGORIES ( doc_cat_pk NUMBER(10) PRIMARY KEY, doc_category VARCHAR2(40), doc_cat_query VARCHAR2(2000) ); create sequence doc_cat_seq;
Now we can create the table that will store the identified document categories/classifications for each of out documents. This is a link table that contains the primary keys from the MY_DOCUMENTS and the MY_DOCUMENT_CATEGORIES tables.
create table MY_DOC_CAT ( doc_pk NUMBER(10), doc_cat_pk NUMBER(10) );
Queries for CTXRULE are similar to those of CONTAINS queries. Basic phrasing within quotes is supported, as are the following CONTAINS operators: ABOUT, AND, NEAR, NOT, OR, STEM, WITHIN, and THESAURUS. The following statements contain my rules.
insert into document_categories values (doc_cat_seq.nextval, 'OAA','Oracle Advanced Analytics'); insert into document_categories values (doc_cat_seq.nextval, 'Oracle Data Mining','ODM or Oracle Data Mining'); insert into document_categories values (doc_cat_seq.nextval, 'Oracle Data Miner','ODMr or Oracle Data Miner or SQL Developer'); insert into document_categories values (doc_cat_seq.nextval, 'R Technologies','Oracle R Enterprise or ROacle or ORAACH or R');
We are now ready to create the Oracle Text CTXRULE index.
create index doc_cat_idx on document_categories(doc_cat_query) indextype is ctxsys.ctxrule;
Our next step is to apply the rules and to generate the categories/classifications. We have two scenarios to deal with here. The first is how do we do this for our existing records and the second to how can you do this ongoing as new documents get loaded into the MY_DOCUMENTS table.
For the first scenario, where the documents already exist in our table, we can can use a procedure, just like the following.
DECLARE v_document MY_DOCUMENTS.DOC_TEXT%TYPE; v_doc MY_DOCUMENTS.DOC_PK%TYPE; BEGIN for doc in (select doc_pk, doc_text from my_documents) loop v_document := doc.doc_text; v_doc := doc.doc_pk; for c in (select doc_cat_pk from document_categories where matches(doc_cat_query, v_document) > 0 ) loop insert into my_doc_cat values (doc.doc_pk, c.doc_cat_pk); end loop; end loop; END; /
Let us have a look at the categories/classifications that were generated.
select a.doc_title, c.doc_cat_pk, b.doc_category from my_documents a, document_categories b, my_doc_cat c where a.doc_pk = c.doc_pk and c.doc_cat_pk = b.doc_cat_pk order by a.doc_pk, c.doc_cat_pk;
We can see the the categorisation/classification actually gives us the results we would have expected of these documents/web pages.
Now we can look at how to generate these these categories/classifications on an on going basis. For this we will need a database trigger on the MY_DOCUMENTS table. Something like the following should do the trick.
CREATE or REPLACE TRIGGER t_cat_doc before insert on MY_DOCUMENTS for each row BEGIN for c in (select doc_cat_pk from document_categories where matches(doc_cat_query, :new.doc_text)>0) loop insert into my_doc_cat values (:new.doc_pk, c.doc_cat_pk); end loop; END;
At this point we have now worked through how to build and use Oracle Text to perform Rule based document categorisation/classification.
In addition to this type of classification, Oracle Text also has uses some machine learning algorithms to classify documents. These include using Decision Trees, Support Vector Machines and Clustering. It is important to note that these are not the machine learning algorithms that come as part of Oracle Data Mining. Look out of my other blog posts that cover these topics.
This is the fourth and last blog post in a series that looks at how you can examine the details of predicted clusters using Oracle Data Mining. In the previous blog posts I looked at how to use CLUSER_ID, CLUSTER_PROBABILITY and CLUSTER_SET.
In this blog post we will look at CLUSTER_DISTANCE. We can use the function to determine how close a record is to the centroid of the cluster. Perhaps we can use this to determine what customers etc we might want to focus on most. The customers who are closest to the centroid are one we want to focus on first. So we can use it as a way to prioritise our workflows, particularly when it is used in combination with the value for CLUSTER_PROBABILITY.
Here is an example of using CLUSTER_DISTANCE to list all the records that belong to Cluster 14 and the results are ordered based on closeness to the centroid of this cluster.
SELECT customer_id, cluster_probability(clus_km_1_37 USING *) as cluster_Prob, cluster_distance(clus_km_1_37 USING *) as cluster_Distance FROM insur_cust_ltv_sample WHERE cluster_id(clus_km_1_37 USING *) = 14 order by cluster_Distance asc;
Here is a subset of the results from this query.
When you examine the results you may notice that the records that is listed first and closest record to the centre of cluster 14 has a very low probability. You need to remember that we are working in a N-dimensional space here. Although this first record is closest to the centre of cluster 14 it has a really low probability and if we examine this record in more detail we will find that it is at an overlapping point between a number of clusters.
This is why we need to use the CLUSTER_DISTANCE and CLUSTER_PROBABILITY functions together in our workflows and applications to determine how we need to process records like these.
This is the third blog post on my series on examining the Clusters that were predicted by an Oracle Data Mining model. Check out the previous blog posts.
- Part 1 – Examining predicted Clusters and Cluster details using SQL
- Part 2 – Cluster Details with Oracle Data Mining
In the previous posts we were able to list the predicted cluster for each record in our data set. This is the cluster that the records belonged to the most. I also mentioned that a record could belong to many clusters.
So how can you list all the clusters that the a record belongs to?
You can use the CLUSTER_SET SQL function. This will list the Cluster Id and a probability measure for each cluster. This function returns a array consisting of the set of all clusters that the record belongs to.
The following example illustrates how to use the CLUSTER_SET function for a particular cluster model.
SELECT t.customer_id, s.cluster_id, s.probability FROM (select customer_id, cluster_set(clus_km_1_37 USING *) as Cluster_Set from insur_cust_ltv_sample WHERE customer_id in ('CU13386', 'CU100')) T, TABLE(T.cluster_set) S order by t.customer_id, s.probability desc;
The output from this query will be an ordered data set based on the customer id and then the clusters listed in descending order of probability. The cluster with the highest probability is what would be returned by the CLUSTER_ID function. The output from the above query is shown below.
If you would like to see the details of each of the clusters and to examine the differences between these clusters then you will need to use the CLUSTER_DETAILS function (see previous blog post).
You can specify topN and cutoff to limit the number of clusters returned by the function. By default, both topN and cutoff are null and all clusters are returned.
– topN is the N most probable clusters. If multiple clusters share the Nth probability, then the function chooses one of them.
– cutoff is a probability threshold. Only clusters with probability greater than or equal to cutoff are returned. To filter by cutoff only, specify NULL for topN.
You may want to use these individually or combined together if you have a large number of customers. To return up to the N most probable clusters that are greater than or equal to cutoff, specify both topN and cutoff.
The following example illustrates using the topN value to return the top 4 clusters.
SELECT t.customer_id, s.cluster_id, s.probability FROM (select customer_id, cluster_set(clus_km_1_37, 4, null USING *) as Cluster_Set from insur_cust_ltv_sample WHERE customer_id in ('CU13386', 'CU100')) T, TABLE(T.cluster_set) S order by t.customer_id, s.probability desc;
and the output from this query shows only 4 clusters displayed for each record.
Alternatively you can select the clusters based on a cut off value for the probability. In the following example this is set to 0.05.
SELECT t.customer_id, s.cluster_id, s.probability FROM (select customer_id, cluster_set(clus_km_1_37, NULL, 0.05 USING *) as Cluster_Set from insur_cust_ltv_sample WHERE customer_id in ('CU13386', 'CU100')) T, TABLE(T.cluster_set) S order by t.customer_id, s.probability desc;
and the output this time looks a bit different.
Finally, yes you can combine these two parameters to work together.
SELECT t.customer_id, s.cluster_id, s.probability
FROM (select customer_id, cluster_set(clus_km_1_37, 2, 0.05 USING *) as Cluster_Set
WHERE customer_id in (‘CU13386’, ‘CU100’)) T,
order by t.customer_id, s.probability desc;
PMML (Predictive Model Markup Langauge) is an XML formatted output that defines the core elements and settings for your Predictive Models. This XML formatted output can be used to migrate your models from one data mining or predictive modelling tool to another data mining or predictive modelling tool, such as Oracle.
Using PMML to migrate your models from one tool to another allows for you to use the most appropriate tools for developing your models and then allows them to be imported into another tool that will be used for deploying your predictive models in batch or real-time mode. In particular the ability to use your Predictive Model within your everyday applications enables you to work in the area of Automatic or Prescriptive Analytics. Oracle Data Mining and the Oracle Database are ideal or even the best possible tools to allow for Automatic and Prescriptive Analytics for your transa
PMML is an XML based standard specified by the Data Mining Group
Oracle Data Mining supports the importing of PMML models that are compliant with version 3.1 of the standard and for Regression Models only. The regression models can be for linear regression or binary logistic regression.
The Data Mining Group Archive webpage have a number of sample PMML files for you to download and then to load into your Oracle database.
To Load the PMML file into your Oracle Database you can use the DBMS_DATA_MINING.IMPORT_MODEL function. I’ve given examples of how you can use this function to import an Oracle Data Mining model that was exported using the EXPORT_MODEL function.
The syntax of the IMPORT_MODEL function when importing a PMML file is the following
DBMS_DATA_MINING.IMPORT_MODEL ( model_name IN VARCHAR2, pmmldoc IN XMLTYPE strict_check IN BOOLEAN DEFAULT FALSE);
The following example shows how you can load the version 3.1 Logistic Regression PMML file from the Data Mining Group archive webpage
BEGIN dbms_data_mining.IMPORT_MODEL (‘PMML_MODEL', XMLType (bfilename (‘IMPORT_DIR', 'sas_3.1_iris_logistic_reg.xml'), nls_charset_id ('AL32UTF8') )); END;
This example uses the default value for STRICT_CHECK as FALASE. In this case if there are any errors in the PMML structure then these will be ignored and the imported model may contain “features” that may make it perform in a slightly odd manner.