Clustering in Oracle Data Miner–Part 4
This is the fourth part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
- The first part looked at what clustering features exist in ODM and how to set up the data that we will be using in the examples
- The second part focused on how to build Clusters in ODM.
- The third post focused on examining the clusters produced by ODM and how to apply them to new data using ODM.
- The fourth post (this post) will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
- The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
With Clustering we are trying to find hidden patterns in our data. Unlike classification, we are not directing the algorithms on what areas/attributes to focus on.
In my preceding post on Clustering in Oracle Data Miner I showed how you can use the Oracle Data Miner (ODM) tool, that is part of SQL Developer, to build a cluster model and to apply it to new data.
As I have shown in previous blog posts, we can do all of the same steps using SQL and PL/SQL. In this blog post I will show you how to use these SQL and PL/SQL functions to generate a Clustering model and how you can evaluate the model produced.
Step 1 – What Cluster Models do we already have
Oracle Data Mining comes with a number of data dictionary views. We can use these to see what models already exist in our schema. In the example below we will have a look at the Cluster models that were produced in the Part 2 blog post.
column model_name format a20
column mining_function format a20
column algorithm format a20
SELECT model_name,
mining_function,
algorithm,
build_duration,
model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';
We can also look at the model settings that ODM produced. We can look at the settings of the K-Means model. This model (in my case) is called CLUS_KM_1_25.
column setting_value format a20
select setting_name,
setting_value,
setting_type
from all_mining_model_settings
where model_name = 'CLUS_KM_1_25';
We can also look to see what attributes are used in the K-Means model.
SELECT attribute_name,
attribute_type,
usage_type,
target
from all_mining_model_attributes
where model_name = 'CLUS_KM_1_25';
I will show you how to use this Clustering model or the one that we are about to generate in my next blog post.
Step 2 – Creating the Settings table
When creating an Oracle Data Mining model in SQL you need to create a Settings table. This will contain a record for each setting that you need to set for the algorithm you want to run. The settings table has a very basic structure and only consists of two columns. It is a good idea to create a separate Settings table for each algorithm or type of data mining you want to do. To create the settings table we run the following:
CREATE TABLE CLUSTER_SETTINGS (
Setting_Name VARCHAR2(30),
Setting_Value VARCHAR2(4000));
The next step is to define the settings we want to use. The first of these will be the actual algorithm you want to run. Oracle has two Clustering algorithms: ALGO_KMEANS and ALGO_O_CLUSTER. In the examples we are working through we are using the K-Means algorithm and we want to set the number of clusters to 10. We are also going to turn Automatic Data Preparation (ADP) on. Here is the code.
BEGIN
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES (dbms_data_mining.algo_name, dbms_data_mining.ALGO_KMEANS);
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES (dbms_data_mining.prep_auto, dbms_data_mining.PREP_AUTO_ON);
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES (dbms_data_mining.clus_num_clusters, 10);
COMMIT;
END;
/
We can check to see that these records were inserted.
column setting_value format a25
select * from CLUSTER_SETTINGS;
The other settings will depend on which clustering algorithm is being used. Each algorithm has its own settings. If you do not define any additional settings, Oracle will use the in-built default settings.
To see what the default settings are:
column setting_value format a35
select *
from table(dbms_data_mining.get_default_settings)
where setting_name like 'KM%';
If you want to use a different value to the default, just create a new record in the CLUSTER_SETTINGS table with the new value.
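For example, here is a minimal sketch of overriding one of the K-Means defaults. KMNS_ITERATIONS is one of the KM% settings listed by the query above; the value 5 is purely illustrative, so check the valid range for your release.
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES ('KMNS_ITERATIONS', 5);
COMMIT;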
Step 3 – Create the new Cluster Model
We have the algorithm settings ready, so the next step is to create the Cluster model. We will be using the DBMS_DATA_MINING.CREATE_MODEL procedure and pointing it at our CLUSTER_SETTINGS table.
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'CLUSTER_KMEANS_MODEL',
mining_function => dbms_data_mining.clustering,
data_table_name => 'INSURANCE_CUST_LTV',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => null,
settings_table_name => 'CLUSTER_SETTINGS');
END;
/
This takes a few seconds to run on my laptop. When it has finished we can use queries like those in Step 1 above to find out the details of what was produced.
Step 4 – Examining the Cluster model
The first thing that we will need to check is that the model was created.
column model_name format a20
column mining_function format a15
column algorithm format a15
SELECT model_name,
mining_function,
algorithm,
build_duration,
model_size
FROM ALL_MINING_MODELS
WHERE mining_function = 'CLUSTERING';
One thing that is a bit curious is the model size. We have basically created two K-Means models using the same settings, but the final models have different sizes. See below for an explanation.
Now let’s look at the algorithm settings for our new Cluster model (CLUSTER_KMEANS_MODEL).
select setting_name,
setting_value,
setting_type
from all_mining_model_settings
where model_name = 'CLUSTER_KMEANS_MODEL';
We can also look at the attributes used in the clusters.
SELECT attribute_name,
attribute_type,
usage_type,
target
from all_mining_model_attributes
where model_name = 'CLUSTER_KMEANS_MODEL';
If we compare this list back to the list of attributes that were part of the model we developed in the ODM tool, you will see that we have some extra attributes listed for the new K-Means model.
The reason for these extra attributes and the bigger size (although still small) is that we applied sampling in the ODM tool. We took a subset of the data in the ODM tool and built the model based on this data. For the model we just created in SQL we took all the data (no sampling). Because it was looking at a much bigger data set, it ended up including more attributes.
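If you wanted the SQL-built model to match the tool’s model more closely, a hedged sketch of one approach is to build it on a sampled copy of the data; the table name and the 60% sample size here are illustrative.
CREATE TABLE INSURANCE_CUST_SAMPLE AS
SELECT * FROM INSURANCE_CUST_LTV SAMPLE (60);
-- then pass 'INSURANCE_CUST_SAMPLE' as the data_table_name in CREATE_MODEL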
Step 5 – Examining the Cluster Details
In this step we are going to look at some of the clusters produced by our new Cluster model.
We can produce the information relating to the clusters that were produced and the level in the hierarchy that each cluster belongs to. We can use this information to create/draw our own cluster hierarchy diagram.
set lines 90
column child format a40
select id,
record_count,
parent,
tree_level,
child
from table(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'));
To look at the Centroid details for a particular cluster e.g. Cluster ID = 7, we can run:
column attribute_name format a25
column mode_value format a25
select t.id,
c.attribute_name,
c.mean,
c.mode_value,
c.variance
from table (dbms_data_mining.get_model_details_KM('CLUSTER_KMEANS_MODEL')) t,
table(t.centroid) c
where t.id = 7
order by c.attribute_name;
The results will contain details of the centroid for cluster number 7. The centroid details will include the attributes that were used to form the centroid and the values for each attribute.
The last thing we can look at is the rule for a cluster. The following query will give you the full details of the cluster rule for Cluster 7. For each attribute this gives the values or ranges of values used for that cluster. I’ll let you run this yourself as it returns a large number of records. Run it and have a look at some of the output.
select t.id,
a.attribute_name,
a.conditional_operator,
nvl(a.attribute_str_value,
a.attribute_num_value) AS value,
a.attribute_support,
a.attribute_confidence
from TABLE(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL')) t,
TABLE(t.rule.antecedent) a
where t.id = 7
ORDER BY t.id, a.attribute_name, attribute_support, attribute_confidence desc, value;
My next blog post on Clustering in Oracle, will be the final one in the series. It will show you how to apply a Clustering model to new data in your database.
Clustering in Oracle Data Miner–Part 3
This is the third part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
- The first part looked at what clustering features exist in ODM and how to set up the data that we will be using in the examples
- The second part focused on how to build Clusters in ODM.
- The third post (this post) will focus on examining the clusters produced by ODM and how to apply them to new data using ODM.
- The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
- The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
In my previous posts on Clustering in ODM we have setup our data, we have explored it, we have taken a sample of the data and used this sample as input to the Cluster Build node. Oracle Data Miner has two clustering algorithms and our Cluster Build node created a clustering model for each.
In this post we will look at the next steps. The first of these is to examine the clustering models that ODM produced. In the second part we will look at how we can use one of these clustering models to apply cluster labels to new data.
Step 1 – View the Cluster Models
To view the cluster models we need to right click the Cluster Build node and select View Models from the drop down list. We get an additional drop down menu that gives the names of the two cluster models that were developed.
In my case these are called CLUS_KM_1_25 and CLUS_OC_1_25. You may get different numbers on your model names. These numbers are generated internally in ODM.
The first model that we will look at will be the K-Means Cluster Model (CLUS_KM_1_25). Select this from the menu.
Step 2 – View the Cluster Rules
The hierarchical K-Means cluster model will be displayed. You might need to readjust/resize some of the worksheets/message panes etc. in ODM to get a good portion of the diagram to display.
With ODM you cannot change, alter, merge or split any of the clusters that were generated. Oracle takes the view: this is what we have found, it is up to you now to decide how you are going to use it.
To see what the cluster rules are for each cluster you can click on a cluster. When you do this you should get a pane (under the cluster diagram) that contains two tabs, Centroid and Cluster Rule.
The Centroid tab provides a list of the attributes that best define the selected cluster, along with the average value for each attribute and some basic statistical information.
The Cluster Rules tab contains a set of rules that define the cluster in an IF/THEN statement format.
For each cluster in the tree we can see the number of cases in each cluster and the percentage of overall cases for this cluster.
Work your way down the tree exploring each of the clusters produced.
The further down the tree you go, the smaller the percentage of cases that fall into each cluster. In some tools you can merge these clusters. Not so in ODM. What you have to do is use an IF statement in your code. Something like IF cluster_num IN (16, 17, 18, 19) THEN ….., as in the sketch below.
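Here is a hedged SQL sketch of grouping several clusters into one segment once the model exists in the database. The cluster numbers and segment name are illustrative; CLUS_KM_1_25 is the K-Means model from Step 1.
SELECT customer_id,
       CASE WHEN CLUSTER_ID(CLUS_KM_1_25 USING *) IN (16, 17, 18, 19)
            THEN 'Merged Segment'
            ELSE 'Other'
       END AS segment
FROM   INSURANCE_CUST_LTV;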
Step 3 – Compare Clusters
In addition to the cluster tree, ODM also has two additional tabs that allow us to explore the clusters. These are the Detail and Compare tabs.
Click on the Detail tab. We now get a detailed screen that contains various statistical information for each attribute. For each attribute we can get a histogram of the values within that attribute for this cluster.
We can use this information to start building up a picture of what each cluster might represent based on the values (and their distribution) for each cluster.
Try this out for a few clusters.
Step 4 – Multi-Cluster – Multi-variable Comparison of Clusters
The next level of comparison and evaluation of the clusters can be found under the Compare tab.
This lets us compare two clusters against each other at an attribute level. For example, let us compare clusters 4 and 9. The attribute and graphics section gets updated to reflect the data for each cluster. These are colour coded to distinguish the two clusters.
We can work our way down through each attribute and again we can use this information to help us to understand what each cluster might represent.
An additional feature here is that we can do multi-variable (attribute) comparison. Holding down the control key, select LTV_BIN, SEX and AGE. With each selection we get a new graph appearing at the bottom of the screen. This shows the distribution of the values by attribute for each cluster. We can learn a lot from this.
So one possible conclusion we could draw from this data would be that Cluster 4 could be ‘Short Term Value Customers’ and Cluster 9 could be ‘Long Term Value Customers’.
Step 5 – Renaming Clusters
When you have discovered a possible meaning for a Cluster, you can give it a meaningful name instead of it having a number. In our example, we would like to re-label Cluster 4 to ‘Short Term Value Customers’. To do this click on the Edit button that is beside the drop down that has cluster 4. Enter the new label and click OK.
In the drop down we will now get the new label appearing instead of the cluster number.
Similarly we can do this for the other cluster e.g. ‘Long Term Value Customer’.
We have just looked at how to explore our K-Means model. You can do similar exploration of the O-Cluster model. I’ll leave that for you to do.
We have now explored our clusters and we have decided which of our Clustering Models best suits our needs. In our scenario we are going to select the K-Means model to apply and label our new data.
Step 1 – Create the Apply Node
We have already set up the sample of data that we are going to use as our Apply Data Set. We did this when we set up the two different Sample nodes.
We are going to use the Sample node that was set to 40%.
The first step requires us to create an Apply node. This can be found in the Component Palette under the Evaluate and Apply tab. Click on the Apply node, then move the mouse to the workflow worksheet and click near the Sample Apply node.
To connect the two nodes, move the mouse to the Sample Apply node and right click. Select Connect from the drop down menu and then move the mouse to the Apply node and click again. A connection arrow will be created joining these nodes.
Step 2 – Specify which Clustering Model to use & Output Data
Next we need to specify which of the clustering models we want to apply to our new data.
We need to connect the Cluster Build node to the Apply node. Move the mouse to the Cluster Build node, right click and select Connect from the drop down menu. Move the mouse to the Apply node and click. We get the connection arrow between the two nodes.
We now have joined the Data and the Cluster Build node to the Apply node.
The final step is to specify which clustering model we would like to use. In our scenario we are going to specify the K-Means model.
(Single) Click the Cluster Build node. We now need to use the Property Inspector to select the K-Means model for the apply set. In the Models tab of the Property Inspector we should have our two cluster models listed. Under the Output column click in the box for the O-Cluster model. We should now get a little red X mark appearing. The K-Means model should still have the green arrow under the Output column.
Step 3 – Run the Apply Node
We have one last data setup task to do on the Apply node. We need to specify what data from the apply data set we want to include in the output from the Apply node. For simplicity we want to just include the primary key, but you could include all the attributes. In addition to including the attributes from the apply data source, the Apply node will also create some attributes based on the Cluster model we selected. In our scenario, the K-Means model will create two additional attributes. One of these will contain the Cluster ID and the other will be the probability of that cluster being valid.
To include the attributes from the source data, double click on the Apply node. This will open the Edit Apply Node window. You will see that it already contains the two attributes that will be created by the K-Means model.
To add the attributes from the source data, click on the Data Columns tab and then click on the green ‘+’ symbol. For simplicity we are going to just select the CUSTOMER_ID. Click the OK button to finish.
Now we are ready to run the Apply node. To do this right click on the Apply Node and select Run from the drop down menu. When everything is finished you will get the little green tick mark on the top right hand corner of the Apply node.
Step 4 – View the Results
To view the results and the output produced by the Apply node, right click on the Apply node and select View Data from the drop down menu.
We get a new tab opened in SQL Developer that will contain the data. This will consist of the CUSTOMER_ID, the K-Means Cluster ID and the Cluster Probability. You will see that some of the clusters assigned will have a number and some will have the cluster labels that we assigned in a previous step.
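For reference, the two extra columns produced by the Apply node correspond to the in-database scoring functions; here is a hedged sketch of the SQL equivalent (covered properly in the next post).
SELECT customer_id,
       CLUSTER_ID(CLUS_KM_1_25 USING *) AS clus_id,
       CLUSTER_PROBABILITY(CLUS_KM_1_25 USING *) AS clus_prob
FROM   INSURANCE_CUST_LTV;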
It is now up to you to decide how you are going to use this clustering information in an operational or strategic way in your organisation.
In my next (fourth) blog post in the series on Clustering in Oracle Data Miner, I will show how you can perform similar steps, of building and evaluating clustering models, using the SQL and PL/SQL functions in the database. So we will not be using the ODM tool. We will be doing everything in SQL and PL/SQL.
Clustering in ODM–Part 1
This is the first part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
- This first part will look at what clustering features exist in ODM and how to set up the data that we will be using in the examples
- The second part will focus on how to build Clusters and examine the clusters produced in ODM.
- The third post will focus on using the Clusters to apply to new data using ODM.
- The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
- The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
Clustering is an unsupervised technique designed to find groupings of related data that are more similar to each other than they are to the rest of the data. Typically clustering is used in customer segmentation analysis to try to better understand what types of customers you have.
As with all data mining techniques, Clustering will not tell you or give you some magic insight into your data. Instead it gives you more information to interpret and add business meaning to. With Clustering you can explore the data that forms each cluster to understand what it really means.
The Clusters given by Oracle Data Miner are just patterns that it has found in the data.
Oracle has two Clustering algorithms:
K-Means : Oracle Data Miner runs an enhanced version of the typical k-means algorithm. ODM builds models in a hierarchical manner, using a top-down approach with binary splits and refinement of all nodes at the end. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree grows. The tree grows one node at a time. The node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached.
O-Cluster : O-Cluster is an Orthogonal Partitioning Clustering algorithm that creates a hierarchical grid-based clustering model. It operates recursively, generating a hierarchical structure. The resulting clusters define dense areas.
The Data Set for our Clustering examples
I’m going to use a data set that is available on OTN (somewhere) and has been used for demos in the versions of ODM prior to the 11gR2 version (SQL Developer 3). It has gone by many names, but the table name we are going to use is INSURANCE_CUST_LTV.
The file is in CSV format and we will use the Import feature in SQL Developer to import it.
1. In the schema you are using for Oracle Data Miner, right click Tables in the Connections tab. The Import option will appear on the menu. Select this.
2. Go to the directory where you saved the file, select it and then click on the Open button.
3. You need to set the file Format to be ‘Delimited’ and the Delimiter set to ‘|’
4. In the next step give the table name as INSURANCE_CUST_LTV
5. In the next step select all the Attributes. It should default to this. Click next.
6. In Step 4 of the Wizard you can set the data types for each attribute. The simplest way is to set the character attributes to VARCHAR2 (50) :
CUSTOMER_ID, LAST, FIRST, STATE, REGION, SEX, PROFESSION, BUY_INSURANCE (set this one to VARCHAR2(3)), MARITAL_STATUS, LTV_BIN
Set all the number attributes (all the others) to NUMBER without any precision or scale.
7. Click the next button and then the finish button. SQL Developer will now load 15,342 records into the INSURANCE_CUST_LTV table, with no errors (hopefully!)
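A quick sanity check sketch after the import, using the record count quoted above:
SELECT COUNT(*) FROM INSURANCE_CUST_LTV;
-- should return 15,342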
We are now ready to start our work with the Clustering algorithms in ODM.
In the next blog post we will look at exploring the data, building our Clustering models and examining the clusters that were produced by ODM.
OUG Norway Agenda is now live
The OUG Norway spring conference (17th April – 19th April) agenda is now live and is open for registrations.
Click here for the Conference Agenda
Click here for the Conference Registration
This is a 3 day conference. The first day (17th April) will be held in the Radisson BLU Scandinavia (Holbergsplass) and the next two (and a bit) days will be on the Color Magic boat that will be travelling between Oslo and Kiel in Germany and back to Oslo. The boat will be arriving back in Oslo on the Saturday morning (20th April).
There will be some presentations in Norwegian, but it looks like most of the presentations will be in English. There will also be some well known names from the Oracle world presenting at this conference.
In addition to these people, I will be giving two presentations on using Predictive Analytics in Oracle using the Oracle Data Miner tool and in-database functionality.
My first presentation will be an overview of the advanced analytics option and a demonstration of what you can do using the Oracle Data Miner tool (part of SQL Developer). This presentation is currently scheduled for Thursday (18th April) at 5pm.
My second presentation will be at 9:30am on the Friday morning (19th April). In this presentation we will look at the in-database features, what we can do in SQL and PL/SQL, and what you need to do to deploy your Oracle Data Mining models in a production environment.
If possible we might be able to review some new 12c features for Oracle Data Miner.
The ‘Oh No You Don’t’ of (Oracle) Data Science
Over the past couple of weeks I’ve had conversations with a large number of people about Data Science in the Oracle arena.
A few things have stood out. The first and perhaps the most important of these is that there is confusion over what Data Science actually means. Some think it is just another name for Statistics or Advanced Statistics, some Predictive Analytics or Data Mining, or Data Analysis, Data Architecture, etc. The reality is it is not. It is more than what these terms mean, and that is a topic for discussion for another day.
During these conversations the same questions or topics keep coming up and the simplest answer to all of these is taken from a Pantomime (Panto).
We need to have lots of statisticians
‘Oh No You Don’t !’
We can only do Data Science if we have Big Data
‘Oh No You Don’t !’
We can only do data mining/data science if we have 10s or 100s of millions of records
‘Oh No You Don’t !’
We need to have an Exadata machine
‘Oh No You Don’t !’
We need to have an Exalytics machine
‘Oh No You Don’t !’
We need extra servers to process the data
‘Oh No You Don’t !’
We need to buy lots of Statistical and Predictive Analytics software
‘Oh No You Don’t !’
We need to spend weeks statistically analysing a predictive model
‘Oh No You Don’t !’
We need to have unstructured data to do Data Science
‘Oh No You Don’t !’
Data Science is only for large companies
‘Oh No You Don’t !’
Data Science is very complex, I cannot do it
‘Oh No You Don’t !’
Let us all say it together for one last time ‘Oh No You Don’t’
In its simplest form, performing Data Science using the Oracle stack just involves learning and using some simple SQL and PL/SQL functions in the database.
Maybe we (in the Oracle Data Science world and those looking to get into it) need to adopt a phrase used by Barack Obama, ‘Yes We Can’, or as he said it in Irish when he visited Ireland back in 2011, ‘Is Féidir Linn’.
Remember it is just SQL.
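To make the point concrete, here is a hedged one-line sketch; the model name (DEMO_CLASS_MODEL) and table are purely illustrative stand-ins for a model you have built in your own schema.
SELECT customer_id,
       PREDICTION(demo_class_model USING *) AS likely_outcome
FROM   insurance_cust_ltv;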
Association Rules in ODM-Part 4
This is the final part of a four part blog post on building and using Association Rules in the Oracle Database using Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules.
- The first part focused on how to build an Association Rule model
- The second post was on examining the Association Rules produced by ODM
- The third post focused on using the Association Rules on your data.
- The final post (this blog post) will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions.
In my previous posts I showed how you can go about setting up for Association Rule analysis in Oracle Data Miner and how to examine the rules that are generated.
This post will focus on how we build and use association rules using the functionality that is available in SQL and PL/SQL.
Step 1 – Build the Settings Table
As with all Oracle Data Mining functions in SQL and PL/SQL you will need to setup or build a settings table. This table contains all the settings necessary to run the model build functions. It is a good idea to create a separate settings table for each model build that you complete.
CREATE TABLE assoc_sample_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(4000));
Step 2 – Define the Settings for the Model
Before you generate your model you need to set some of the parameters for the algorithm. To start with, you need to define that we are going to generate an Association Rules model and turn off Automatic Data Preparation.
We can also set 3 additional settings for Association Rules.
The ASSO_MIN_SUPPORT setting has a default of 0.1 or 10%. That means that only rules that exist in 10% or more of the cases will be generated. This figure is really too high. In the code below we will set this to 1%. This matches the settings that we used in SQL Developer in my previous posts.
BEGIN
INSERT INTO assoc_sample_settings (setting_name, setting_value) VALUES
(dbms_data_mining.algo_name, dbms_data_mining.ALGO_APRIORI_ASSOCIATION_RULES);
INSERT into assoc_sample_settings (setting_name, setting_value) VALUES
(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off);
INSERT into assoc_sample_settings (setting_name, setting_value) VALUES
(dbms_data_mining.ODMS_ITEM_ID_COLUMN_NAME, 'PROD_ID');
INSERT into assoc_sample_settings (setting_name, setting_value) VALUES
(dbms_data_mining.ASSO_MIN_SUPPORT, 0.01);
COMMIT;
END;
/
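As with the clustering settings earlier on this page, we can check that these records were inserted.
column setting_value format a40
select * from assoc_sample_settings;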
Step 3 – Prepare the Data
In our example scenario we are using the SALES data that is part of the SH schema. The CREATE_MODEL procedure needs an attribute (CASE_ID) that identifies the key of the shopping basket. In our case we have two attributes, so we will need to use a combined key. This combined key consists of the CUST_ID and the TIME_ID. This links all the transaction records related to the one shopping event together.
We also just need the attribute that holds the information we need. In our Association Rules (Market Basket Analysis) scenario, we will need to include the PROD_ID attribute. This contains the product key of each product that was included in the basket.
CREATE VIEW ASSOC_DATA_V AS (
SELECT RANK() OVER (ORDER BY CUST_ID, TIME_ID) CASE_ID,
t.PROD_ID
FROM SH.SALES t );
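An optional sanity check sketch on the view; the counts should roughly match the figures quoted in Step 4 below (just over 918K rows and just over 143K baskets).
SELECT COUNT(*) num_rows,
       COUNT(DISTINCT case_id) num_baskets
FROM   ASSOC_DATA_V;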
Step 4 – Create the Model
We will need to use the DBMS_DATA_MINING.CREATE_MODEL procedure. This will use the settings in our ASSOC_SAMPLE_SETTINGS table. We will use the view created in Step 3 above, with the CASE_ID attribute we created as the Case ID in the function call.
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'ASSOC_MODEL_2',
mining_function => DBMS_DATA_MINING.ASSOCIATION,
data_table_name => 'ASSOC_DATA_V',
case_id_column_name => 'CASE_ID',
target_column_name => null,
settings_table_name => 'assoc_sample_settings');
END;
/
On my laptop this took approximately 5 seconds to run on just over 918K records involving just over 143K cases or baskets.
Now that is quick!!!
Step 5 – View the Model Outputs
There are a couple of functions that can be used to extract the rules produced in our previous step. These include:
GET_ASSOCIATION_RULES : This returns the rules from an association model.
SELECT rule_id,
antecedent,
consequent,
rule_support,
rule_confidence
FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('assoc_model_2', 10));
The 10 here returns the top 10 records or rules.
GET_FREQUENT_ITEMSETS : returns a set of rows that represent the frequent item sets from an association model. In the following code we want the top 30 item sets to be returned, but filtered to only display item sets that contain 2 or more items.
SELECT itemset_id,
items,
support,
number_of_items
FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('assoc_model_2', 30))
WHERE number_of_items >= 2;
Association Rules in ODM–Part 2
This is the second part of a four part blog post on building and using Association Rules in Oracle Data Miner. The following outlines the contents of each post in the series on Association Rules.
- The first part focused on how to build an Association Rule model
- The second post will be on examining the Association Rules produced by ODM – This blog post
- The third post will focus on using the Association Rules on your data.
- The final post will look at how you can do some of the above steps using the ODM SQL and PL/SQL functions.
In the previous post I looked at the steps needed to setup a data source and to setup the Association Rule node. When everything was setup we ran the workflow.
Step 1 – Viewing the Model
When the workflow has finished running we will have the green tick marks on each node. This is where we left things at the end of the previous post (Part 1). To view the model details, right click on the Association Rule node and select View Models from the menu.
There are 3 main concepts that are important in relation to Association Rules:
- Support: is the proportion of transactions in the data set that contain the item set i.e. the number of times the rule occurs
- Confidence: is the proportion of the occurrences of the antecedent that result in the consequent e.g. how many times do we get C when we have A and B {A, B} => C
- Lift: indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent
Support and Confidence are the primary measures used to assess the usefulness of an association rule.
In our example we can see that the antecedent and the consequent have numbers separated by the word AND. These numbers correspond to the product numbers.
Step 2 – Examining the Model Rules
To read the antecedent and the consequent for the first rule in our example we have:
Antecedent: 137 AND 143 AND 128
Consequent: 144
To read this association rule we would say that if a Customer bought product 137, product 143 and product 128, then they also bought product 144, with a Confidence value of almost 71%. This is a strong association.
We can check the ordering of the rules by changing the Sort By criteria. As Confidence and Support are the main ways to evaluate the rules, we can change the Sort By criteria to be Confidence. Then click on the Query button to refresh the rules section.
Here we get a list of the strongest rules, listed in descending order.
Below the section of the screen that has the Rules, we have the Rule Details section.
Here we can see that the rule gets formatted into an IF statement. The first rule in the list has a confidence of almost 97%. As it is a simple IF statement it can be easily implemented in our applications.
We can use the information that these rules provide in a number of ways. One consequence of these rules is that we can look at improving the ordering and distribution of these products to ensure that we have sufficient numbers of each. Another is that we can enhance the front end selling mechanism to make sure that if a customer is buying products 114, 118 and 115 then we can remind the customer of product 119. We can also ensure that all these products are not located beside each other, so that the customer will have to walk past many other products in order to find them. That is why we never see milk and bread beside each other in a grocery store.
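As an illustrative sketch of acting on the first rule above (products 137, 143 and 128 implying product 144), we could identify candidate baskets in SQL. ASSOC_DATA_V here is the transaction view built in the SQL-based post earlier on this page.
SELECT case_id
FROM   ASSOC_DATA_V
WHERE  prod_id IN (137, 143, 128)
GROUP  BY case_id
HAVING COUNT(DISTINCT prod_id) = 3;
-- each case_id returned contains all three antecedent products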
Step 3 – Applying Filters to the Model Rules
In the previous step we were able to sort our rules based on some of the measures of our Association Rules and to see how these rules are structured.
Association Rule Analysis can generate many thousands of possible rules for a small data set. In some cases similar rules can appear, and we can have lots of rules that occur so infrequently that they are perhaps meaningless.
ODM provides us with a number of filters that we can apply to the rules, enabling us to look for the rules that are of most interest to us. We can access these filters by clicking on the More button, which is located just under the Query button.
We can refine our query on the rules based on the various measures and the number of items in the rule. In addition to this we can also filter based on the values of the items. This is particularly useful if we want to concentrate on specific items (in our example, Products). To illustrate this, let us focus on the rules that involve Product 115. Click on the green + symbol on the right hand side of the window. Select 115 from the list provided. Next we need to decide if we want Product 115 involved in the Antecedent or the Consequent. In our example, select the Consequent. This is located at the bottom right of the window. Then click the OK button and then click on the Query button to update the list of rules that correspond with the new filter.
We can see that we only have rules that have Product 115 in the Consequent column.
We can also see that we have 134 rules for this scenario out of a total of 20,988 (your results might differ slightly from mine and that’s OK. It really depends on what version of the sample data you are using).
Check out the next post in the series (Part 3) where we will look at how you can use the Association Rules produced by ODM.
Accepted for BIWA Summit–9th to 10th January
I received an email today to say that I had a presentation accepted for the BIWA Summit. This conference will be in the Sofitel Hotel beside the Oracle HQ in Redwood City.
The title of the presentation is “The Oracle Data Scientist” and the abstract is
Over the past 18 months we have seen a significant increase in the demand for Data Scientists. But how does someone become a data scientist? If we examine the requirements and job descriptions of this role, we can see that being able to understand and process data are fundamental skills. So an Oracle developer is ideally suited to becoming a Data Scientist. The presentation will show how an Oracle developer can evolve into a data scientist through a number of stages, including BI developer, OBIEE developer, statistical analysis, data miner and data scientist. The tasks and tools will be discussed and explored through each of these roles. The second half of the presentation will focus on the data mining functionality available in SQL and PL/SQL. This will consist of a demonstration of an Analytics Development environment and how you can migrate (and use) your models in a Production environment.
For some reason Simon Cowell of XFactor fame kept on popping into my head and it now looks like he will be making an appearance in the presentation too. You will have to wait until the conference to find out what Simon Cowell and Being an Oracle Data Scientist have in common.
Check out the BIWA Summit website for more details and to register for the event.
I’ll see you there!
Events for Oracle Users in Ireland-November 2012
November (2012) is going to be a busy month for Oracle users in Ireland. There is a mixture of Oracle User Group events, with Oracle Day and the OTN Developer Days. To round off the year we have the UKOUG Conference during the first week in December.
Here are the dates and web links for each event.
Oracle User Group
The BI & EPM SIG will be having their next meeting on the Tuesday 20th November. This is almost a full day event, with presentations from End Users, Partners and Oracle product management. The main focus of the day will be on EPM, but will also be of interest to BI people.
As with all SIG meetings, this SIG will be held in the Oracle office in East Point (Block H). Things kick off at 9am and are due to finish around 4pm with plenty of tea/coffee and a free lunch too.
Remember to follow OUG Ireland on twitter using #oug_ire
Oracle Day
Oracle will be having their Oracle Day 2012, on Thursday 15th, in Croke Park. Here is some of the blurb about the event, “…to learn how Oracle simplifies IT, whether it’s by engineering hardware and software to work together or making new technologies work for the modern enterprise. Sessions and keynotes feature an elite roster of Oracle solutions experts, partners and business associates, as well as fascinating user case studies and live demos.”
This is a full day event from 9am to 5pm with 3 parallel streams focusing on Big Data, Enterprise Applications and the Cloud.
Click here to register for this event.
Click here for the full details and agenda.
OTN Developer Days
Oracle run their developer days about 3 times a year in Dublin. These events are run like a Hands-on Lab. So most of the work during the day is by yourself. You are provided with a workbook, a laptop and a virtual machine configured for the hands-on lab. This November we have the following developers days in the Oracle office in East Point, Dublin.
Tuesday 27th November (9:45-15:00) : Real Application Testing
Wednesday 28th November (9:00-14:00) : Partitioning/Advanced Compression
Thursday 29th November (9:15-13:30) : Database Security
Friday 30th November (9:45-16:00) : Business Process Management Using BPM Suite 11g
As you can see we have almost a full week of FREE training from Oracle. So there is no reason not to sign up for these days.
UKOUG Conference – in Birmingham
In December we have the annual UKOUG Conference. This is the largest Oracle User Group conference in Europe and the largest outside of the USA. At this conference you will have some of the main speakers and presentations from Oracle Open World, along with a range of speakers from all over the world.
In keeping with previous years there will be the OakTable Sunday and new this year there will be a Middleware Sunday. You need to register separately for these events. Here are the links
The main conference kicks off on the Monday morning with a very full agenda for Monday, Tuesday and Wednesday. There are a number of social events on the Monday and Tuesday, so come well rested.
On the Monday evening there is the focus pubs event. This year it seems to have an Irish Pub theme. At the focus pubs event there will be a table for each of the user group SIGs.
Come and join me at the Ireland table on the Monday evening.
The full agenda is now live and you can get all the details here.
I will be giving a presentation on the Tuesday afternoon titled Getting Real Business Value from Predictive Analytics (OBIEE and Oracle Data Mining). This is a joint presentation with Antony Heljula of Peak Indicators.
Oracle Advanced Analytics Option in Oracle 12c
At Oracle Open World a few weeks ago there were a large number of presentations on Big Data and Analytics. Most of these were marketing type presentations, with a couple of presentations on using R and how it can now be integrated into the Oracle Database 11.2.
In addition to these there was one presentation that focused on the Oracle Advanced Analytics (OAA) Option.
The Oracle Advanced Analytics Option covers the Oracle Data Mining features and the Oracle R Enterprise features in the Database.
The purpose of this blog post is to outline and summarise what was mentioned at these presentations, including what changes are/may be coming in the “Next Release” of the database, i.e. Oracle 12c.
Health Warning: As with all the presentations at OOW that talked about what may be in the next release, there is no guarantee that the features will actually be in the release version of the database. Here is the slide that gives the Safe Harbor statement.
- 12c will come with R embedded into it. So there will be no need for any configurations.
- Oracle R client will come as part of the server install.
- Oracle R client will be able to use the Analytics functions that exist in the database.
- Will be able to run R code in the database.
- The database (12c) will be able to spawn multiple R engines.
- Will be able to emulate map-reduce style algorithms.
- There will be a new PREDICTION function, replacing the existing (11g) functionality. This will combine a number of the steps of building a model and applying it to the data to be scored into one function. But we will still need the functionality of the existing PREDICTION function that is in 11g, so it will be interesting to see how that functionality will be kept in addition to the new functionality being proposed in 12c (a sketch of the existing usage appears after this list).
- Although the Oracle Data Miner tool will still exist and will have many new features, it was also referred to as the ‘OAA Workflow’. Does this indicate a potential name change? We will have to wait and see.
- Oracle Data Miner will come with a new additional graphing feature. This will be in addition to the Explore Node and will allow us to produce more typical attribute related graphs. From what I could see these would be similar to the type of box plot, scatter, bar chart, etc. graphs that you can get from R.
- There will be a number of new algorithms too, including a useful One Class Support Vector Machine. This can be used when we have a data set with just one class value. This algorithm will work out which records/cases are more important than others.
- There will be a new SQL node. This will allow us to write our own data transformation code.
- There will be a new node to allow the calling of R code.
- The tool also comes with a slightly modified layout and colour scheme.
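For reference, here is a hedged sketch of the existing 11g-style scoring that the proposed 12c PREDICTION function is said to streamline; the model and table names are hypothetical.
SELECT customer_id,
       PREDICTION(clas_dt_model USING *) AS predicted_class,
       PREDICTION_PROBABILITY(clas_dt_model USING *) AS prob
FROM   insurance_cust_ltv;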
Again, the points that I have given above are just my observations. They may or may not appear in 12c, or maybe I misunderstood what was being said.
It certainly looks like we will have an integrated analytics environment in 12c, with full integration of R and the ODM in-database features.
Extracting the rules from an ODM Decision Tree model
One of the most interesting and important aspects of a Decision Tree model is that we as users get to see what rules the machine learning algorithm has generated for our data.
I’ve given a number of examples in various blog posts over the past few years on how to generate a number of classification models. An example of the workflow is below.
In the Class Build node we get four models being generated. These include a Generalised Linear Model, Support Vector Machine, Naive Bayes and a Decision Tree model.
We can explore the Decision Tree model by right clicking on the Class Build Node, selecting View Models and then the Decision Tree model, which will be labelled with a ‘DT’ in the name.
As we explore the nodes and branches of the Decision Tree we can see the rule that was generated for a node in the lower pane of the application. So by clicking on each node we get a different rule appearing in this pane.
Sometimes there is a need to extract these rules so that they can be presented to a number of different types of users, to explain to them what is going on.
How can we extract the Decision Tree rules?
To do this, you will need to complete the following steps:
- From the Models section of the Component Palette select the Model Details node.
- Click on the Workflow pane and the Model Details node will be created
- Connect the Class Build node to the Model Details node. To do this right click on the Class Build node and select Connect. Then move the mouse to the Model Details node and click. The two nodes should now be connected.
- Edit the Model Details node, uncheck the Auto Settings, select Model Type to be Decision Tree, Output to be Full Tree and all the columns.
- Run the Model Details node. Right click on the node and select Run. When complete you will have the little green box with a tick mark on the top right hand corner.
- To view the details produced, right click on the Model Details node and select View Data
- The rules for each node will now be displayed. You will need to scroll to the right of this pane to get to the rules, and you will need to expand the columns for the rules to see the full details.
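Alternatively, a hedged SQL sketch: the full tree, rules included, can also be extracted as PMML XML using GET_MODEL_DETAILS_XML (the model name here is hypothetical).
SELECT dbms_data_mining.get_model_details_xml('CLAS_DT_MODEL') AS model_xml
FROM   dual;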
My Presentations on Oracle Advanced Analytics Option
I’ve recently compiled my list of presentations on the Oracle Advanced Analytics Option. All these presentations are for a 45 minute period.
I have two versions of the presentation ‘How to do Data Mining in SQL & PL/SQL’; one is for 45 minutes and the second version is for 2 hours.
I have given most of these presentations at conferences or SIGs.
Let me know if you are interested in having one of these presentations at your SIG or conference.
- Oracle Analytics Option – 12c New Features – available 2013
- Real-time prediction in SQL & Oracle Analytics Option – Using the 12c PREDICTION function – available 2013
- How to do Data Mining in SQL & PL/SQL
- From BIG Data to Small Data and Everything in Between
- Oracle R Enterprise : How to get started
- Oracle Analytics Option : R vs Oracle Data Mining
- Building Predictive Analytics into your Forms Applications
- Getting Real Business Value from OBIEE and Oracle Data Mining (This is a cut down and merged version of the following two presentations)
- Getting Real Business Value from OBIEE and Oracle Data Mining – Part 1 : The Oracle Data Miner part
- Getting Real Business Value from OBIEE and Oracle Data Mining – Part 2 : The OBIEE part
- How to Deploy and Use your Oracle Data Miner Models in Production
- Oracle Analytics Option 101
- From SQL Programmer to Data Scientist: evolving roles of an Oracle programmer
- Using an Oracle Data Mining Model in SQL & PL/SQL
- Getting Started with Oracle Data Mining
- You don’t need a PhD to do Data Mining
Check out the ‘My Presentations’ page for updates on new presentations.
