In a previous blog post I gave some details of how you can examine what lies behind a prediction made using a classification model. This seemed to spark a lot of interest. But before I come back to looking at classification prediction details and other information, this blog post is the first in a 4 part series on examining the details of Clusters, as identified by a cluster model created using Oracle Data Mining.
The 4 blog posts will consist of:
- 1 – (this blog post) will look at how to determine the predicted cluster and cluster probability for your record.
- 2 – will show you how to examine the details that were used to predict the cluster.
- 3 – A record could belong to many clusters. In this blog post we will look at how you can determine what clusters a record can belong to.
- 4 – Cluster distance is a measure of how far the record is from the cluster centroid. As a data point or record can belong to many clusters, it can be useful to know the distances as you can build logic to perform different actions based on the cluster distances and cluster probabilities.
Right. Let’s have a look at the first set of these cluster functions. These are CLUSTER_ID and CLUSTER_PROBABILITY.
CLUSTER_ID : Returns the number of the cluster that the record most closely belongs to. This is measured by the distance from the record to the centroid of the cluster. A data point or record can belong to, or be part of, many clusters. So the CLUSTER_ID is the cluster number that the data point or record most closely belongs to.
CLUSTER_PROBABILITY : Is a measure of the likelihood that the data point or record belongs to a cluster. The cluster with the highest probability score is the cluster that is returned by the CLUSTER_ID function.
Now let us have a quick look at the SQL for these two functions. This first query returns the cluster number that each record most strongly belongs to.
SELECT customer_id, cluster_id(clus_km_1_37 USING *) as Cluster_Id FROM insur_cust_ltv_sample WHERE customer_id in ('CU13386', 'CU6607', 'CU100');
Now let us add in the cluster probability function.
SELECT customer_id, cluster_id(clus_km_1_37 USING *) as Cluster_Id, cluster_probability(clus_km_1_37 USING *) as cluster_Prob FROM insur_cust_ltv_sample WHERE customer_id in ('CU13386', 'CU6607', 'CU100');
These functions give us some insights into what the cluster predictive model is doing. In the remaining blog posts in this series I will look at how you can delve deeper into the predictions that the cluster algorithm is making.
When building predictive models the data scientist can spend a large amount of time examining the models produced and how they work and perform on their hold out sample data sets. They do this to understand if the model gives a good general representation of the data and can identify/predict many different scenarios. When the “best” model has been selected it is typically deployed in some sort of reporting environment, where a list is produced. This is a typical deployment method but is far from being ideal. A better deployment method is to have the predictive models built into the everyday applications that the company uses. For example, a model is built into the call centre application, so that the staff have live and real-time feedback and predictions as they are talking to the customer.
But what kind of live and real-time feedback and predictions are possible? Again, if we look at what is traditionally done in these applications, they will get a predicted outcome (will they be a good customer or a bad customer) or some indication of their value (maybe lifetime value, possible claim payout value), etc.
But can we get any more information? Information like what was the reason for the prediction. This is sometimes called prediction insight. Can we get some details of what the prediction model used to decide on the predicted value? In most predictive analytics products this is not possible, as all you are told is the final outcome.
What would be useful is to know some of the thinking that the predictive model used to make its prediction. The reasons why one customer may be a “bad customer” might be different to those of another customer. Knowing this kind of information can be very useful to the staff who are dealing with the customers. Those who design the workflows can then build more advanced workflows to support the staff when dealing with the customers.
Oracle has a unique feature that allows us to see some of the details that the prediction model used to make the prediction. This function (based on using the Oracle Advanced Analytics option and Oracle Data Mining to build your predictive model) is called PREDICTION_DETAILS.
When you go to use PREDICTION_DETAILS you need to be careful as it works differently in the 11g Release 2 and 12c versions of the Oracle Database (Enterprise Editions). In Oracle Database 11g Release 2 the PREDICTION_DETAILS function would only work for Decision Tree models. But in 12c (and above) it has been opened up to include details for models created using all the classification algorithms, all the regression algorithms and also for anomaly detection.
The following gives an example of using the PREDICTION_DETAILS function.
select cust_id, prediction(clas_svm_1_27 using *) pred_value, prediction_probability(clas_svm_1_27 using *) pred_prob, prediction_details(clas_svm_1_27 using *) pred_details from mining_data_apply_v;
The PREDICTION_DETAILS function produces its output in XML, and this consists of the attributes used and their values that determined why a record had the predicted value. The following gives some examples of the XML produced for some of the records.
I’ve used this particular function in lots of my projects, particularly when building the applications for a particular business unit. Oracle too has built this functionality into many of their applications. The images below are from the HCM application where you can examine the details of why an employee may or may not leave/churn. You can then perform real-time what-if analysis by changing some of the attribute values to see if the predicted outcome changes.
After a bit of a long wait Oracle have finally released Oracle Data Visualization for the desktop. The desktop version of this tool is only available for Windows desktops at the moment. I’m sure Oracle will be bringing out versions for other OSs soon (I hope).
To get your hands on Oracle Data Visualization go to the following OTN webpage (click on this image)
After downloading has finished, you can run the installer.
When the Oracle Installer opens you will be prompted to enter the required details or to accept the defaults, as outlined below.
- Installation Location : Decide where you are going to have the Oracle Data Visualization tool installed on your desktop. The default location is C:\Program Files\Oracle Data Visualization Desktop. Click Next.
- Options : There are 2 check boxes for ‘Create desktop shortcut’ and ‘Deploy samples’. Leave both of these checked, as you will probably want these. Click Next.
- Summary : Lists a summary of the installation. There is nothing really for you to do here, so click on the Install button.
- Progress : You can then sit back and monitor the progress of the installation. The installation took about 4 minutes on my small Windows VM.
When the installation is complete you can now fire up Oracle Data Visualization and enjoy. If you have just installed the tool it will automatically be started for you.
When the tool has finished all the configurations that it needs to do, it will open with the following window, which shows some sample projects to give you an idea of some of the things that are possible.
For more details on the tool and on the Oracle Cloud hosted version click on the following image to get to the Oracle webpage for the product.
When it comes to working with R to access and process your data there are a number of little features and behaviours you need to look out for.
One of these is the DATE datatype.
The main issue that you have to look out for is the TIMEZONE conversion that happens when you extract the data from the database into your R environment.
There is a datatype conversion from the Oracle DATE into the POSIXct format. The POSIXct datatype also includes the timezone. But the Oracle DATE datatype does not have a timezone part.
When you look into this a bit more you will see that the main issue is what timezone your R session has. By default your R session will inherit the OS session timezone. For me here in Ireland we have the same timezone as the UK. You would think that the timezone would therefore be GMT. But this is not the case. What we have for timezone is BST (British Summer Time), which takes daylight saving time into account. So on the 26th May, BST is one hour ahead of GMT.
This issue is not just between R and Oracle, as there are some inconsistencies in some of Oracle’s other tools like SQL Developer, SQLcl and SQL*Plus.
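As a quick check of the point above, here is a minimal base R sketch that shows what timezone your session has picked up and how a GMT/UTC instant is displayed on the UK/Ireland clock. The year (2016), the midday time and the "Europe/London" zone name are my assumptions for illustration only.

```r
# Which timezone has the R session inherited from the OS?
session_tz <- Sys.timezone()
print(session_tz)

# An instant defined in UTC/GMT (26th May; the year 2016 is an assumption)
midday_utc <- as.POSIXct("2016-05-26 12:00:00", tz = "UTC")

# The same instant shown on the UK/Ireland clock, with the zone abbreviation
uk_view <- format(midday_utc, "%Y-%m-%d %H:%M:%S",
                  tz = "Europe/London", usetz = TRUE)
print(uk_view)   # "2016-05-26 13:00:00 BST"
```

The one hour difference between the two values is the daylight saving offset described above.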
SQLcl & SQL Developer
OK. Let’s get back to the issue we see when using R to analyse our data.
As mentioned above, when I select date of type DATE from Oracle into R, using ROracle, I end up getting a different date value than what was in the database. Similarly when I process and store the data.
The following outlines the data setup and some of the R code that was used to generate the issue/problem.
Create a table that contains a DATE field and insert some records.
CREATE TABLE STAFF (STAFF_NUMBER VARCHAR2(20), FIRST_NAME VARCHAR2(20), SURNAME VARCHAR2(20), DOB DATE, PROG_CODE VARCHAR2(6 BYTE), PRIMARY KEY (STAFF_NUMBER)); insert into staff values (123456789, 'Brendan', 'Tierney', to_date('01/06/1975', 'DD/MM/YYYY'), 'DEPT_1'); insert into staff values (234567890, 'Sean', 'Reilly', to_date('21/10/1980', 'DD/MM/YYYY'), 'DEPT_2'); insert into staff values (345678901, 'John', 'Smith', to_date('12/03/1973', 'DD/MM/YYYY'), 'DEPT_3'); insert into staff values (456789012, 'Barry', 'Connolly', to_date('25/01/1970', 'DD/MM/YYYY'), 'DEPT_4');
You can query this data in SQL without any problems. As you can see there is no timezone element to these dates.
Selecting the data
I now establish my connection to my schema in my 12c database using ROracle. I won’t bore you with the details here of how to do it but check out point 3 on this post for some details.
When I select the data I get the following.
> res <- dbSendQuery(con, "select * from staff")
> data <- fetch(res)
> data$DOB
[1] "1975-06-01 01:00:00 BST" "1980-10-21 01:00:00 BST" "1973-03-12 00:00:00 BST"
[4] "1970-01-25 01:00:00 BST"
As you can see two things have happened to my date data when it has been extracted from Oracle. Firstly it has assigned a timezone to the data, even though there was no timezone part in the original data. Secondly it has performed some sort of timezone conversion from GMT to BST. The difference between GMT and BST is the daylight saving time. Hence the 01:00:00 being added to the time element that was extracted. This time should have been 00:00:00. You can see we have a mixture of times!
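The shift can be reproduced without a database. The following is only a sketch in base R: it assumes the midnight DATE values are being treated as GMT/UTC instants and then displayed in a UK session timezone ("Europe/London" is my assumption for the zone name), which is what the output above suggests.

```r
# The four DOB values from the STAFF table, taken as midnight in UTC/GMT
dob_utc <- as.POSIXct(c("1975-06-01", "1980-10-21", "1973-03-12", "1970-01-25"),
                      tz = "UTC")

# Time-of-day of each value when shown on the UK clock
hours_uk <- format(dob_utc, "%H:%M:%S", tz = "Europe/London")

# Side by side: what was stored versus what a UK session displays
comparison <- data.frame(
  stored    = format(dob_utc, "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  displayed = format(dob_utc, "%Y-%m-%d %H:%M:%S",
                     tz = "Europe/London", usetz = TRUE)
)
print(comparison)
```

The June 1975 date comes back as 01:00:00, while the March 1973 date stays at 00:00:00 because the clocks had not yet gone forward on that day. As far as I know the January 1970 date also shifts because the UK kept its clocks one hour ahead all year round between 1968 and 1971, which would explain the mixture of times.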
So there appears to be some difference between the R date or timezone to what is being used in Oracle.
To add to this problem I was playing around with some dates and different records. I kept on getting this scenario but I also got the following, where we have a mixture of GMT and BST times and timezones. I’m not sure why we would get this mixture.
[1] "1995-01-19 00:00:00 GMT" "1965-06-20 01:00:00 BST" "1973-10-20 01:00:00 BST"
[4] "2000-12-28 00:00:00 GMT"
This is all a bit confusing and annoying. So let us look at how you can now fix this.
Fixing the problem : Setting Session variables
To fix this, and to ensure that there is consistency between what is in Oracle and what is read out and converted into R (POSIXct) format, you need to define two R session variables. These session variables are used to ensure consistency in the date and time conversions.
These session variables are TZ, for the R session timezone setting, and ORA_SDTZ, for specifying the timezone to be used for your Oracle connections.
The trick here is that these session variables need to be set before you create your ROracle connection. The following is the R code to set these session variables.
> Sys.setenv(TZ = "GMT")
> Sys.setenv(ORA_SDTZ = "GMT")
So you really need to have some knowledge of what kind of dates you are working with in the database, and whether a timezone is part of them or is important. Alternatively you could set the above variables to UTC.
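To see what the TZ setting does on its own, here is a small base R sketch that does not need a database connection. The helper show_local() and the starting value "Europe/London" are my own assumptions for illustration; ORA_SDTZ is only read by the Oracle client, so setting it here has no visible effect in base R.

```r
# One instant: midnight GMT/UTC on 1st June 1975
midnight <- as.POSIXct("1975-06-01 00:00:00", tz = "UTC")

# Hypothetical helper: display a value in the current session timezone
show_local <- function(x) {
  format(x, "%Y-%m-%d %H:%M:%S", tz = "", usetz = TRUE)
}

# Before: a session running on UK time
Sys.setenv(TZ = "Europe/London")
before <- show_local(midnight)

# After: the two settings described above
Sys.setenv(TZ = "GMT")
Sys.setenv(ORA_SDTZ = "GMT")
after <- show_local(midnight)

print(before)   # "1975-06-01 01:00:00 BST"
print(after)    # "1975-06-01 00:00:00 GMT"
```

With both variables set to GMT the session no longer adds the one hour daylight saving offset when it displays the value.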
Selecting the data (correctly this time)
Now when we select our data from our table in our schema we now get the following, after reconnecting or creating a new connection to your Oracle schema.
[1] "1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"
Now you can see we do not have any time element to the dates and this is correct in this example. So all is good.
We can now update the data and do whatever processing we want with the data in our R script.
But what happens when we save the data back to our Oracle schema? In the following R code we will add 2 days to the DOB attribute and then create a new table in our schema to save the updated data.
I’ve used the R package lubridate to do the date and time processing.
> library(lubridate)
> data$DOB
[1] "1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"
> data$DOB <- data$DOB + days(2)
> data$DOB
[1] "1975-06-03 GMT" "1980-10-23 GMT" "1973-03-14 GMT" "1970-01-27 GMT"
> dbWriteTable(con, "STAFF_2", data, overwrite = TRUE, row.names = FALSE)
When we look at this newly created table in our Oracle schema we will see that we don’t have a DATE datatype for DOB, but instead it is created using a TIMESTAMP data type.
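If you want to check the date arithmetic before writing anything back to the database, the same two day shift can be sketched in base R. This is not the lubridate code used above; it swaps in as.difftime() so that it runs without any extra packages, and the dob vector is just a stand-in for data$DOB.

```r
# Stand-in for data$DOB: the four dates at midnight GMT
dob <- as.POSIXct(c("1975-06-01", "1980-10-21", "1973-03-12", "1970-01-25"),
                  tz = "GMT")

# Add two days using a base R difftime instead of lubridate's days()
dob_plus2 <- dob + as.difftime(2, units = "days")

# Show only the date part, in GMT
shifted <- format(dob_plus2, "%Y-%m-%d", tz = "GMT")
print(shifted)   # "1975-06-03" "1980-10-23" "1973-03-14" "1970-01-27"
```

The result is still a POSIXct value, which is why the column ends up with a time element when it is written to the database.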
If you are working with TIMESTAMP and similar data types (i.e. data types that have a timezone element as part of them) then that is a slightly different problem. Perhaps one that I’ll look at soonish.