PMML (Predictive Model Markup Langauge) is an XML formatted output that defines the core elements and settings for your Predictive Models. This XML formatted output can be used to migrate your models from one data mining or predictive modelling tool to another data mining or predictive modelling tool, such as Oracle.
Using PMML to migrate your models from one tool to another allows for you to use the most appropriate tools for developing your models and then allows them to be imported into another tool that will be used for deploying your predictive models in batch or real-time mode. In particular the ability to use your Predictive Model within your everyday applications enables you to work in the area of Automatic or Prescriptive Analytics. Oracle Data Mining and the Oracle Database are ideal or even the best possible tools to allow for Automatic and Prescriptive Analytics for your transa
PMML is an XML based standard specified by the Data Mining Group
Oracle Data Mining supports the importing of PMML models that are compliant with version 3.1 of the standard and for Regression Models only. The regression models can be for linear regression or binary logistic regression.
The Data Mining Group Archive webpage have a number of sample PMML files for you to download and then to load into your Oracle database.
To Load the PMML file into your Oracle Database you can use the DBMS_DATA_MINING.IMPORT_MODEL function. I’ve given examples of how you can use this function to import an Oracle Data Mining model that was exported using the EXPORT_MODEL function.
The syntax of the IMPORT_MODEL function when importing a PMML file is the following
DBMS_DATA_MINING.IMPORT_MODEL ( model_name IN VARCHAR2, pmmldoc IN XMLTYPE strict_check IN BOOLEAN DEFAULT FALSE);
The following example shows how you can load the version 3.1 Logistic Regression PMML file from the Data Mining Group archive webpage
BEGIN dbms_data_mining.IMPORT_MODEL (‘PMML_MODEL', XMLType (bfilename (‘IMPORT_DIR', 'sas_3.1_iris_logistic_reg.xml'), nls_charset_id ('AL32UTF8') )); END;
This example uses the default value for STRICT_CHECK as FALASE. In this case if there are any errors in the PMML structure then these will be ignored and the imported model may contain “features” that may make it perform in a slightly odd manner.
When you are working with and developing Decision Trees by far the easiest way to visualise these is by using the Oracle Data Miner (ODMr) tool that is part of SQL Developer.
Developing your Decision Tree models using the ODMr allows you to explore the decision tree produced, to drill in on each of the nodes of the tree and to see all the statistics etc that relate to each node and branch of the tree.
But when you are working with the DBMS_DATA_MINING PL/SQL package and with the SQL commands for Oracle Data Mining you don’t have the same luxury of the graphical tool that we have in ODMr. For example here is an image of part of a Decision Tree I have and was developed using ODMr.
What if we are not using the ODMr tool? In that case you will be using SQL and PL/SQL. When using these you do not have luxury of viewing the Decision Tree.
So what can you see of the Decision Tree? Most of the model details can be used by a variety of functions that can apply the model to your data. I’ve covered many of these over the years on this blog.
For most of the data mining algorithms there is a PL/SQL function available in the DBMS_DATA_MINING package that allows you to see inside the models to find out the settings, rules, etc. Most of these packages have a name something like GET_MODEL_DETAILS_XXXX, where XXXX is the name of the algorithm. For example GET_MODEL_DETAILS_NB will get the details of a Naive Bayes model. But when you look through the list there doesn’t seem to be one for Decision Trees.
Actually there is and it is called GET_MODEL_DETAILS_XML. This function takes one parameter, the name of the Decision Tree model and produces an XML formatted output that contains the attributes used by the model, the overall model settings, then for each node and branch the attributes and the values used and the other statistical measures required for each node/branch.
The following SQL uses this PL/SQL function to get the Decision Tree details for model called CLAS_DT_1_59.
If you are using SQL Developer you will need to double click on the output column and click on the pencil icon to view the full listing.
Nothing too fancy like what we get in ODMr, but it is something that we can work with.
If you examine the XML output you will see references to PMML. This refers to the Predictive Model Markup Language (PMML) and this is defined by the Data Mining Group (www.dmg.org). I will discuss the PMML in another blog post and how you can use it with Oracle Data Mining.
In my previous blog post I showed you how you can have a look at the transformations that the Automatic Data Preparation (ADP) feature of Oracle Data Mining produces. I also gave some example of the different types of ADF that are performed for different algorithms.
One of the features of the transformations produced is that it will generate a REVERSE_EXPRESSION. This will take the scored results and apply the inverse of the transformation that was performed when the data was being prepared for input to the algorithm.
Somethings you may want to have the scored data returned in a slightly different ways or labeled in a slightly different way.
In this blog post I will show you how to define an alternative REVERSE_EXPRESSION for an attribute.
The function we need to use for this is the ALTER_REVERSE_EXPRESSION procedure that is part of the DBMS_DATA_MINING package.
When we score data for a typical classification problem we typically use 0 (zero) and 1 to be the target variable values. But what if we wanted the output from our classification model to label the scored data slighted differently.
In this case we can use the ALTER_REVERSE_EXPRESSION procedure to define the new values. What if we wanted the zero to be labeled as NO and the 1 as YES. In this case we can use the following.
model_name => ‘CLAS_NB_1_59’,
expression => ‘decode(affinity_card, ”1”, ”YES”, ”NO”)’,
attribute_name => ‘AFFINITY_CARD’);
When we view the transformations for our data mining model we can now see the transformation.
Now when we score our data the predicted target variable will now have our newly defined values.
PREDICTION(CLAS_NB_1_59 USING *) PRED
FETHC FIRST 5 ROWS ONLY;
You can see that this is a very powerful feature and allows use to turn the scored data values is a different way to make them more useful. This is particularly the case as we work towards a more Automatic type of Predictive Analytics.
A very powerful feature of Oracle Data Mining and one that I think does not get enough notice is called Automatic Data Preparation.
Data Preparation is one of the most time consuming, repetitive and boring parts of the work that a Data Miner or Data Scientist performs as part of their daily tasks. Apart from gathering the data, integrating the data, getting the data into the required formation the most interesting part of the work is with feature engineering.
Then you have all the other boring data preparation tasks of how to handle missing data, type conversion, binning, normalization, outlier treatment etc.
With Automatic Data Preparation (ADP) in Oracle Data Mining you can let Oracle work all of these things out for you and to perform all the necessary coding and to store all of this coding as part of the in-database data mining model.
This is Fantastic. This ADP feature can same you hours and in some cases days of effort.
But (there is always a but 🙂 ) what if you are a bit unsure if the transformations that are being performed are exactly what you would wanted. Maybe you would like to see what Oracle is doing and depending on this you can do it a different way.
The first step is to examine the transformations that are generated by stored as part of the in-database data mining model. The DBMS_DATA_MINING package has a function called GET_MODEL_TRANSFORMATIONS. When you query this function, passing in the name of the data mining model, you will get returned the list of transformations that have been applied to each model.
In the following example a GLM model was created using the Oracle Data Miner tool (that is part of SQL Developer). When you use Oracle Data Miner, ADP is automatically turned on.
The following query calls the GET_MODEL_TRANSFORMATIONS function with the data mining model called CLAS_GLM_1_59/.
SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS(‘CLAS_GLM_1_59’));
The following image contains the output generated by this query.
When you look at the data under the EXPRESSION column we get to see what the ADP did to the data. In most of the cases there are just some simple data clean-up being performed and formatting for getting the data ready for input into the algorithm.
If we now look at the Naive Bayes model for the same data set we get a very different sent of transformations being listed under the EXPRESSION column.
SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS(‘CLAS_NB_1_59’));
Now we get to see some of the data binning that ADP performs and is required for input to the Naive Bayes algorithm. You will also notices that we also have some transformations in the REVERSE_EXPRESSION column. These are the inverse or reverse of the transformation that was generated in the EXPRESSION column.
I will let you explore the data transformations that are produced by ADP for the SVM and Decision Tree algorithms.
I will show you how you change the reverse expression in my next blog post, as there are times when you might want the data to be presented slightly differently after the model has been run to score your data.
To get more details of what Automatic Data Preparation is performed for each data mining algorithm you can check out this link in the 11g documentaion. This section seems to be missing from the online 12c documentation.
Everyone is doing advanced analytics. Right? Hmm
Everyone is talking about advanced analytics? Yes that is true.
Everyone is an expert in advanced analytics? This is so not true. Watch out for these Great Pretenders. You know what I mean! You know who I mean! Maybe you know some of them already? If not, watch out for these Great Pretenders!!!
Some people are going around talking about data mining, predictive analytics, advanced analytics, machine learning etc as if this is some new topic. Well it isn’t. It isn’t anything new and most of the techniques have been about for 10, 20, 30+ years.
Some people are saying you should only use language X or tool Y because. Everything else is basically rubbish.
What we do have is a wider understanding of how to use these techniques on our various data sources.
What we have is a lot more tools that allow us to perform these tasks a lot easier, at greater speed, with more functionality and without the need to fully understand the hard core maths that is going on behind the scenes.
What we have is a lot more languages to perform these tasks and to support the vast amount of work that goes into understanding the data and preparing the data.
Someone thing for all of us to watch out for, when we ready about these topics, is what kind of problem area they are addressing. The following table illustrates the three main types or categories of Analytics. These categories are Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. I think most people would agree that the Descriptive and Predictive Analytics categories are very mature at this stage. With Predictive Analytics we are perhaps still evolving in this category and a lot more work needs to be done before this this become wide spread.
Some people talk as if Predictive Analytics is some new and exciting topic. But isn’t all that new. It was been around for the past 30+ years. If you go back over the Gartner Hype Cycle that comes out every September, Predictive Analytics is no longer being shown on this graph. The last time it appeared on the Gartner Hype Cycle was back in 2013 and it was positioned on the far right of the graph in the section called Plateau of Productivity.
So Predictive Analytics is very mature and main stream. Part of the reason that it is main stream is that Predictive Analytics has allowed for a new category of Analytics to evolve and this is Automatic Analytics.
Automatic Analytics is where Advanced and Predictive Analytics has been build into our day to day applications that are used to run our business. We do not need the hard core type of data scientists to perform various analytic on our data. Instead these task, once they have been defined, can then be added to our applications to process, evaluate and make decisions all automatically. This is were we need the data scientists to be able to communicate with the business and be able to work with them to solve real world business projects. This is a different type of data scientist to the “hard” core data scientist who delves into the various statistical methods, machine learning methods, data management methods, etc.
The following table extends the table given above to include Automatic Analytics, and is my own take on how and where Automatic Analytics fits.
Every time we get an insurance quote, health insurance quote, get a “random” call from our Telco offering a free upgrade, get our loyalty card statements, get a loan from the bank, look at or buy a book on Amazon, etc. the list could go on and on, but these are all examples of how predictive analytics has been automated into our everyday business application.
But this is nothing new. When I first got into data mining/predictive analytics over 16 years ago, it was considered a common thing that certain types of companies did. What has happened in the time since and particularly in the past few years is that a lot more people are seeing the value in using it.
Before I finish off this post we can have a quick look at what Oracle has been doing in this area. They have their Advanced Analytics Option and Real-Time Decisions tools to all data scientists do their magic. But over the past X years (nobody can give me an exact number) they have been very, very active in building in lots and lots of predictive analytics into their various business applications, particularly with into with Fusion Apps and BI Apps.
A recent quote from Oracle highlights their aim with this,
” … products designed to close the gap between data scientists and businesses.“
Now with Oracle making a big push to the cloud, they are busy adding in more and more Automatic (Predictive) Analytics into their Cloud Applications. What we need from Oracle is a clearer identification of where they have done this. Plus with the migration of their Apps to the cloud, their Advanced Analytics Option is a core part of their Cloud platform. As they upgrade or add new features into their Cloud Apps, you will now be able to get the benefit of these Automatic (Predictive) Analytics as they come available.
Check out the video of US President talking about Data Science and the first Chief Data Scientist of the USA talks about his mission.
When you are working on building classification models you will need some ways of measuring the effectiveness of each model that you will build. This measurement/evaluation is perform during the model build process.
Typically the model build process consists of 2 steps (I’m assuming all data preparation etc has been completed:
- Build the model: During this step you will feed in a portion of your data set to the data mining algorithm. Typical this data will be a subset of your data set and will typically consist of 60% to 70% of the data. This data is used to by the data mining algorithm to build the model.
- Test the model: After the model has been built you will need to test the model to see how efficient it is at making the predictions. This is where we use the data that was not used to build the model. For this data we already know the outcome. So after we have applied the model to this data subset we can measure the predicted values against the actual values.
Most of the data mining tools will automate these two steps, specifically the splitting the data into the build and test data sets. But if you are using a language like R, etc then you will need to manually perform these steps.
The most common way of collating the test results is to use the Confusion Matrix. This allows us to layout the correct predictions, the incorrect predictions and to perform a number of other statistical measurements.
The last two of the above values are also commonly referred to in statistics as Type 1 (false positive) and Type 2 (false negative) errors.
Depending on your project you will concentrate on a combination of the true and false values of either the Positives or the negatives.
For example, in Medical Diagnostics for cancer, you will be looking to keep the False Negatives to a minimum. This is where you have predicted someone does not have cancer, but actually does. The consequence of this is that the person is not brought back for addition testing and we all know what will happen. On the other hand it is OK to have a hight False Positive in this case. In this scenario you bring the person back for additional tests and discover that they are all clear 🙂
Precision = How many of the selected items are relevant? (as a percentage)
Recall = How many of the relevant items are selected? (as a percentage)
Accuracy = How many did we correctly predict? (as a percentage)
The following table illustrates these measurements and tests.
There are lots of other statistical tests that can be performed on your results. Everyone will have their own preferences. What I have highlighted here are the main statistical test for you to look at.
You cannot use one or a few of the statistical tests to make a decision on what data mining model works best for your data. It is a combination of these statistical test, your understanding of the data and you understanding of the business project that need to be considered.
In my next 2 blog posts I will show you how you can perform these tests on the results generated by the Oracle Data Miner tool and then on the Oracle Data Miner models produced using PL/SQL.