Oracle Analytics Option
Predicting using ORE package
In a previous post I gave an overview of the various in-database data mining algorithms that you can use in your Oracle R Enterprise scripts.
To create data mining models based on those algorithms you need to use the ore.odm functions.
After you have developed and tested your models you will select one of these to score your new data.
How can you do this using ORE? There is a suite of ORE functions called ore.predict that you can use to apply your data mining model to score or label new data.
The following table lists the ore.predict functions:
| ORE Predict Function | Description |
|---|---|
| ore.predict-glm | Generalized linear model |
| ore.predict-kmeans | k-Means clustering model |
| ore.predict-lm | Linear regression model |
| ore.predict-matrix | A matrix with no more than 1000 rows |
| ore.predict-multinom | Multinomial log-linear model |
| ore.predict-nnet | Neural network models |
| ore.predict-ore.model | An Oracle R Enterprise model |
| ore.predict-prcomp | Principal components analysis on a matrix |
| ore.predict-princomp | Principal components analysis on a numeric matrix |
| ore.predict-rpart | Recursive partitioning and regression tree model |
As you will see from the above table, there are more ore.predict functions than there are ore.odm functions. The reason for this is that ORE comes with some additional data mining algorithms, in addition to the sub-set of Oracle Data Mining algorithms that it uses. These include ore.glm, ore.lm, ore.neural and ore.stepwise.
You also need to watch out for the data mining algorithms that are not used in prediction. These include the Minimum Description Length, Apriori and Non-Negative Matrix Factorization.
Remember that these ore.predict functions are run inside the Oracle Database. No data is extracted to the data analyst's laptop or desktop; all the data stays in the database, and the ORE functions are run in the database on that data.
Issues with using latest release of ODM
The title of this blog post makes it sound more dramatic than it actually is.
The reason for this blog post comes down to a recent comment on the blog, plus numerous emails and a recent OTN Discussion Forum topic for Oracle Data Mining.
The main thing they have in common is this: if I use the latest version of Oracle Data Mining (ODM) it tells me that I need to upgrade my ODM Repository. What impact will this have?
The ODM Repository stores lots of information about the workflows you create using the (free) Oracle Data Mining tool that comes as part of SQL Developer. Yes, you do have to pay for the OAA option, so is it really free? Well, some parts are, like the Explore node and the Graph node.
If you download and want to use the latest version of the ODM tool or you want to try it out before rolling it out to others then you will need to upgrade your ODM repository.
And this is the problem that people are facing.
If you upgrade, then the ODM Repository is updated to work with the latest version of the ODM tool. But what happens to everyone else who is using the previous release of the tool? The answer is that they can no longer use ODM against their database.
Why is that? Well the version of the tool is tied to a version of the Repository. If you upgrade to the newer tool and repository then your older versions of the ODM tool no longer work.
The result of all of this is that you cannot have a mixture of versions of the ODM tool (SQL Developer) being used in your team/company.
There is a very simple solution to all of this: everyone uses the same version of the ODM tool (i.e. the same version of SQL Developer). For example, your team might be using SQL Dev 4, which was released last December, but in early March there was a new patch release, 4.1. In order to use this new version of the tool, all of your team needs to start using it at the same time. The first person to use it will be prompted to migrate the ODM repository. This is done automatically once you enter the password for SYS.
But in some teams this is not possible to do: you want to try out the tool and see that it works correctly before getting others to use it. The way around this is to have a separate database and use it for your testing. You can easily copy across your workflows and ODM objects to the test database.
This might not be possible for everyone, so what can you do? One way is to create a Virtual Machine and try it out on your own desktop.
The answer to this problem is not ideal, but hopefully you have a better idea of why things are happening this way and what you can or cannot do about it.
Like I said at the top of this blog post, the title is a bit more dramatic than is really the case 🙂
My next blog post will be on another question I’ve been asked a few times: ‘When I go to use the ODM tool it tells me that the Oracle Text feature of Oracle needs to be enabled’.
ORE 1.4 New Parallel feature
Oracle R Enterprise (ORE) 1.4 has just been released and can be downloaded from here. Remember there is a client and a server side install required, and ORE 1.4 is certified against R 3.0.1 and the Oracle R Distribution.
One of the interesting new features is the PARALLEL option. You can set this to significantly improve the performance of your R server side code by using the PARALLEL database option. You can set the degree of PARALLEL at a global level in your code by using the ore.parallel setting.
The default value for the ore.parallel setting is FALSE or 1. Otherwise it must be set to a value of 2 or more to enable the Parallel database option.
Alternatively you can set ore.parallel to TRUE to use the default degree of parallelism that is set for the database object, or set it to NULL to use the default database setting.
You will also be able to set the degree of parallel (DOP) using the parallel enabled functions ore.groupApply, ore.rowApply and ore.indexApply.
They have also made available, or as they say exposed, some more of the in-database Oracle Data Mining algorithms. These include the ODM algorithm for Association Rules (ore.odmAssocRules), the feature extraction algorithm Non-Negative Matrix Factorization (NMF) (ore.odmNMF) and the ODM clustering algorithm O-Cluster (ore.odmOC).
Watch out for some blog posts on these over the coming weeks.
Check out the OTN page for the R Technologies from Oracle
ODM: Changing the bar chart format in Explore Node
In Oracle Data Miner you can use the Explore Node to gather an initial set of statistics for your dataset. As part of this you will also get a bar chart that shows the distributions of the values contained within each attribute. The following example shows the default layout of the bar charts.
These graphs are very useful for presenting the initial data exploration results to your business users. In addition to these graphs you can also use the Graph node to give some additional graphical representations.
But the default bar chart that is produced by the Explore Node can appear to be a bit basic.
So what if we could change the layout to have a 3-D effect? People like 3-D bar charts.
Is this possible in Oracle Data Miner? If so then how can we do it?
Well it is possible and you can use the following steps to change your bar charts to 3-D.
To access the Explore Node settings go to the Tools menu and then select Preferences from the drop down menu.
When the Preferences window opens, scroll down to the Data Miner option and expand the available options.
The Explorer Data Viewer allows you to change the Precision settings. The second option is the Graphical Settings. You can change the Depth Radius setting. By default this is set to Zero. By increasing this value you can change the degree of the 3-D effect of the bar charts. You can also change the colour scheme too.
I’m not a fan of the other colour schemes that are available and my favourite is still the default Nautical. The following bar chart is the same as the one at the top of this post but has the 3-D effect.
DBMS_PREDICTIVE_ANALYTICS & Predict
In this blog post I will look at the PREDICT procedure that is part of the DBMS_PREDICTIVE_ANALYTICS package. This package allows you to perform data mining in an automated way without having to go through the steps of building, testing and scoring data.
The predictive analytics procedures analyze and prepare the input data, create and test mining models using the input data, and then use the input data for scoring. The results of scoring are returned to the user. The models and supporting objects are not persisted and are removed from the database when the procedure is finished.
The PREDICT procedure should only be used for a Classification problem and data set.
The PREDICT procedure creates a model based on the supplied data (our input table) and a target value, and returns a scored data set in a new table. When using PREDICT you do not get to select an algorithm to use.
The input data source should contain records that already have the target value populated. It can also contain records where you do not have the target value. In this case the PREDICT procedure will use the records that have a target value to generate the model. This model will then score all records with a predicted target value.
The syntax of the PREDICT procedure is:
DBMS_PREDICTIVE_ANALYTICS.PREDICT (
accuracy OUT NUMBER,
data_table_name IN VARCHAR2,
case_id_column_name IN VARCHAR2,
target_column_name IN VARCHAR2,
result_table_name IN VARCHAR2,
data_schema_name IN VARCHAR2 DEFAULT NULL);
Where:

| Parameter Name | Description |
|---|---|
| accuracy | This is an output parameter from the procedure; you do not pass anything into it. The accuracy value returned is the predictive confidence of the model generated/used by the PREDICT procedure |
| data_table_name | The name of the table that contains the data you want to use |
| case_id_column_name | The case id for each record. This is unique for each record/case |
| target_column_name | The name of the column that contains the target to be predicted |
| result_table_name | The name of the table that will contain the results. This table should not exist in your schema, otherwise an error will occur |
| data_schema_name | The name of the schema where the table containing the input data is located. This is probably your current schema, so you can leave this parameter NULL |
The PREDICT procedure will produce an output table (named by the result_table_name parameter) that will contain 3 attributes.

| Attribute | Description |
|---|---|
| CASE_ID | This is the Case Id of the record from the original data_table_name. This will allow you to link up the data in the source table to the prediction in the result_table_name |
| PREDICTION | This will be the predicted value of the target attribute |
| PROBABILITY | This is the probability of the prediction being correct |
Using the sample example data set that I have given in previous blog posts and in the blog post on the EXPLAIN procedure, the following code illustrates how to use the PREDICT procedure.
set serveroutput on
DECLARE
   v_accuracy NUMBER(10,9);
BEGIN
   DBMS_PREDICTIVE_ANALYTICS.PREDICT(
      accuracy            => v_accuracy,
      data_table_name     => 'mining_data_build_v',
      case_id_column_name => 'cust_id',
      target_column_name  => 'affinity_card',
      result_table_name   => 'PA_PREDICT');
   DBMS_OUTPUT.PUT_LINE('Accuracy of model = ' || v_accuracy);
END;
This took about 15 seconds to run on my laptop, which is surprisingly quick given all the work that it is doing internally. To see the predictions and the results from the PREDICT procedure, you will need to query the PA_PREDICT table.
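For example, a quick look at the first few scored records (a minimal sketch; PA_PREDICT is the result table we named above):

SELECT *
FROM   pa_predict
WHERE  rownum <= 10;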
The final step that you might be interested in is to compare the original target value with the prediction value.
SELECT v.cust_id,
v.affinity_card,
p.prediction,
p.probability
FROM mining_data_build_v v,
pa_predict p
WHERE v.cust_id = p.cust_id
AND rownum <= 12;
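If you want a rough overall picture rather than the row-by-row comparison above, something along these lines (a sketch using the same view and result table) counts how many of the known target values were predicted correctly:

-- Compare the known target values with the PREDICT output
SELECT count(*) total_records,
       sum(CASE WHEN v.affinity_card = p.prediction THEN 1 ELSE 0 END) num_correct
FROM   mining_data_build_v v,
       pa_predict p
WHERE  v.cust_id = p.cust_id;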
Remember we do not get to see how or what Oracle did to generate these results. We do not get the opportunity to tune the process and the model.
So you have to be careful when you use the PREDICT procedure and on what data. Would you use this as a way to explore your data and to see if predictive analytics/data mining might be useful for you? Yes it would. Would you use it in a production scenario? The answer is maybe, but it depends on the scenario. In reality, if you want to do this in a production environment you will put some work into developing data mining models that best fit your data. To do this you will need to move onto the ODM tool and the DBMS_DATA_MINING package. But the PREDICT procedure is a quick way to get some small data scored (in some way) based on your existing data. If your marketing department says they want to start a telemarketing campaign in a couple of hours, then PREDICT is what you need to use. It may not give you the most accurate of results, but it does give you results that you can start using quickly.
DBMS_PREDICTIVE_ANALYTICS & Explain
There are 2 PL/SQL packages for performing data mining/predictive analytics in Oracle. The main PL/SQL package is DBMS_DATA_MINING. This package allows you to build data mining models and to apply them to new data. But there is another PL/SQL package.
The DBMS_PREDICTIVE_ANALYTICS package is very different to the DBMS_DATA_MINING package. The DBMS_PREDICTIVE_ANALYTICS package includes routines for predictive analytics, an automated form of data mining. With predictive analytics, you do not need to be aware of model building or scoring. All mining activities are handled internally by the predictive analytics procedure.
Predictive analytics routines prepare the data, build a model, score the model, and return the results of model scoring. Before exiting, they delete the model and supporting objects.
The package comes with the following functions: EXPLAIN, PREDICT and PROFILE. To get some details about these functions we can run the following in SQL.
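For example, a query along these lines against the ALL_PROCEDURES data dictionary view (a sketch, not necessarily the exact query from the original post) lists the procedures in the package:

SELECT procedure_name
FROM   all_procedures
WHERE  owner = 'SYS'
AND    object_name = 'DBMS_PREDICTIVE_ANALYTICS'
AND    procedure_name IS NOT NULL;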
This blog post will look at the EXPLAIN function.
EXPLAIN creates an attribute importance model. Attribute importance uses the Minimum Description Length algorithm to determine the relative importance of attributes in predicting a target value. EXPLAIN returns a list of attributes ranked in relative order of their impact on the prediction. This information is derived from the model details for the attribute importance model.
Attribute importance models are not scored against new data. They simply return information (model details) about the data you provide.
I’ve written two previous blog posts on Attribute Importance. One of these was on how to calculate Attribute Importance using the Oracle Data Miner tool. In the ODM tool it is now called Feature Selection and is part of the Filter Columns node and the Attribute Importance model is not persisted in the database. The second blog post was how you can create the Attribute Importance using the DBMS_DATA_MINING package.
EXPLAIN ranks attributes in order of influence in explaining a target column.
The syntax of the function is
DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
data_table_name IN VARCHAR2,
explain_column_name IN VARCHAR2,
result_table_name IN VARCHAR2,
data_schema_name IN VARCHAR2 DEFAULT NULL);
Where:

| Parameter Name | Description |
|---|---|
| data_table_name | Name of the input table or view |
| explain_column_name | Name of the column to be explained |
| result_table_name | Name of the table where the results are saved. It creates a new table in your schema |
| data_schema_name | Name of the schema where the input table or view resides. Default: the current schema |
So when calling the function you do not have to include the last parameter.
Using the same example that I have given in the previous blog posts (see above for the links to these), the following command can be run to generate the Attribute Importance.
BEGIN
   DBMS_PREDICTIVE_ANALYTICS.EXPLAIN(
      data_table_name     => 'mining_data_build_v',
      explain_column_name => 'affinity_card',
      result_table_name   => 'PA_EXPLAIN');
END;
One thing that stands out is that it is a bit slower to run than the DBMS_DATA_MINING method. On my laptop it took approximately two to three times longer to run. But in total it was less than a minute.
To display the results, you can query the PA_EXPLAIN table. The results are ranked in the range 0 to 1, and any attribute that has a negative explanatory value is set to zero.
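A minimal query along these lines will show the ranked attributes (the column names are those documented for the EXPLAIN results table; adjust if your version differs):

SELECT attribute_name,
       explanatory_value
FROM   pa_explain
ORDER  BY explanatory_value DESC;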
Getting Real Business Value from Oracle Data Mining and OBIEE
Over the past 16 months (or so) I have given a joint presentation with Anthony Heljula called ‘Getting Real Business Value from Oracle Data Mining and OBIEE’ at a number of conferences and OUG SIGs.
We have had a lot of very positive feedback on this presentation. The presentation is a busy 45 minutes (questions only at the end) that walks through a pilot data science project we did for a University in the UK.
We used Oracle Data Miner to build a predictive model that looks at student churn. We then integrated this Student Churn model into OBIEE Dashboards to illustrate how, by combining an Oracle Data Miner model with our data analysis, we can gain a greater insight into our data.
We have submitted this presentation for Oracle Open World 2013 but we have renamed the title of the presentation to
“How UK Universities are using Oracle Data Science to protect their income”
If you are involved in presentation selection or know someone who is then maybe you might select this to be presented at OOW13 in September.
We submitted the presentation for OOW12 with no luck. So fingers crossed this time.
Oracle Data Miner Videos–Updated list
Over the past couple of weeks Charlie Berger has put together a few videos on Oracle Data Miner and has posted these on YouTube. Below are the links to these videos and to the YouTube videos I made back in 2011 on Oracle Data Miner.
Oracle Data Miner Comes of Age – Brendan Tierney
ODM 11g R2 – Creating ODM User & Repository – Brendan Tierney
ODM 11gR2 – Exporting and Importing ODM Workflows – Brendan Tierney
ODM 11gR2 – Dropping the Repository – Brendan Tierney
I must get back to making a few more videos!
Overview presentation and demonstration of Oracle Advanced Analytics Option – by Charlie Berger
Fraud and Anomaly Detection using Oracle Advanced Analytics Part 1 Concepts – by Charlie Berger
Fraud and Anomaly Detection using Oracle Advanced Analytics Part 2 Demo – by Charlie Berger
Oracle buys Darwin back in 1999
The following is an extract from the September/October 1999 edition of Oracle Magazine, about Oracle buying Thinking Machines. Their data mining software Darwin was integrated into the Oracle Database and renamed Oracle Data Miner.
“Oracle Corporation’s recent acquisition of Thinking Machines’ data mining business extends Oracle’s data warehouse platform and business intelligence solution to include enterprise reporting, ad hoc query, advanced analysis and data mining software based on a common internet platform.
Oracle plans to incorporate the data mining software as an integral feature of Oracle Applications Customer Relationship Management site, which will facilitate the implementation of the e-business solutions developed by Oracle customers. In addition to the software technology, Oracle will receive rights to the domains think.com and thinkingmachines.com.
About Thinking Machines
Originally founded in 1983, Thinking Machines Corporation revolutionized high performance computing with its massively parallel supercomputing technology. The company has since evolved to focus exclusively on its Darwin data mining software for database marketing in the financial services and telecommunications industries. Darwin analyzes massive volumes of customer transaction, demographic and psychographic data, which can often amount to hundreds of millions of customer data records.
These advanced analyses help companies profile and target customers with greater accuracy, which allows companies to reduce customer attrition, assess customer profitability, cross sell to existing customers and detect fraud.
Darwin puts powerful data mining techniques in the hands of general business users and experienced analysts alike. Easy to use wizards automate data mining while providing advanced users with full control over all options and parameters. The Darwin software combines advanced analytics – including neural networks, decision trees and memory-based reasoning – with impressive power and performance.
The solution’s one-button model code generation, powerful scripting language and robust software development kit bring prediction capabilities to sales, call center, marketing and the web.
Platforms and Languages
Darwin runs on Sun Microsystems and Hewlett-Packard servers and exports data mining models in C, C++ and Java for execution within Oracle Databases. A Microsoft Windows NT release is planned for later this year.”
Part 2–Getting start with Statistics for Oracle Data Science projects
This is the second blog on getting started with Statistics for Oracle Data Science projects.
- The first blog post in the series looked at the DBMS_STAT_FUNCS PL/SQL package, what it can be used for and I give some sample code on how to use it in your data science projects. I also give some sample code that I typically run to gather some additional stats.
- The second blog post will look at some of the other statistical functions that exist in SQL that you will/may use regularly in your data science projects.
- The third blog post will provide a summary of the other statistical functions that exist in the database.
In this blog post I will look at 3 more useful statistical functions that are available in the Oracle database. Remember these come as standard with the database. The first function I will look at is the WIDTH_BUCKET function. This can be used to create some histograms of the data. A common task in analytics projects is to produce some cross tabs of the data; for this Oracle has the STATS_CROSSTAB function. The last thing I will look at is the different ways you can sample the data.
Histograms using WIDTH_BUCKET
When exploring your data it is useful to group values together into a number of buckets. Typically you might want to define the width of each bucket yourself before passing the data into your data mining tools, but before you can decide what these are you need to do some exploring using a variety of widths. A good way to do this is to use the WIDTH_BUCKET function. This takes the following inputs:
- Expression: This is the expression or attribute on which you want to build the histogram.
- Min Value: This is the lower or starting value of the first bucket.
- Max Value: This is the last or highest value for the last bucket.
- Num Buckets: This is the number of buckets you want created.
Typically the Min Value and the Max Value can be calculated using the MIN and MAX functions. As a starting point you would generally select 10 for the number of buckets. This is the number you will change, downwards as well as upwards, to see if a particular pattern exists in the attribute.
Using the example scenario that I used in the first blog post, let us start by calculating the MIN and MAX for the AGE attribute.
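A simple query (a sketch against the sample view) gives us these values:

SELECT min(age) min_age,
       max(age) max_age
FROM   mining_data_build_v;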
Let's say that we want to create 10 buckets. With AGE ranging from 17 to 90, this gives a bucket width of 7.3, resulting in the following buckets:
Bucket 1: 17-24.3
Bucket 2: 24.3-31.6
Bucket 3: 31.6-38.9
Bucket 4: 38.9-46.2
Bucket 5: 46.2-53.5
Bucket 6: 53.5-60.8
Bucket 7: 60.8-68.1
Bucket 8: 68.1-75.4
Bucket 9: 75.4-82.7
Bucket 10: 82.7-90
These are the buckets that the WIDTH_BUCKET function gives us in the following:
SELECT cust_id,
age,
width_bucket(age,
(SELECT min(age) from mining_data_build_v),
(select max(age)+1 from mining_data_build_v),
10) bucket
from mining_data_build_v
where rownum <=12
group by cust_id, age
To plot a histogram for an attribute we need an additional level of detail: we need to aggregate the records by bucket. The following example shows the pattern using the SALARY attribute of the HR.EMPLOYEES table.
select intvl, count(*) freq
from (select width_bucket(salary,
             (select min(salary) from hr.employees),
             (select max(salary)+1 from hr.employees), 10) intvl
      from hr.employees)
group by intvl
order by intvl;
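The same pattern, applied to the AGE attribute of our sample view, would look something like this (a sketch):

SELECT bucket,
       count(*) freq
FROM   (SELECT width_bucket(age,
                 (SELECT min(age) FROM mining_data_build_v),
                 (SELECT max(age)+1 FROM mining_data_build_v),
                 10) bucket
        FROM mining_data_build_v)
GROUP  BY bucket
ORDER  BY bucket;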
We can take this code and embed it into the GATHER_DATA_STATS procedure that I gave in my Part 1 blog post.
Cross Tabs using STATS_CROSSTAB
Typically cross tabulation (or crosstabs for short) is a statistical process that summarises categorical data to create a contingency table. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
Because Crosstabs creates a row for each value in one variable and a column for each value in the other, the procedure is not suitable for continuous variables that assume many values.
In Oracle we can perform crosstabs using one of the reporting tools. But if you don’t have one of these, you can use the in-database function STATS_CROSSTAB. This function takes three parameters: the first two are the attributes you want to compare, and the third is the test we want to perform. The tests available include:
- CHISQ_OBS: Observed value of chi-squared
- CHISQ_SIG: Significance of observed chi-squared
- CHISQ_DF: Degree of freedom for chi-squared
- PHI_COEFFICIENT: Phi coefficient
- CRAMERS_V: Cramer’s V statistic
- CONT_COEFFICIENT: Contingency coefficient
- COHENS_K: Cohen’s kappa
CHISQ_SIG is the default.
Now let us look at some examples using our same data set.
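For instance, the following compares CUST_GENDER against AFFINITY_CARD using a few of the available tests (a sketch; these are two of the categorical attributes in the sample view):

SELECT stats_crosstab(cust_gender, affinity_card, 'CHISQ_OBS') chi_squared,
       stats_crosstab(cust_gender, affinity_card, 'CHISQ_SIG') chi_sig,
       stats_crosstab(cust_gender, affinity_card, 'CRAMERS_V') cramers_v
FROM   mining_data_build_v;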
Sampling Data
When our datasets are of relatively small size, consisting of a few hundred thousand records, we can explore the data in a relatively short period of time. But if your data sets are larger than that, you may need to explore the data by taking a sample of it. What sampling does is take a “random” selection of records from our data set, up to the number of records we have specified in the sample.
In Oracle the SAMPLE function takes a percentage figure. This is the percentage of the entire data set you want to have in the Sampled result.
There is also a variant called SAMPLE BLOCK and the figure given is the percentage of records to select from each block.
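As illustrations (a sketch; the 20% figure is arbitrary and I’m using the SH.CUSTOMERS table from the sample schemas, one of the base tables behind MINING_DATA_BUILD_V):

-- Row sampling: approximately 20% of the rows
SELECT count(*) FROM sh.customers SAMPLE (20);

-- Block sampling: approximately 20% of the blocks
SELECT count(*) FROM sh.customers SAMPLE BLOCK (20);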
Each time you use the SAMPLE function Oracle will generate a random seed number to use for that sample. If you omit a Seed number (like in the above examples), you will get a different result set in each case and the result set will have a slightly different number of records. If you run the sample code above over and over again you will see that the number of records returned varies by a small amount.
If you would like to have the same Sample data set returned each time then you will need to specify a Seed value. The Seed must be an integer between 0 and 4294967295.
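For example (a sketch; the seed value of 42 is arbitrary):

SELECT count(*) FROM sh.customers SAMPLE (20) SEED (42);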
In this case because we have specified the Seed we get the same “random” records being returned with each execution.
Part 1–Getting started with Statistics for Oracle Data Science projects
With all analytics or data science projects one of the first steps typically involves the extraction of data from various sources, merging the data and then performing various statistics.
The extraction and merging of the data is well covered by lots of other people blogging about how to use Oracle Data Integration (ODI), Oracle Warehouse Builder (OWB), among many others.
What I’m going to look at in this series of blog posts is what statistical functions you might use in the Oracle Database and how to use them.
- This, the first blog post in the series, will look at the DBMS_STAT_FUNCS PL/SQL package, what it can be used for, and I give some sample code on how to use it in your data science projects. I also give some sample code that I typically run to gather some additional stats.
- The second blog post will look at some of the other statistical functions that exist in SQL that you will/may use regularly in your data science projects.
- The third blog post will provide a summary of the other statistical functions that exist in the database.
These statistical functions can be grouped into 2 main types. The first is the descriptive statistics that are available by using the DBMS_STAT_FUNCS PL/SQL package, and then there is the extensive list of other SQL stats functions. It is worth mentioning at this point that all these stats packages and functions come as standard in the database (i.e. they are FREE; you do not have to pay for an add-on option for the database to use them). So once you have your Oracle database installed you can start using them. There is no need to spend money buying another stats package to do stats. All you need to know is some SQL and what the stats functions are.
DBMS_STAT_FUNCS
One of the stats packages that I use a lot is the SUMMARY function. This is part of the DBMS_STAT_FUNCS PL/SQL package. This package calculates a number of common statistics for an attribute in a table. Yes that’s correct, it only gathers statistics for just one attribute. So you will have to run it for all the numeric attributes in the table.
For those people who are familiar with the Oracle Data Miner tool, the Explore Data node produces a lot of the statistics that the SUMMARY function produces. See below for details of how to produce the Histograms.
The SUMMARY function has the following parameters
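Paraphrased from the Oracle documentation (treat this as a sketch; the exact parameter names may vary slightly between database versions), the signature looks like this:

DBMS_STAT_FUNCS.SUMMARY (
   p_ownername   IN  VARCHAR2,
   p_tablename   IN  VARCHAR2,
   p_columnname  IN  VARCHAR2,
   p_sigma_value IN  NUMBER DEFAULT 3,
   s             OUT DBMS_STAT_FUNCS.SummaryType);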
Although you will probably be running this function on the data in your own schema, you still have to give the schema name. The table name is the name of the table where the data exists, the column name is the name of the column that contains the actual data you want to analyse, and the ‘s’ is the record that will be returned by the SUMMARY function that contains all the summary information.
An example of the basic script to run the SUMMARY function is given below. It will use the data that is available in the sample schemas and the views that were set up for the Oracle Data Mining sample schemas. The table (or in this case the view) that we are going to use is MINING_DATA_BUILD_V. What we are going to do is to replicate some of what the Explore Node does in the Oracle Data Miner tool.
set serveroutput on
declare
   s DBMS_STAT_FUNCS.SummaryType;
begin
   DBMS_STAT_FUNCS.SUMMARY('DMUSER', 'MINING_DATA_BUILD_V', 'AGE', 3, s);
   dbms_output.put_line('SUMMARY STATISTICS');
   dbms_output.put_line('Count : '||s.count);
   dbms_output.put_line('Min : '||s.min);
   dbms_output.put_line('Max : '||s.max);
   dbms_output.put_line('Range : '||s.range);
   dbms_output.put_line('Mean : '||round(s.mean));
   dbms_output.put_line('Mode Count : '||s.cmode.count);
   dbms_output.put_line('Mode : '||s.cmode(1));
   dbms_output.put_line('Variance : '||round(s.variance));
   dbms_output.put_line('Stddev : '||round(s.stddev));
   dbms_output.put_line('Quantile 5 : '||s.quantile_5);
   dbms_output.put_line('Quantile 25 : '||s.quantile_25);
   dbms_output.put_line('Median : '||s.median);
   dbms_output.put_line('Quantile 75 : '||s.quantile_75);
   dbms_output.put_line('Quantile 95 : '||s.quantile_95);
   dbms_output.put_line('Extreme Count : '||s.extreme_values.count);
   dbms_output.put_line('Extremes : '||s.extreme_values(1));
   dbms_output.put_line('Top 5 : '||s.top_5_values(1)||','||
                                    s.top_5_values(2)||','||
                                    s.top_5_values(3)||','||
                                    s.top_5_values(4)||','||
                                    s.top_5_values(5));
   dbms_output.put_line('Bottom 5 : '||s.bottom_5_values(5)||','||
                                       s.bottom_5_values(4)||','||
                                       s.bottom_5_values(3)||','||
                                       s.bottom_5_values(2)||','||
                                       s.bottom_5_values(1));
end;
/
We can compare this to what is produced by the Explore Node in ODM
We can see that the Explore Node gives us more statistics to help us with understanding the data.
What Statistics does the Explore Node produce
We can see the actual SQL code that the Explore Node runs to get the statistics that are displayed in the Explore Node View Data window. To do this you will need to right-click on the Explore Node and move the mouse down to the Deploy option. The submenu will open; select ‘SQL to Clipboard’ from the list. Open a text editor and paste the code. You will need to tidy up some of this code to point it at the actual data source you want. You will get the following:
SELECT /*+ inline */ ATTR,
DATA_TYPE,
NULL_PERCENT,
DISTINCT_CNT,
DISTINCT_PERCENT,
MODE_VALUE,
AVG,
MIN,
MAX,
STD,
VAR,
SKEWNESS,
KURTOSIS,
HISTOGRAMS
FROM OUTPUT_1_23;
Where OUTPUT_1_23 is a working table that ODM has created to store intermediate results from some of its processing, in this case the Explore Node. You will need to change this to the name of the ODM working table in your schema.
This query does not perform any of the statistics gathering. It just presents the results.
Creating our own Statistics gathering script – Part 1
The attribute names in the above SQL query tell us which statistics functions within Oracle were used. We can replicate this statistics gathering task using the following script. There are four parts to this script. The first part gathers most of the common statistics for the attribute. The second and third parts calculate the Skewness and Kurtosis for the attribute. The final (fourth) part combines the first three parts and lists the outputs.
The one statistic function that we are not including at this point is the Histogram information. I will cover this in the next (second) blog post on statistics.
The following script has the data source table/view name included (MINING_DATA_BUILD_V) and the attribute we are going to use (AGE). You will need to modify this script to run it for each attribute.
WITH
basic_statistics AS (select (sum(CASE WHEN age IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent,
count(*) num_value,
count(distinct age) distinct_count,
(count(distinct age)/count(*))*100 distinct_percent,
avg(age) avg_value,
min(age) min_value,
max(age) max_value,
stddev(age) std_value,
stats_mode(age) mode_value,
variance(age) var_value
from mining_data_build_v),
skewness AS (select avg(SV) S_value
from (select power((age - avg(age) over ())/stddev(age) over (), 3) SV
from mining_data_build_v) ),
kurtosis AS (select avg(KV) K_value
from (select power((age - avg(age) over ())/stddev(age) over (), 4) KV
from mining_data_build_v) )
SELECT null_percent,
num_value,
distinct_percent,
avg_value,
min_value,
max_value,
std_value,
mode_value,
var_value,
S_value,
K_value
from basic_statistics,
skewness,
kurtosis;
Part 2 – Let's do it for all the attributes in a table
In the code above I’ve shown how you can gather the statistics for one particular attribute of one table. But in an analytics project you will want to gather the statistics on all the attributes.
What we can do is to take the code above and put it into a procedure. This procedure accepts a table name as input, loops through the attributes for this table and calculates the various statistics. The statistics are saved in a table called DATA_STATS (see below).
drop table data_stats;
create table DATA_STATS (
table_name VARCHAR2(30) NOT NULL,
column_name VARCHAR2(30) NOT NULL,
data_type VARCHAR2(106) NOT NULL,
data_length NUMBER,
data_precision NUMBER,
data_scale NUMBER,
num_records NUMBER,
null_percent NUMBER,
distinct_count NUMBER,
distinct_percent NUMBER,
avg_value NUMBER,
min_value NUMBER,
max_value NUMBER,
std_value NUMBER,
mode_value VARCHAR2(1000),
var_value NUMBER,
s_value NUMBER,
k_value NUMBER,
PRIMARY KEY (table_name, column_name));
This is one of the first things that I do when I start on a new project. I create the DATA_STATS table and run my procedure GATHER_DATA_STATS for each table that we will be using. By doing this it allows me to have a permanent record of the stats for each attribute and saves me time in having to rerun various stats at different points of the project. I can also use these stats to produce some additional stats or to produce some graphs.
Here is the code for the GATHER_DATA_STATS procedure.
CREATE OR REPLACE PROCEDURE gather_data_stats(p_table_name IN varchar2) AS

   cursor c_attributes (c_table_name varchar2)
   is SELECT table_name,
             column_name,
             data_type,
             data_length,
             data_precision,
             data_scale
      FROM user_tab_columns
      WHERE table_name = upper(c_table_name);

   v_sql   NUMBER;
   v_rows  NUMBER;
BEGIN
   dbms_output.put_line('Starting to gather statistics for '||upper(p_table_name)||' at '||to_char(sysdate,'DD-MON-YY HH24:MI:SS'));

   FOR r_att in c_attributes(p_table_name) LOOP
      --
      -- remove any previously generated stats
      --
      v_sql := DBMS_SQL.OPEN_CURSOR;
      DBMS_SQL.PARSE(v_sql, 'delete from DATA_STATS where table_name = '''||r_att.table_name||''' and column_name = '''||r_att.column_name||'''', DBMS_SQL.NATIVE);
      v_rows := DBMS_SQL.EXECUTE(v_sql);
      --dbms_output.put_line('delete from DATA_STATS where table_name = '''||r_att.table_name||''' and column_name = '''||r_att.column_name||'''');

      IF r_att.data_type = 'NUMBER' THEN
         dbms_output.put_line(r_att.table_name||' : '||r_att.column_name||' : '||r_att.data_type);
         --
         -- setup the insert statement and execute
         --
         v_sql := DBMS_SQL.OPEN_CURSOR;
         DBMS_SQL.PARSE(v_sql, 'insert into data_stats select '''||r_att.table_name||''', '''||r_att.column_name||''', '''||r_att.data_type||''', '||r_att.data_length||', '||nvl(r_att.data_precision,0)||', '||nvl(r_att.data_scale,0)||', count(*) num_value, (sum(CASE WHEN '||r_att.column_name||' IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent, count(distinct '||r_att.column_name||') distinct_count, (count(distinct '||r_att.column_name||')/count(*))*100 distinct_percent, avg('||r_att.column_name||') avg_value, min('||r_att.column_name||') min_value, max('||r_att.column_name||') max_value, stddev('||r_att.column_name||') std_value, stats_mode('||r_att.column_name||') mode_value, variance('||r_att.column_name||') var_value, null, null from '|| r_att.table_name, DBMS_SQL.NATIVE);
         v_rows := DBMS_SQL.EXECUTE(v_sql);
      ELSIF r_att.data_type IN ('CHAR', 'VARCHAR', 'VARCHAR2') THEN
         dbms_output.put_line(r_att.table_name||' : '||r_att.column_name||' : '||r_att.data_type);
         --
         -- We need to gather a smaller number of stats for the character attributes
         --
         v_sql := DBMS_SQL.OPEN_CURSOR;
         begin
            DBMS_SQL.PARSE(v_sql, 'insert into data_stats select '''||r_att.table_name||''', '''||r_att.column_name||''', '''||r_att.data_type||''', '||r_att.data_length||', '||nvl(r_att.data_precision,0)||', '||nvl(r_att.data_scale,0)||', count(*) num_value, (sum(CASE WHEN '||r_att.column_name||' IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent, count(distinct '||r_att.column_name||') distinct_count, (count(distinct '||r_att.column_name||')/count(*))*100 distinct_percent, null, null, null, null, stats_mode('||r_att.column_name||') mode_value, null, null, null from '|| r_att.table_name, DBMS_SQL.NATIVE);
            v_rows := DBMS_SQL.EXECUTE(v_sql);
            -- dbms_output.put_line('insert into data_stats select '''||r_att.table_name||''', '''||r_att.column_name||''', '''||r_att.data_type||''', '||r_att.data_length||', '||nvl(r_att.data_precision,0)||', '||nvl(r_att.data_scale,0)||', count(*) num_value, (sum(CASE WHEN '||r_att.column_name||' IS NULL THEN 1 ELSE 0 END)/COUNT(*))*100 null_percent, count(distinct '||r_att.column_name||') distinct_count, (count(distinct '||r_att.column_name||')/count(*))*100 distinct_percent, null, null, null, null, stats_mode('||r_att.column_name||') mode_value, null, null, null from '|| r_att.table_name);
         exception
            when others then
               dbms_output.put_line(v_rows);
         end;
      ELSE
         dbms_output.put_line('Unable to gather statistics for '||r_att.column_name||' with data type of '||r_att.data_type||'.');
      END IF;
   END LOOP;

   dbms_output.put_line('Finished gathering statistics for '||upper(p_table_name)||' at '||to_char(sysdate,'DD-MON-YY HH24:MI:SS'));

   commit;
END;
Then to run it for a table:
exec gather_data_stats('mining_data_build_v');
We can view the contents of the DATA_STATS table by executing the following in SQL*Plus or SQL Developer
select * from DATA_STATS;
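For example, to look at the stats gathered for one of our tables (a sketch; remember the table names are stored in upper case):

SELECT column_name,
       num_records,
       null_percent,
       distinct_percent,
       avg_value,
       std_value
FROM   data_stats
WHERE  table_name = 'MINING_DATA_BUILD_V'
ORDER  BY column_name;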
Clustering in Oracle Data Miner–Part 4
This is the fourth part of a five (5) part blog post on building and using Clustering in Oracle Data Miner. The following outlines the contents of each post in this series on Clustering.
- The first part we looked at what clustering features exist in ODM and how to setup the data that we will be using in the examples
- The second part will focus on how to build Clusters in ODM.
- The third post will focus on examining the clusters produced by ODM and how to apply the Clusters to new data using ODM.
- The fourth post will look at how you can build and evaluate a Clustering model using the ODM SQL and PL/SQL functions.
- The fifth and final post will look at how you can apply your Clustering model to new data using the ODM SQL and PL/SQL functions.
With Clustering we are trying to find hidden patterns in our data. Unlike classification, we are not directing the algorithms on what areas/attributes to focus on.
In my preceding post on Clustering in Oracle Data Miner I showed how you can use the Oracle Data Miner (ODM) tool, that is part of SQL Developer, to build a cluster model and to apply it to new data.
As I have shown in previous blog posts we can do all of the same steps using SQL and PL/SQL. In this blog post I will show you how to use these SQL and PL/SQL functions to generate a Clustering model and how you can evaluate the model produced.
Step 1 – What Cluster Models do we already have
Oracle comes with a number of data mining data dictionary views. We can use these to see what already exists in our schema. In the example below we will have a look to see the Cluster models that were produced in the Part 2 blog post.
column model_name format a20
column mining_function format a20
column algorithm format a20

SELECT model_name,
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM   ALL_MINING_MODELS
WHERE  mining_function = 'CLUSTERING';
We can also look at the model settings that ODM produced. We can look at the settings of the K-Means model. This model (in my case) is called CLUS_KM_1_25.
column setting_value format a20
select setting_name,
setting_value,
setting_type
from all_mining_model_settings
where model_name = 'CLUS_KM_1_25'
We can also look to see what attributes are used in the K-Mean model.
SELECT attribute_name,
attribute_type,
usage_type,
target
from all_mining_model_attributes
where model_name = 'CLUS_KM_1_25'
I will show you how to use this Clustering model or the one that we are about to generate in my next blog post.
Step 2 – Creating the Setting table
When creating an Oracle Data Mining model in SQL you need to create a Settings table. This will contain a record for each setting that you need to set for the algorithm you want to run. The settings table has a very basic structure and only consists of two columns. It is a good idea to create a separate Settings table for each algorithm or type of data mining you want to do. To create the settings table we run the following:
CREATE TABLE CLUSTER_SETTINGS (
Setting_Name VARCHAR2(30),
Setting_Value VARCHAR2(4000));
The next step is to define the settings we want to use. The first of these will be the actual algorithm you want to run. Oracle has two Clustering algorithms: algo_kmeans and algo_o_cluster. In the examples we are working through we are using the K-Means algorithm and we want to set the number of clusters to 10. We are also going to set the Automatic Data Preparation (ADP) on. Here is the code.
BEGIN
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES (dbms_data_mining.algo_name, dbms_data_mining.ALGO_KMEANS);
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES (dbms_data_mining.prep_auto, dbms_data_mining.PREP_AUTO_ON);
INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
VALUES (dbms_data_mining.clus_num_clusters, 10);
COMMIT;
END;
We can check to see that these records were inserted.
column setting_value format a25
select * from CLUSTER_SETTINGS;
The other settings will be dependent on what clustering algorithm is being used. Each one has its own settings. If you do not define any additional settings Oracle will use the in-built default settings.
To see what the default settings are:
column setting_value format a35
select *
from table(dbms_data_mining.get_default_settings)
where setting_name like 'KM%';
If you want to use a different value to the default, just create a new record in the CLUSTER_SETTINGS table with the new value.
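For example, to override the maximum number of K-Means iterations you could insert a record along these lines (a sketch; KMNS_ITERATIONS is one of the documented K-Means settings, and the value of 5 is just for illustration):

BEGIN
  INSERT INTO CLUSTER_SETTINGS (setting_name, setting_value)
  VALUES ('KMNS_ITERATIONS', '5');
  COMMIT;
END;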
Step 3 – Create the new Cluster Model
We have the algorithm settings ready, so the next steps is to create the Cluster model. We will be using the DBMS_DATA_MINING.CREATE_MODEL function and pointing it to our CLUSTER_SETTINGS table.
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'CLUSTER_KMEANS_MODEL',
mining_function => dbms_data_mining.clustering,
data_table_name => 'INSURANCE_CUST_LTV',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => null,
settings_table_name => 'CLUSTER_SETTINGS');
END;
This takes a few seconds to run on my laptop. When finished we can use queries like those we used in Step 1 above to find out the details of what was produced.
Step 4 – Examining the Cluster model
The first thing that we will need to check is that the model was created.
column model_name format a20
column mining_function format a15
column algorithm format a15

SELECT model_name,
       mining_function,
       algorithm,
       build_duration,
       model_size
FROM   ALL_MINING_MODELS
WHERE  mining_function = 'CLUSTERING';
One thing that is a bit curious is the model size. We have basically created two K-Means models using the same settings, but the final models have different sizes. See below for an explanation.
Now let's look at the algorithm settings for our new Cluster model (CLUSTER_KMEANS_MODEL).
select setting_name,
setting_value,
setting_type
from all_mining_model_settings
where model_name = 'CLUSTER_KMEANS_MODEL'
We can also look at the attributes used in the clusters.
SELECT attribute_name,
attribute_type,
usage_type,
target
from all_mining_model_attributes
where model_name = 'CLUSTER_KMEANS_MODEL'
If we compare this list back to the list of attributes that were part of the model that we developed in the ODM tool, you will see that we have some extra attributes listed for the new K-means model.
The reason for these extra attributes and the bigger size (although still small) will be due to us having applied sampling in the ODM tool. We took a subset of the data in the ODM tool and built the model based on this data. For the model we just created in SQL we took all the data (no sampling). That is why it included more attributes: it was looking at a much bigger data set.
Step 5 – Examining the Cluster Details
In this step we are going to look at some of the clusters produced by our new Cluster model.
We can produce the information relating to the clusters that were produced and what level in the hierarchy each cluster belongs to. We can use this information to create/draw our own cluster hierarchy diagram.
set lines 90
column child format a40
select id,
record_count,
parent,
tree_level,
child
from table(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL'))
To look at the Centroid details for a particular cluster e.g. Cluster ID = 7, we can run:
column attribute_name format a25
column mode_value format a25
select t.id,
c.attribute_name,
c.mean,
c.mode_value,
c.variance
from table (dbms_data_mining.get_model_details_KM('CLUSTER_KMEANS_MODEL')) t,
table(t.centroid) c
where t.id = 7
order by c.attribute_name
The results will contain details of the centroid for cluster number 7. The centroid details will include the attributes that were used to form the centroid and the values for each attribute.
The last thing we can look at is the rule for a cluster. The following query will give you the full details of the cluster rule for Cluster 7. This will give, for each attribute, the values or ranges of values that are used for that cluster. I’ll let you run this as it returns a large number of records. Run it and have a look at some of the output.
select t.id,
a.attribute_name,
a.conditional_operator,
nvl(a.attribute_str_value,
a.attribute_num_value) AS value,
a.attribute_support,
a.attribute_confidence
from TABLE(dbms_data_mining.get_model_details_km('CLUSTER_KMEANS_MODEL')) t,
TABLE(t.rule.antecedent) a
where t.id = 7
ORDER BY t.id, a.attribute_name, attribute_support, attribute_confidence desc, value;
My next blog post on Clustering in Oracle, will be the final one in the series. It will show you how to apply a Clustering model to new data in your database.