data mining

Using ODM Regression for the Leaning Tower of Pisa tilt problem

Posted on

This blog post will look at how you can use the Regression feature in Oracle Data Miner (ODM) to predict the lean/tilt of the Leaning Tower of Pisa in the future.

This is a well know regression exercise, and it typically comes with a set of know values and the year for these values. There are lots of websites that contain the details of the problem. A summary of it is:

The following table gives measurements for the years 1975-1985 of the “lean” of the Leaning Tower of Pisa. The variable “lean” represents the difference between where a point on the tower would be if the tower were straight and where it actually is. The data is coded as tenths of a millimetre in excess of 2.9 meters, so that the 1975 lean, which was 2.9642.

Given the lean for the years 1975 to 1985, can you calculate the lean for a future date like 200, 2009, 2012.

Step 1 – Create the table

Connect to a schema that you have setup for use with Oracle Data Miner. Create a table (PISA) with 2 attributes, YEAR_MEASURED and TILT. Both of these attributes need to have the datatype of NUMBER, as ODM will ignore any of the attributes if they are a VARCHAR or you might get an error.

CREATE TABLE PISA
  (
    YEAR_MEASURED NUMBER(4,0),
    TILT          NUMBER(9,4)
);

Step 2 – Insert the data

There are 2 sets of data that need to be inserted into this table. The first is the data from 1975 to 1985 with the known values of the lean/tilt of the tower. The second set of data is the future years where we do not know the lean/tilt and we want ODM to calculate the value based on the Regression model we want to create.

Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1975,2.9642);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1976,2.9644);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1977,2.9656);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1978,2.9667);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1979,2.9673);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1980,2.9688);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1981,2.9696);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1982,2.9698);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1983,2.9713);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1984,2.9717);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1985,2.9725);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1986,2.9742);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1987,2.9757);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1988,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1989,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1990,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1995,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2000,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2005,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2010,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2009,null);

Step 3 – Start ODM and Prepare the data

Open SQL Developer and open the ODM Connections tab. Connect to the schema that you have created the PISA table in. Create a new Project or use an existing one and create a new Workflow for your PISA ODM work.

Create a Data Source node in the workspace and assign the PISA table to it. You can select all the attributes..

The table contains the data that we need to build our regression model (our training data set) and the data that we will use for predicting the future lean/tilt (our apply data set).

We need to apply a filter to the PISA data source to only look at the training data set. Select the Filter Rows node and drag it to the workspace. Connect the PISA data source to the Filter Rows note. Double click on the Filter Row node and select the Expression Builder icon. Create the where clause to select only the rows where we know the lean/tilt.

image

image

Step 4 – Create the Regression model

Select the Regression Node from the Models component palette and drop it onto your workspace. Connect the Filter Rows node to the Regression Build Node.

image

Double click on the Regression Build node and set the Target to the TILT variable. You can leave the Case ID at .  You can also select if you want to build a GLM or SVM regression model or both of them. Set the AUTO check box to unchecked. By doing this Oracle will not try to do any data processing or attribute elimination.

image

You are now ready to create your regression models.

To do this right click the Regression Build node and select Run. When everything is finished you will get a little green tick on the top right hand corner of each node.

image

Step 5 – Predict the Lean/Tilt for future years

The PISA table that we used above, also contains our apply data set

image

We need to create a new Filter Rows node on our workspace. This will be used to only look at the rows in PISA where TILT is null.  Connect the PISA data source node to the new filter node and edit the expression builder.

image

Next we need to create the Apply Node. This allows us to run the Regression model(s) against our Apply data set. Connect the second Filter Rows node to the Apply Node and the Regression Build node to the Apply Node.

image

Double click on the Apply Node.  Under the Apply Columns we can see that we will have 4 attributes created in the output. 3 of these attributes will be for the GLM model and 1 will be for the SVM model.

Click on the Data Columns tab and edit the data columns so that we get the YEAR_MEASURED attribute to appear in the final output.

Now run the Apply node by right clicking on it and selecting Run.

Step 6 – Viewing the results

Where we get the little green tick on the Apply node we know that everything has run and completed successfully.

image

To view the predictions right click on the Apply Node and select View Data from the menu.

image

We can see the the GLM mode gives the results we would expect but the SVM does not.

Data Science Is Multidisciplinary

Posted on

A few weeks ago I had a blog post called Domain Knowledge + Data Skills = Data Miner.

In that blog post I was saying that to be a Data Scientist all you needed was Domain Knowledge and some Data Skills, which included Data Mining.

The reality is that the skill set of a Data Scientist will be much larger. There is a saying ‘A jack of all trades and a master of none’. When it comes to being a data scientist you need to be a bit like this but perhaps a better saying would be ‘A jack of all trades and a master of some’.

I’ve put together the following diagram, which includes most of the skills with an out circle of more fundamental skills. It is this outer ring of skills that are fundamental in becoming a data scientist. The skills in the inner part of the diagram are skills that most people will have some experience in one or more of them. The other skills can be developed and learned over time, all depending on the type of person you are.

image

Can we train someone to become a data scientist or are they born to be a data scientist. It is a little bit of both really but you need to have some of the fundamental skills and the right type of personality. The learning of the other skills should be easy(ish)

What do you think?  Are their Skill that I’m missing?

VM for Oracle Data Miner

Posted on

Recently the OTN team have updated the ‘Database App Development’  Developer Day virtual machine to include Oracle 11.2.0.2 DB and SQL Developer 3.1.  This is all you need to try out Oracle Data Miner.

So how do you get started with using Oracle Data Miner on your PC. The first step is to download and install the latest version of Oracle VirtualBox.

image

The next step is to download and install the OTN Developer Day appliance. Click on the above link to go to the webpage and follow the instructions to download and install the appliance. Download the first appliance on this page ‘Database App Development’ VM. This is a large download and depending on your internet connection it can take anything from 30 minutes to hours. So I wouldn’t recommend doing this over a wifi.

When you start up the VM your OS username and password is oracle. Yes it is case sensitive.

When the get logged into the VM you can close or minimise the host window

image

There are two important icons, the SQL Developer and the ODDHandsOnLab.html icons.

The ODDHandsOnLab.html icon loads a webpage what contains a number of tutorials for you to follow.

image

The tutorial we are interest in is the Oracle Data Miner Tutorial. There are 4 tutorials given for ODM. The first two tutorials need to be followed in the order that they are given. The second two tutorials can be done in any order.

If you have not used SQL Developer before then you should work through this tutorial before starting the Oracle Data Miner tutorials.

The first tutorial takes you through the steps needed to create your ODM schema and to create the ODM repository within the database. This tutorial will only take you 10 to 15 minutes to complete.

In the second tutorial you get to use the ODM to build your first ODM model. This tutorial steps your through how to get started with an ODM project, workflow, the different ODM features, how to explore the data, how to create classification models, how to explore the model and then how to apply one of these models to new data. This second tutorial will take approx. 30 to 40 minutes to complete.

It is all very simple and easy to use.

Domain Knowledge + Data Skills = Data Miner

Posted on

Over the past few weeks I have been talking to a lot of people who are looking at how data mining can be used in their organisation, for their projects and to people who have been doing data mining for a log time.

What comes across from talking to the experienced people, and these people are not tied to a particular product, is that you need to concentrate on the business problem. Once you have this well defined then you can drill down to the deeper levels of the project. Some of these levels will include what data is needed (not what data you have), tools, algorithms, etc.

Statistics is only a very small part of a data mining project. Some people who have PhDs in statistics who work in data mining say you do not use or very rarely use their statistics skills.

Some quotes that I like are:

Focus hard on Business Question and the relevant target variable that captures the essence of the question.” Dean Abbott PAW Conf April 2012

Find me something interesting in my data is a question from hell. Analysis should be guided by business goals.” Colin Shearer PAW Conf Oct 2011

There has need a lot of blog posting and articles on what are the key skills for a Data Miner and the more popular Data Scientist. What is very clear from all of these is that you will spend most of your time looking at, examining, integrating, manipulating, preparing, standardising and formatting the data. It has been quoted that all of these tasks can take up to 70% to 85% of a Data Mining/Data Scientist time. All of these tasks are commonly performed by database developers and in particular the developers and architects involved in Data Warehousing projects. The rest of the time for the running of the data mining algorithms, examining the results, and yes some stats too.

Every little time is spent developing algorithms!!! Why is this ? Would it be that the algorithms are already developed (for a long time now and are well turned) and available in all the data mining tools. We can almost treat these algorithms as a black box. So one of the key abilities of a data miner/data scientist would be to know what the algorithms can do, what kind of problems they can be used for, know what kind of outputs they produce, etc.

Domain knowledge is important, no matter how little it is, in preparing for and being involved in a data mining project. As we define our business problem the domain expert can bring their knowledge to the problem and allows us separate the domain related problems from the data related problems. So the domain expertise is critical at that start of a project, but the domain expertise is also critical when we have the outputs from the data mining algorithms. We can use the domain knowledge to tied the outputs from the data mining algorithms back to the original problem to bring real meaning to the original business problem we are working on.

So what is the formula of skill sets for a data mining or data scientist. Well it is a little like the title of this blog;

Domain Knowledge + Data Skills + Data Mining Skills + a little bit of Machine Learning + a little bit of Stats = a Data Miner / Data Scientist

2 Day Oracle Data Miner course material

Posted on

Last week I managed to get my hands on the training material for the 2 Day Oracle Data Miner course. This course is run by Oracle University.

Many thanks to Michael O’Callaghan who is a BI Sales person here in Ireland and Oracle University, for arranging this.

The 2 days are pretty packed with a mixture of lecture type material, lots of hands on exercises and some time for open discussions. In particular, day 2 will be very busy day.

Check out the course outline and published schedule – click here

You can have this course on site at your organisation. If this is something that interests you then contact your Oracle University account manager. There is also the traditional face-to-face delivery and the newer online delivery, where people from around the world come together for the online class.

Oracle Advanced Analytics Video by Charlie Berger

Posted on

Charlie Berger (Sr. Director Product Management, Data Mining & Advanced Analytics) as produced a video based on a recent presentation called ‘Oracle Advanced Analytics: Oracle R Enterprise & Oracle Data Mining’.

This is a 1 hour video, including some demos, of product background, product features, recent developments and new additions, examples of how Oracle is including Oracle Data Mining into their fusion applications, etc.

Oracle has 2 data mining products, with main in-database Oracle Data Mining and the more recent extensions to R to give us Oracle R Enterprise.

Check out the video – Click here.

Check out Charlie’s blog at https://blogs.oracle.com/datamining/

Oracle University : 2 Day Oracle Data Mining training course

2 Day Oracle Data Miner training course by Oracle University

Posted on

In the past few days Oracle University has advertised a new 2 Day instructor led training course on Oracle Data Miner.

There are no advertised dates or locations for this course yet. I suppose it will depend on the level of interest in the product.

There is the overview from the Oracle University webpage

In this course, students review the basic concepts of data mining and learn how leverage the predictive analytical power of the Oracle Database Data Mining option by using Oracle Data Miner 11g Release 2. The Oracle Data Miner GUI is an extension to Oracle SQL Developer 3.0 that enables data analysts to work directly with data inside the database.

The Data Miner GUI provides intuitive tools that help you to explore the data graphically, build and evaluate multiple data mining models, apply Oracle Data Mining models to new data, and deploy Oracle Data Mining’s predictions and insights throughout the enterprise. Oracle Data Miner’s SQL APIs automatically mine Oracle data and deploy results in real-time. Because the data, models, and results remain in the Oracle Database, data movement is eliminated, security is maximized and information latency is minimized

Click on the following link to access the details of the training course

http://education.oracle.com/pls/web_prod-plq-dad/db_pages.getCourseDesc?dc=D73528GC10

To view a PDF of the course details – click here

ODM–Attribute Importance using PL/SQL API

Posted on

In a previous blog post I explained what attribute importance is and how it can be used in the Oracle Data Miner tool (click here to see blog post).

In this post I want to show you how to perform the same task using the ODM PL/SQL API.

The ODM tool makes extensive use of the Automatic Data Preparation (ADP) function. ADP performs some data transformations such as binning, normalization and outlier treatment of the data based on the requirements of each of the data mining algorithms. In addition to these transformations we can specify our own transformations.  We do this by creating a setting tables which will contain the settings and transformations we can the data mining algorithm to perform on the data.

ADP is automatically turned on when using the ODM tool in SQL Developer. This is not the case when using the ODM PL/SQL API. So before we can run the Attribute Importance function we need to turn on ADP.

Step 1 – Create the setting table

CREATE TABLE Att_Import_Mode_Settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));

Step 2 – Turn on Automatic Data Preparation

BEGIN
   INSERT INTO Att_Import_Mode_Settings (setting_name, setting_value)
   VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on);
  COMMIT;
END;

Step 3 – Run Attribute Importance

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name => ‘Attribute_Importance_Test’,
    mining_function  => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
    data_table_name  > ‘mining_data_build_v’,
    case_id_column_name => ‘cust_id’,
    target_column_name  => ‘affinity_card’,
    settings_table_name => ‘Att_Import_Mode_Settings’);
END;

Step 4 – Select Attribute Importance results

SELECT *
FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI(‘Attribute_Importance_Test’))
ORDER BY RANK;

ATTRIBUTE_NAME       IMPORTANCE_VALUE       RANK
——————– —————- ———-
HOUSEHOLD_SIZE             .158945397          1
CUST_MARITAL_STATUS        .158165841          2
YRS_RESIDENCE              .094052102          3
EDUCATION                  .086260794          4
AGE                        .084903512          5
OCCUPATION                 .075209339          6
Y_BOX_GAMES                .063039952          7
HOME_THEATER_PACKAGE       .056458722          8
CUST_GENDER                .035264741          9
BOOKKEEPING_APPLICAT       .019204751         10
ION

CUST_INCOME_LEVEL                   0         11
BULK_PACK_DISKETTES                 0         11
OS_DOC_SET_KANJI                    0         11
PRINTER_SUPPLIES                    0         11
COUNTRY_NAME                        0         11
FLAT_PANEL_MONITOR                  0         11

What has Oracle done to R to give us ORE

Posted on

Oracle R Enterprise (ORE) was officially launched over the past couple of days and it has been receiving a lot of interest in the press.

We now have the Oracle Advanced Analytics (OAA) option which comprises, the already existing, Oracle Data Mining and now Oracle R Enterprise. In addition to the Oracle Advanced Analytics option we also 2 free set of tools available to use to use. The first of these free tools are the statistical functions which are available in all versions of the Oracle Database and the second free tool is the Oracle Data Miner tool that is part of the newly released SQL Developer 3.1 (7th Feb).

What has Oracle done to Oracle to make Oracle R Enterprise ?

The one of the main challenges with using R is that it is memory constrained, resulting in the amount of data that it can process. So the ORE development team have worked ensuring R can work transparently with data within the database. This removes the need extract the data from the database before it can be used by R. We still get all the advanced on in-Database Data Mining.

They have also embedded R functions within the database, so we an run R code on data within the database. By having these functions with the database, this allows R to use the database parallelism and so we get quicker execution of our code. Most R implementation are constrained to being able to process dataset containing 100Ks of records. With ORE we can now process 10M+ records

In addition to the ORE functions and algorithms that are embedded in the database we can also use the R code to call the suite of data mining algorithms that already exist as part of Oracle Data Miner.

For more details of what Oracle R Enterprise is all about check out the following links.

Oracle Advanced Analytics Options website

ORE Webpage

ORE Blog

ORE Download

ORE Forum

ODM 11gR2–Attribute Importance

Posted on

I had a previous blog post on Data Exploration using Oracle Data Miner 11gR2. This blog post builds on the steps illustrated in that blog post.

After we have explored the data we can identity some attributes/features that have just one value or mainly one value, etc.  In most of these cases we know that these attributes will not contribute to the model build process.

In our example data set we have a small number of attributes. So it is easy to work through the data and get a good understanding of some of the underlying information that exists in the data. Some of these were pointed out in my previous blog post.

The reality is that our data sets can have a large number of attributes/features. So it will be very difficult or nearly impossible to work through all of these to get a good understanding of what is a good attribute to use, and keep in our data set, or what attribute does not contribute and should be removed from the data set.

Plus as our data evolves over time, the importance of the attributes will evolve with some becoming less important and some becoming more important.

The Attribute Importance node in Oracle Data Miner allows use to automate this work for us and can save us many hours or even days, in our work on this task.

The Attribute Importance node using the Minimum Description Length algorithm.

The following steps, builds on our work in my previous post, and shows how we can perform Attribute Importance on our data.

1. In the Component Palette, select Filter Columns from the Transforms list

2. Click on the workflow beside the data node.

3. Link the Data Node to the Filter Columns node. Righ-click on the data node, select Connect, move the mouse to the Filter Columns node and click. the link will be created

image

4. Now we can configure the Attribute Importance settings.Click on the Filter Columns node. In the Property Inspector, click on the Filters tab.

– Click on the Attribute Importance Checkbox

– Set the Target Attribute from the drop down list. In our data set this is Affinity Card

5. Right click the Filter Columns node and select Run from the menu

After everything has run, we get the little green box with the tick mark on the Filter Column node. To view the results we right clicking on the Filter Columns node and select View Data from the menu. We get the list of attributes listed in order of importance and their Importance measure.

image

We see that there are a number of attributes that have a zero value. It algorithm has worked out that these attributes would not be used in the model build step. If we look back to the previous blog post, some of the attributes we identified in it have also been listed here with a zero value.

ODM 11gR2–Real-time scoring of data

Posted on

In my previous posts I gave sample code of how you can use your ODM model to score new data.

Applying an ODM Model to new data in Oracle – Part 2

Applying an ODM Model to new data in Oracle – Part 1

The examples given in this previous post were based on the new data being in a table.

In some scenarios you may not have the data you want to score in table. For example you want to score data as it is being recorded and before it gets committed to the database.

The format of the command to use is

prediction(ODM_MODEL_NAME USING )

prediction_probability(ODM_Model_Name, Target Value, USING )

So we can list the model attributes we want to use instead of using the USING *  as we did in the previous blog posts

Using the same sample data that I used in my previous posts the command would be:

Select prediction(clas_decision_tree
USING
20 as age,
‘NeverM’ as cust_marital_status,
‘HS-grad’ as education,
1 as household_size,
2 as yrs_residence,
1 as y_box_games) as scored_value
from dual;

SCORED_VALUE
————
           0

Select prediction_probability(clas_decision_tree, 0
USING
20 as age,
‘NeverM’ as cust_marital_status,
‘HS-grad’ as education,
1 as household_size,
2 as yrs_residence,
1 as y_box_games) as probability_value
from dual;

PROBABILITY_VALUE
—————–
                1

So we get the same result as we got in our previous examples.

Depending of what data we have gathered we may or may not have all the values for each of the attributes used in the model. In this case we can submit a subset of the values to the function and still get a result.

Select prediction(clas_decision_tree
USING
20 as age,
‘NeverM’ as cust_marital_status,
‘HS-grad’ as education) as scored_value2
from dual;

SCORED_VALUE2
————-
            0

Select prediction_probability(clas_decision_tree, 0
USING
20 as age,
‘NeverM’ as cust_marital_status,
‘HS-grad’ as education) as probability_value2
from dual;

PROBABILITY_VALUE2
——————
                 1

Again we get the same results.