Data Science

Evaluating Classification Results

Posted on

When you are working on building classification models you will need some ways of measuring the effectiveness of each model that you will build. This measurement/evaluation is perform during the model build process.

Typically the model build process consists of 2 steps (I’m assuming all data preparation etc has been completed:

  • Build the model: During this step you will feed in a portion of your data set to the data mining algorithm. Typical this data will be a subset of your data set and will typically consist of 60% to 70% of the data. This data is used to by the data mining algorithm to build the model.
  • Test the model: After the model has been built you will need to test the model to see how efficient it is at making the predictions. This is where we use the data that was not used to build the model. For this data we already know the outcome. So after we have applied the model to this data subset we can measure the predicted values against the actual values.

Most of the data mining tools will automate these two steps, specifically the splitting the data into the build and test data sets. But if you are using a language like R, etc then you will need to manually perform these steps.

The most common way of collating the test results is to use the Confusion Matrix. This allows us to layout the correct predictions, the incorrect predictions and to perform a number of other statistical measurements.

True Positives

True Negatives

False Positives

False Negatives

The last two of the above values are also commonly referred to in statistics as Type 1 (false positive) and Type 2 (false negative) errors.

Depending on your project you will concentrate on a combination of the true and false values of either the Positives or the negatives.

For example, in Medical Diagnostics for cancer, you will be looking to keep the False Negatives to a minimum. This is where you have predicted someone does not have cancer, but actually does. The consequence of this is that the person is not brought back for addition testing and we all know what will happen. On the other hand it is OK to have a hight False Positive in this case. In this scenario you bring the person back for additional tests and discover that they are all clear 🙂

Precision = How many of the selected items are relevant? (as a percentage)

Recall = How many of the relevant items are selected? (as a percentage)

Accuracy = How many did we correctly predict? (as a percentage)

The following table illustrates these measurements and tests.

Confusion Matrix

There are lots of other statistical tests that can be performed on your results. Everyone will have their own preferences. What I have highlighted here are the main statistical test for you to look at.

You cannot use one or a few of the statistical tests to make a decision on what data mining model works best for your data. It is a combination of these statistical test, your understanding of the data and you understanding of the business project that need to be considered.

In my next 2 blog posts I will show you how you can perform these tests on the results generated by the Oracle Data Miner tool and then on the Oracle Data Miner models produced using PL/SQL.

Advertisements

my Oracle Data Miner Book

Posted on

Some of you may be aware that I have been writing a on Oracle Data Miner. Actually the book covers the Oracle Data Miner GUI that is part of SQL Developer, the SQL and PL/SQL functions, procedures and packages that form the Oracle Data Mining option in the database and lots of other topics for the DBA, Developer and BI/DW people.
Today is a bit day for this book as it is officially released and available for purchase. See below for some links to where you can but the book in print and e-book formats. It has been published by McGraw-Hill/Oracle Press.
The book is aimed at a variety of people and the aim of the book is to introduce them to using the Oracle Data Miner tool and how to perform various data mining and predictive analytics tasks using SQL and PL/SQL.
The book will not teach you about how each of the data mining algorithms works. There is a bit of an assumption that you know a bit about these already. There are lots of books and resources about that cover that material. You can look on my book as an getting start / how to use type of book.
Below are are the images of the front cover and the back cover.
Book Cover            Book Back Cover
For more details of the book and for some updates keep an eye on my ODM Book page. On this page I’m adding a FAQ secion. This will be based on questions that I receive about the book.
If you buy the book then I hope you will find it helpful. If you are going to attend one of my presentations at an Oracle User Group meeting then bring the book along and I can sign it for you. Alternatively if you are at Oracle Open World 2014, come along to the Oracle Press Book Store, as I will be there to sign books on Wednesdays 1st October between 13:00 and 13:30.
Where can you Buy my Oracle Data Miner book (print and e-book).
You can buy the book from the McGraw-Hill/Oracle Press website and from Amazon. Each site will offer discounts so check out which one is the best for you.
McGraw-Hill/Oracle Press
For USA locations (enter promo code Tierney to save 20% and free delivery) www.mhprofessional.com
For UK & Ireland locations (enter promo code Tierney to save 20% and free delivery) www.mcgraw-hill.co.uk/tpr
Amazon
Click here to buy it on www.amazom.com
Click here to but it on www.amazon.co.uk

Oracle Advanced Analytics and Oracle Fusion Apps

Posted on

At a recent Oracle User Group conference, I was part of a round table discussion on Apps and BI. Unfortunately most of the questions were focused on Apps and the new Fusion Applications from Oracle. I mentioned that there was data mining functionality (using the Oracle Advanced Analytics Option) built into the Fusion Apps, it seems to come as a surprise to the Apps people. They were not aware of this built in functionality and capabilities. Well Oracle Data Mining and Oracle Advanced Analytics has been built into the following Oracle Fusion Applications.

  • Oracle Fusion HCM Workforce Predictions
  • Oracle Fusion CRM Sales Prediction Engine
  • Oracle Spend Classification
  • Oracle Sales Prospector
  • Oracle Adaptive Access Manager

Oracle Data Mining and Oracle Advanced Applications are also being used in the following applications:

  • Oracle Airline Data Model
  • Oracle Communications Data Model
  • Oracle Retail Data Model
  • Oracle Security Governor for Healthcare

I intend to submit a presentation on this topic to future Oracle User Group conferences as a way of spreading the Advanced Analytics message within the Oracle user community. If you would like me to present on this topic at your conference or SIG drop me an email and we can make the necessary arrangement 🙂

The ORE Packages

Posted on

If you are interested in using ORE or just to get an idea of what does ORE give you that does not already exist in one of the other R packages then the table below lists the packages that come as part of ORE.

Before you can use then you will need to load these into your workspace. To do this you can issue the following command from the R prompt or from the prompt in RStudio.

> library(ORE)

RStudio is my preferred R interface and is widely used around the world. table,th,td { border:1px solid black; border-collapse:collapse }

ORE Installed Packages Description
ORE Oracle R Enterprise
OREbase ORE – base
OREdm The ORE functions that use the in-database Oracle Data Miner algorithms
OREeda The ORE functions used for exploratory data analysis
OREgraphics The ORE functions used for graphics
OREpredict The ORE functions used for model predictions
OREstats The ORE stats functions
ORExml The ORE functions that convert R objects to XML
DBI R Database Interface
ROracle OCI based Oracle database interface for R
XML Tools for parsing and generating XML within R and S-Plus.
bitops Functions for Bitwise operations
png Read and write PNG images

In addition to these core ORE packages, ORE also uses some R packages as part of the core ORE packages listed above. The following table lists the R packages that are used in the ORE packages. So make sure you have these packages installed. They should have come with your installation of R, but if something has happened then you can download them again.

table,th,td { border:1px solid black; border-collapse:collapse }

R Packages used by ORE Description
base The R Base Package
boot Bootstrap Functions (originally by Angelo Canty for S)
class Functions for Classification
cluster Cluster Analysis Extended Rousseeuw et al
codetools Code Analysis Tools for R
compiler The R Compiler Package
datasets The R Datasets Package
foreign Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ..
graphics The R Graphics Package
grDevices The R Graphics Devices and Support for Colours and Fonts
grid The Grid Graphics Package
KernSmooth Functions for kernel smoothing for Wand & Jones (1995)
lattice Lattice Graphics
MASS Support Functions and Datasets for Venables and Ripley’s MASS
Matrix Sparse and Dense Matrix Classes and Methods
methods Formal Methods and Classes
mgcv GAMs with GCV/AIC/REML smoothness estimation and GAMMs by PQL
nlme Linear and Nonlinear Mixed Effects Models

I’ve been using R a lot over the past few years and I’ve had a number of projects involving R particularly over the past 12 month. I just found out that I will now have another short duration R project in May and June.

So watch out for lots more blog posts on R and ORE. Plus the usual blog posts on using Oracle Data Mining. ORE and Oracle Data Mining are very closely linked.

Gartner 2014 Advanced Analytics Quadrant

Posted on

The Gartner 2014 Advanced Analytics Quadrant is out now. Well it is if you can find it.

Some of the companies have put it up on their websites to promote their position.

For some reason Oracle hasn’t and I wonder why?

Gartner Advanced Analytics MQ Feb2014

You can see that some typical technologies are missing from this, but this is to be expected. How much are companies really deploying these alternatives on real problems and in production. Perhaps the positioning of Revolution Analysis might be an indicator. At some point there might be a shift from investigative analysis into more main stream projects and then into production.

What is still evident from this years quadrant is that SAS and IBM (SPSS) still have very dominant positions and perhaps will have for some time to come.

It will be interesting how this will all play out over the next few years.

Using the in-database ODM algorithms in ORE

Posted on

Oracle R Enterprise is the version of R that Oracle has that runs in the database instead of on your laptop or desktop.

Oracle already has a significant number of data mining algorithms in the database. With ORE they have exposed these so that they can be easily called from your R (ORE) scripts.

To access these in-database data mining algorithms you will need to use the ore.odm package.

ORE is continually being developed with new functionality being added all the time. Over the past 2 years Oracle have released and updated version of ORE about every 6 months. ORE is generally not certified with the latest version of R. But is slightly behind but only a point or two of the current release. For example the current version of ORE 1.4 (released only last week) is certified for R version 3.0.1. But the current release of R is 3.0.3.

Will ORE work with the latest version of R? The simple answer is maybe or in theory it should, but is not certified.

Let’s get back to ore.dm. The following table maps the ore.odm functions to the in-database Oracle Data Mining functions.

ORE Function Oracle Data Mining Algorithm What Algorithm can be used for
ore.odmAI Minimum Description Length Attribute Importance
ore.odmAssocRules Apriori Association Rules
ore.odmDT Decision Tree Classification
ore.odmGLM Generalized Linear Model Classification and Regression
ore.odmKMeans k-Means Clustering
ore.odmNB Naïve Bayes Classification
ore.odmNMF Non-Negative Matrix Factorization Feature Extraction
ore.odmOC O-Cluster Clustering
ore.odmSVM Support Vector Machines Classification and Regression

table,th,td { border:1px solid black; border-collapse:collapse }

As you can see we only have a subset of the in-database Oracle Dat Miner algorithms. This is a pity really, but I’m sure as we get newer releases of ORE these will be added.

ODM: Changing the bar chart format in Explore Node

Posted on

In Oracle Data Miner you can use the Explore Node to gather an initial set of statistics for your dataset. As part of this you will also get a bar chart that shows the distributions of the values contained within each attribute. The following example shows the default layout of the bar charts. Explore1

These graphs a very useful for presenting the initial data exploration results from to your business users. In addition to these graphs you can also use the Graph node to give some additional graphical representations.

But the default bar chart that is produced by the Explore Node can appear to be a bit basic.

So what if we could change the layout to have a 3-D effect. People like 3-D bar charts.

Is this possible in Oracle Data Miner? If so then how can we do it?

Well it is possible and you can use the following steps to change your bar charts to 3-D.

To access the Explore Node settings go the the Tools menu and then select Preferences from the drop down menu.

Explore2

Then the Preferences window opens scroll down to the Data Miner option and expand the available options.

Explore3

The Explorer Data Viewer allows you to change the Precision settings. The section option is the Graphical Settings. You can change the Depth Radius setting. By default this is set to Zero. By increasing this value you can change the degree of the 3-D effect of the bar charts. You can also change the colour scheme too.

Explore4

I’m not a fan of the other colour schemes that are available and mu favourite is still the default Nautical. The following bar chart is the same as the one at the top of this post but has the 3-D effect.

Explore5