oracle big data
Oracle Open World is fast approaching. Over the past couple of weeks I have been using the schedule builder tool to work out what sessions I would like to attend. Unfortunately there are LOTS of sessions I would love to attend but I haven’t worked out a way to be in 10 places at the same time.
When attending a conference I try to achieve a number of things: find out about new topics/features, benchmark my knowledge of existing topics, try some of the hands-on labs, try something new and do something completely different. This will be my challenge at Oracle Open World.
There are a number of other people from Ireland who will be attending OOW, and there are some plans to have an Ireland social event. There are also lots of meetings/catch-ups planned with people I know from the virtual Oracle world.
There will be some people from AIB who will be presenting at OOW. Their presentation will be on the Tuesday morning 10:15-11:00. I’ll be there.
Other social events include the Oracle ACE dinner and the Oracle Music Festival, with Kings of Leon playing support to Pearl Jam on the Wednesday night.
It is going to be a busy week, an enjoyable week, a week of learning new things and finding out lots about what the 12c database will be like.
Will I get time to go to everything? The simple answer is NO. So I will just have to try to get to another Oracle Open World soon.
Here are the links to the 2 different sets of Big Data videos that Oracle have produced over the past 12 months:
Oracle Big Data Videos – Version 1
Oracle Big Data Videos – Version 2
Other videos include
The content catalog for Oracle Open World 2012 was made public during the week. OOW is on between 30th September and 4th October.
The following table gives a list of most of the Data Analytics type sessions that are currently scheduled.
Why did I pick these sessions? If I were able to go to OOW, these are the sessions I would like to attend. Yes, there would be many more sessions I would like to attend in the core DB technology and Development streams.
|Session||Speakers|
|CON6640 – Database Data Mining: Practical Enterprise R and Oracle Advanced Analytics||Husnu Sensoy|
|CON8688 – Customer Perspectives: Oracle Data Integrator||Gurcan Orhan – Software Architect & Senior Developer, Turkcell Technology R&D; Julien Testut – Product Manager, Oracle|
|HOL10089 – Oracle Big Data Analytics and R||George Lumpkin – Vice President, Product Management, Oracle|
|CON8655 – Tackling Big Data Analytics with Oracle Data Integrator||Mala Narasimharajan – Senior Product Marketing Manager, Oracle; Michael Eisterer – Principal Product Manager, Oracle|
|CON8436 – Data Warehousing and Big Data with the Latest Generation of Database Technology||George Lumpkin – Vice President, Product Management, Oracle|
|CON8424 – Oracle’s Big Data Platform: Settling the Debate||Martin Gubar – Director, Oracle; Kuassi Mensah – Director Product Management, Oracle|
|CON8423 – Finding Gold in Your Data Warehouse: Oracle Advanced Analytics||Charles Berger – Senior Director, Product Management, Data Mining and Advanced Analytics, Oracle|
|CON8764 – Analytics for Oracle Fusion Applications: Overview and Strategy||Florian Schouten – Senior Director, Product Management/Strategy, Oracle|
|CON8330 – Implementing Big Data Solutions: From Theory to Practice||Josef Pugh, Oracle|
|CON8524 – Oracle TimesTen In-Memory Database for Oracle Exalytics: Overview||Tirthankar Lahiri – Senior Director, Oracle|
|CON9510 – Oracle BI Analytics and Reporting: Where to Start?||Mauricio Alvarado – Principal Product Manager, Oracle|
|CON8438 – Scalable Statistics and Advanced Analytics: Using R in the Enterprise||Marcos Arancibia Coddou – Product Manager, Oracle Advanced Analytics, Oracle|
|CON4951 – Southwestern Energy’s Creation of the Analytical Enterprise||Jim Vick, Southwestern Energy; Richard Solari – Specialist Leader, Deloitte Consulting LLP|
|CON8311 – Mining Big Data with Semantic Web Technology: Discovering What You Didn’t Know||Zhe Wu – Consultant Member of Tech Staff, Oracle; Xavier Lopez – Director, Product Management, Oracle|
|CON8428 – Analyze This! Analytical Power in SQL, More Than You Ever Dreamt Of||Hermann Baer – Director Product Management, Oracle; Andrew Witkowski – Architect, Oracle|
|CON6143 – Big Data in Financial Services: Technologies, Use Cases, and Implications||Omer Trajman, Cloudera; Ambreesh Khanna – Industry Vice President, Oracle; Sunil Mathew – Senior Director, Financial Services Industry Technology, Oracle|
|CON8425 – Big Data: The Big Story||Jean-Pierre Dijcks – Sr. Principal Product Manager, Oracle|
|CON10327 – Recommendations in R: Scaling from Small to Big Data||Mark Hornick – Senior Manager, Oracle|
Download R : http://www.r-project.org/
R installation instructions : http://star-www.st-andrews.ac.uk/cran/
In a previous post I gave the details of how you can use Regression in Oracle Data Miner to predict/forecast the lean of the tower in future years. This was based on building a regression model in ODM using the known lean/tilt of the tower for a range of years.
In this post I will show you how you can do the same tasks using the Oracle Data Miner functions in SQL and PL/SQL.
Step 1 – Create the table and data
The easiest way to do this is to make a copy of the PISA table we created in the previous blog post. If you haven’t completed this, then go to the blog post and complete step 1 and step 2.
create table PISA_2
as select * from PISA;
Step 2 – Create the ODM Settings table
We need to create a ‘settings’ table before we can use the ODM APIs in PL/SQL. The purpose of this table is to store all the configuration parameters needed for the algorithm to work. In our case we only need to set two parameters.
delete from pisa_2_settings;
INSERT INTO PISA_2_settings (setting_name, setting_value) VALUES
INSERT INTO PISA_2_settings (setting_name, setting_value) VALUES
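The two INSERT statements above have lost their VALUES clauses, so the exact settings used are not recoverable from this page. As a rough sketch of what a two-row settings table typically looks like, the block below assumes the two parameters are the algorithm name and Automatic Data Preparation. The choice of Generalized Linear Model and the column sizes are my assumptions, not something stated in the post. The inserts are wrapped in a PL/SQL block because the DBMS_DATA_MINING package constants cannot be referenced from a plain SQL statement.

```sql
-- Sketch only: the algorithm (GLM) and the column sizes are assumptions.
-- Skip the CREATE TABLE if pisa_2_settings already exists.
CREATE TABLE pisa_2_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

BEGIN
  -- Setting 1 (assumed): which regression algorithm to use
  INSERT INTO pisa_2_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name,
          dbms_data_mining.algo_generalized_linear_model);

  -- Setting 2 (assumed): turn on Automatic Data Preparation
  INSERT INTO pisa_2_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto,
          dbms_data_mining.prep_auto_on);

  COMMIT;
END;
/
```

A quick `SELECT * FROM pisa_2_settings;` afterwards should show two rows, one per setting.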
Step 3 – Build the Regression Model
To build the regression model we need to use the CREATE_MODEL procedure that is part of the DBMS_DATA_MINING package. When calling this procedure we need to pass in the name of the model, the mining function to use, the source data, the settings table and the target column we are interested in.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name => 'PISA_REG_2',
    mining_function => dbms_data_mining.regression,
    data_table_name => 'pisa_2_build_v',
    case_id_column_name => null,
    target_column_name => 'tilt',
    settings_table_name => 'pisa_2_settings');
END;
/
After this we should have our regression model.
Step 4 – Query the Regression Model details
To find out what was produced in the previous step we can query the data dictionary.
where model_name like 'P%';
where model_name like 'P%';
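The two WHERE clauses above look like the tails of two separate queries whose SELECT and FROM lines did not survive. Which dictionary views the post queried is not shown, so the following is only a sketch using the USER_MINING_MODELS and USER_MINING_MODEL_SETTINGS views; the choice of views and columns is my assumption.

```sql
-- Assumed query 1: models owned by the current user whose names start with P
SELECT model_name, mining_function, algorithm, creation_date
FROM   user_mining_models
WHERE  model_name LIKE 'P%';

-- Assumed query 2: the settings recorded for those models
SELECT model_name, setting_name, setting_value, setting_type
FROM   user_mining_model_settings
WHERE  model_name LIKE 'P%';
```

The SETTING_TYPE column distinguishes settings supplied in the settings table (INPUT) from those the algorithm filled in by default (DEFAULT), which is a handy check that the settings table was picked up.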
Step 5 – Apply the Regression Model to new data
Our final step is to apply the model to our new data, i.e. the years for which we want to predict the lean/tilt.
SELECT year_measured, prediction(pisa_reg_2 using *)
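The SELECT above is missing its FROM clause, and the table holding the future years is not named on this page. A minimal sketch of the scoring step, assuming a hypothetical table PISA_2_APPLY with a single YEAR_MEASURED column and two illustrative years, might look like this:

```sql
-- Hypothetical table of years to score; the name and the years are
-- illustrative only and do not come from the original post.
CREATE TABLE pisa_2_apply (year_measured NUMBER(4));

INSERT INTO pisa_2_apply (year_measured) VALUES (2015);
INSERT INTO pisa_2_apply (year_measured) VALUES (2020);

-- Apply the regression model to each row using the PREDICTION SQL function
SELECT year_measured,
       PREDICTION(pisa_reg_2 USING *) AS predicted_tilt
FROM   pisa_2_apply
ORDER  BY year_measured;
```

`USING *` passes every column of the row to the model, so the column names in the apply table need to match the attribute names the model was built with.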
A few weeks ago I had a blog post called Domain Knowledge + Data Skills = Data Miner.
In that blog post I was saying that to be a Data Scientist all you needed was Domain Knowledge and some Data Skills, which included Data Mining.
The reality is that the skill set of a Data Scientist will be much larger. There is a saying ‘A jack of all trades and a master of none’. When it comes to being a data scientist you need to be a bit like this but perhaps a better saying would be ‘A jack of all trades and a master of some’.
I’ve put together the following diagram, which shows most of the skills, with an outer circle of more fundamental skills. It is this outer ring of skills that is fundamental to becoming a data scientist. The skills in the inner part of the diagram are ones where most people will have some experience in one or more of them. The other skills can be developed and learned over time, depending on the type of person you are.
Can we train someone to become a data scientist, or are they born to be a data scientist? It is a little bit of both really, but you need to have some of the fundamental skills and the right type of personality. The learning of the other skills should be easy(ish).
What do you think? Are there skills that I’m missing?
Over the past few weeks I have been talking to a lot of people who are looking at how data mining can be used in their organisation and for their projects, and to people who have been doing data mining for a long time.
What comes across from talking to the experienced people, and these people are not tied to a particular product, is that you need to concentrate on the business problem. Once you have this well defined then you can drill down to the deeper levels of the project. Some of these levels will include what data is needed (not what data you have), tools, algorithms, etc.
Statistics is only a very small part of a data mining project. Some people who have PhDs in statistics and work in data mining say they do not use, or very rarely use, their statistics skills.
Some quotes that I like are:
“Focus hard on Business Question and the relevant target variable that captures the essence of the question.” Dean Abbott PAW Conf April 2012
“Find me something interesting in my data is a question from hell. Analysis should be guided by business goals.” Colin Shearer PAW Conf Oct 2011
There have been a lot of blog posts and articles on what the key skills are for a Data Miner and the more popular Data Scientist. What is very clear from all of these is that you will spend most of your time looking at, examining, integrating, manipulating, preparing, standardising and formatting the data. It has been quoted that these tasks can take up 70% to 85% of a Data Miner’s/Data Scientist’s time. All of these tasks are commonly performed by database developers, and in particular by the developers and architects involved in Data Warehousing projects. The rest of the time is for running the data mining algorithms, examining the results, and yes, some stats too.
Very little time is spent developing algorithms! Why is this? Could it be that the algorithms are already developed (they have been around for a long time and are well tuned) and available in all the data mining tools? We can almost treat these algorithms as a black box. So one of the key abilities of a data miner/data scientist is to know what the algorithms can do, what kinds of problems they can be used for, what kinds of outputs they produce, etc.
Domain knowledge is important, no matter how little it is, in preparing for and being involved in a data mining project. As we define our business problem, the domain expert can bring their knowledge to the problem, which allows us to separate the domain related problems from the data related problems. So the domain expertise is critical at the start of a project, but it is also critical when we have the outputs from the data mining algorithms. We can use the domain knowledge to tie the outputs from the data mining algorithms back to the original business problem and give them real meaning.
So what is the formula of skill sets for a data miner or data scientist? Well, it is a little like the title of this blog post:
Domain Knowledge + Data Skills + Data Mining Skills + a little bit of Machine Learning + a little bit of Stats = a Data Miner / Data Scientist
There are a number of Oracle Advanced Analytics and related topics taking place this week at COLLABORATE12 in Las Vegas (http://collaborate12.com).
|Date||Time||Session||Presenter|
|Sun 22nd||9:00-15:00||Oracle Business Intelligence Application Journey|| |
|Mon 23rd||9:45-10:45||Managing Unstructured Data using Hadoop, Oracle 11g and Oracle Exadata Database Machine||Jim Steiner|
|Mon 23rd||9:45-10:45||Environmental Data Management and Analytics-a Real World Perspective||Angela Miller|
|Mon 23rd||11:00-12:00||Public Safety and Environmental Real-Time Analytics using Oracle Business Intelligence||Raghav Venkat|
|Mon 23rd||11:00-12:00||BI is more than slice and dice||Peter Scott|
|Mon 23rd||14:30-15:30||In-Database Analytics: Predictive Analytics, Data Mining, Exadata & Business Intelligence||Jacek Myczkowski|
|Mon 23rd||15:45-16:45||Big Data Analytics, R you ready||Mark Hornick|
|Tues 24th||10:45-11:45||BI Analytics and Oracle NoSQL. The Future of Now||Manish Khera|
|Wed 25th||8:15-9:15||Oracle Data Mining – A Component of the Oracle Advanced Analytics Option-Hands-on Lab||Charlie Berger|
|Wed 25th||9:30-10:30||Oracle R Enterprise – A Component of the Oracle Advanced Analytics Option-Hands-on Lab||Mark Hornick|
Here are the abstracts from the two main Oracle Advanced Analytics presentations by Charlie Berger and Mark Hornick
Oracle Data Mining – A Component of the Oracle Advanced Analytics Option
This Hands-on Lab provides an introduction to Oracle Data Mining and the Oracle Data Miner GUI.
Oracle Data Mining (ODM), now part of Oracle Advanced Analytics, provides an extensive set of in-database data mining algorithms that solve a wide range of business problems. It can predict customer behavior, detect fraud, analyze market baskets, segment customers, and mine text to extract sentiments. ODM provides powerful data mining algorithms that run as native SQL functions for in-database model building and model deployment. There is no need for the time delays and security risks of data movement.
The free Oracle Data Miner GUI is an extension to Oracle SQL Developer 3.1 that enables data analysts to work directly with data inside the database, explore the data graphically, build and evaluate multiple data mining models, apply ODM models to new data, and deploy ODM’s predictions and insights throughout the enterprise. Oracle Data Miner work flows capture and document the user’s analytical methodology and can be saved and shared with others to automate advanced analytical methodologies.
Oracle R Enterprise – A Component of the Oracle Advanced Analytics Option
This Hands-on Lab provides an introduction to Oracle R Enterprise.
Oracle R Enterprise, a part of the Oracle Advanced Analytics Option, makes the open source R statistical programming language and environment ready for the enterprise by integrating R with Oracle Database. R users can interactively and transparently execute R scripts for statistical and graphical analyses on data stored in Oracle Database. R scripts can be executed in Oracle Database using potentially multiple database-managed R engines – resulting in data parallel execution. ORE also provides a rich set of statistical functions and advanced analytics techniques.
In this lab, attendees will be introduced to Oracle’s strategy for R, including the Oracle R Distribution, Oracle R Enterprise (ORE), and Oracle R Connector for Hadoop (ORCH). We will focus on Oracle R Enterprise with hands-on exercises exploring the transparency layer, embedded R execution, and statistics engine.
Here is a selection of videos and websites on Data Visualisations.
Hans Rosling videos of his TED talks
- World Population Growth
- Global Population Growth (TED)
- Asia’s Rise – How and When
- HIV: New facts and stunning data visuals
- Video for the BBC
Charlie Berger (Sr. Director Product Management, Data Mining & Advanced Analytics) has produced a video based on a recent presentation called ‘Oracle Advanced Analytics: Oracle R Enterprise & Oracle Data Mining’.
This is a 1 hour video, including some demos, of product background, product features, recent developments and new additions, examples of how Oracle is including Oracle Data Mining into their fusion applications, etc.
Oracle has 2 data mining products: the main in-database Oracle Data Mining, and the more recent extensions to R that give us Oracle R Enterprise.
Check out the video – Click here.
Check out Charlie’s blog at https://blogs.oracle.com/datamining/
Oracle University : 2 Day Oracle Data Mining training course
In a previous blog post I explained what attribute importance is and how it can be used in the Oracle Data Miner tool (click here to see blog post).
In this post I want to show you how to perform the same task using the ODM PL/SQL API.
The ODM tool makes extensive use of the Automatic Data Preparation (ADP) function. ADP performs some data transformations such as binning, normalization and outlier treatment of the data based on the requirements of each of the data mining algorithms. In addition to these transformations we can specify our own transformations. We do this by creating a settings table which will contain the settings and transformations we want the data mining algorithm to perform on the data.
ADP is automatically turned on when using the ODM tool in SQL Developer. This is not the case when using the ODM PL/SQL API. So before we can run the Attribute Importance function we need to turn on ADP.
Step 1 – Create the setting table
CREATE TABLE Att_Import_Mode_Settings (
Step 2 – Turn on Automatic Data Preparation
INSERT INTO Att_Import_Mode_Settings (setting_name, setting_value)
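The CREATE TABLE and INSERT statements in Steps 1 and 2 are cut short on this page. Since Step 2 is about turning on ADP, a plausible completion is sketched below. The column sizes are my assumption, and the insert is wrapped in a PL/SQL block so that the DBMS_DATA_MINING package constants can be referenced.

```sql
-- Step 1 (sketch): settings table; the column sizes are an assumption
CREATE TABLE Att_Import_Mode_Settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

-- Step 2 (sketch): turn on Automatic Data Preparation
BEGIN
  INSERT INTO Att_Import_Mode_Settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
  COMMIT;
END;
/
```

No algorithm setting is included here, on the assumption that the database default for the attribute importance mining function is acceptable.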
Step 3 – Run Attribute Importance
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name => 'Attribute_Importance_Test',
    mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
    data_table_name => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name => 'affinity_card',
    settings_table_name => 'Att_Import_Mode_Settings');
END;
/
Step 4 – Select Attribute Importance results
ORDER BY RANK;
ATTRIBUTE_NAME        IMPORTANCE_VALUE  RANK
--------------------  ----------------  ----
HOUSEHOLD_SIZE        .158945397           1
CUST_MARITAL_STATUS   .158165841           2
YRS_RESIDENCE         .094052102           3
EDUCATION             .086260794           4
AGE                   .084903512           5
OCCUPATION            .075209339           6
Y_BOX_GAMES           .063039952           7
HOME_THEATER_PACKAGE  .056458722           8
CUST_GENDER           .035264741           9
BOOKKEEPING_APPLICAT  .019204751          10
CUST_INCOME_LEVEL     0                   11
BULK_PACK_DISKETTES   0                   11
OS_DOC_SET_KANJI      0                   11
PRINTER_SUPPLIES      0                   11
COUNTRY_NAME          0                   11
FLAT_PANEL_MONITOR    0                   11
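Only the ORDER BY line of the Step 4 query survives on this page. One way to produce a listing with these three columns is the GET_MODEL_DETAILS_AI table function in DBMS_DATA_MINING. Whether the post used exactly this query is an assumption on my part.

```sql
-- Sketch: list the ranked attributes for the attribute importance model
SELECT attribute_name,
       importance_value,
       rank
FROM   TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('Attribute_Importance_Test'))
ORDER  BY rank;
```

Attributes with an importance value of 0 share the same rank, which is why the last six rows in the listing all show rank 11. In more recent database releases I believe the model detail views are the preferred way to read these results, so check the documentation for your version.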
Oracle R Enterprise (ORE) was officially launched over the past couple of days and it has been receiving a lot of interest in the press.
We now have the Oracle Advanced Analytics (OAA) option, which comprises the already existing Oracle Data Mining and now Oracle R Enterprise. In addition to the Oracle Advanced Analytics option we also have two free sets of tools available to use. The first of these is the set of statistical functions that are available in all versions of the Oracle Database, and the second is the Oracle Data Miner tool that is part of the newly released SQL Developer 3.1 (7th Feb).
What has Oracle done to R to make Oracle R Enterprise?
One of the main challenges with using R is that it is memory constrained, which limits the amount of data that it can process. So the ORE development team have worked on ensuring R can work transparently with data within the database. This removes the need to extract the data from the database before it can be used by R. We still get all the advantages of in-database data mining.
They have also embedded R functions within the database, so we can run R code on data within the database. Having these functions within the database allows R to use the database parallelism, and so we get quicker execution of our code. Most R implementations are constrained to processing datasets containing 100Ks of records. With ORE we can now process 10M+ records.
In addition to the ORE functions and algorithms that are embedded in the database we can also use the R code to call the suite of data mining algorithms that already exist as part of Oracle Data Miner.
For more details of what Oracle R Enterprise is all about check out the following links.