Here are the links to the two sets of Big Data videos that Oracle have produced over the past 12 months:
Oracle Big Data Videos – Version 1
Oracle Big Data Videos – Version 2
Other videos include
The content catalog for Oracle Open World 2012 was made public during the week. OOW is on between 30th September and 4th October.
The following table gives a list of most of the Data Analytics type sessions that are currently scheduled.
Why did I pick these sessions? If I were able to go to OOW, these are the sessions I would like to attend. Yes, there would be many more sessions I would like to attend in the core DB technology and Development streams.
|CON6640 – Database Data Mining: Practical Enterprise R and Oracle Advanced Analytics|Husnu Sensoy|
|CON8688 – Customer Perspectives: Oracle Data Integrator|Gurcan Orhan – Software Architect & Senior Developer, Turkcell Technology R&D; Julien Testut – Product Manager, Oracle|
|HOL10089 – Oracle Big Data Analytics and R|George Lumpkin – Vice President, Product Management, Oracle|
|CON8655 – Tackling Big Data Analytics with Oracle Data Integrator|Mala Narasimharajan – Senior Product Marketing Manager, Oracle; Michael Eisterer – Principal Product Manager, Oracle|
|CON8436 – Data Warehousing and Big Data with the Latest Generation of Database Technology|George Lumpkin – Vice President, Product Management, Oracle|
|CON8424 – Oracle’s Big Data Platform: Settling the Debate|Martin Gubar – Director, Oracle; Kuassi Mensah – Director Product Management, Oracle|
|CON8423 – Finding Gold in Your Data Warehouse: Oracle Advanced Analytics|Charles Berger – Senior Director, Product Management, Data Mining and Advanced Analytics, Oracle|
|CON8764 – Analytics for Oracle Fusion Applications: Overview and Strategy|Florian Schouten – Senior Director, Product Management/Strategy, Oracle|
|CON8330 – Implementing Big Data Solutions: From Theory to Practice|Josef Pugh, Oracle|
|CON8524 – Oracle TimesTen In-Memory Database for Oracle Exalytics: Overview|Tirthankar Lahiri – Senior Director, Oracle|
|CON9510 – Oracle BI Analytics and Reporting: Where to Start?|Mauricio Alvarado – Principal Product Manager, Oracle|
|CON8438 – Scalable Statistics and Advanced Analytics: Using R in the Enterprise|Marcos Arancibia Coddou – Product Manager, Oracle Advanced Analytics, Oracle|
|CON4951 – Southwestern Energy’s Creation of the Analytical Enterprise|Jim Vick, Southwestern Energy; Richard Solari – Specialist Leader, Deloitte Consulting LLP|
|CON8311 – Mining Big Data with Semantic Web Technology: Discovering What You Didn’t Know|Zhe Wu – Consultant Member of Tech Staff, Oracle; Xavier Lopez – Director, Product Management, Oracle|
|CON8428 – Analyze This! Analytical Power in SQL, More Than You Ever Dreamt Of|Hermann Baer – Director Product Management, Oracle; Andrew Witkowski – Architect, Oracle|
|CON6143 – Big Data in Financial Services: Technologies, Use Cases, and Implications|Omer Trajman, Cloudera; Ambreesh Khanna – Industry Vice President, Oracle; Sunil Mathew – Senior Director, Financial Services Industry Technology, Oracle|
|CON8425 – Big Data: The Big Story|Jean-Pierre Dijcks – Sr. Principal Product Manager, Oracle|
|CON10327 – Recommendations in R: Scaling from Small to Big Data|Mark Hornick – Senior Manager, Oracle|
Download R : http://www.r-project.org/
R installation instructions : http://star-www.st-andrews.ac.uk/cran/
In a previous post I gave the details of how you can use Regression in Oracle Data Miner to predict/forecast the lean of the Leaning Tower of Pisa in future years. This was based on building a regression model in ODM using the known lean/tilt of the tower for a range of years.
In this post I will show you how you can do the same tasks using the Oracle Data Miner functions in SQL and PL/SQL.
Step 1 – Create the table and data
The easiest way to do this is to make a copy of the PISA table we created in the previous blog post. If you haven’t completed this, then go to the blog post and complete step 1 and step 2.
create table PISA_2
as select * from PISA;
Step 2 – Create the ODM Settings table
We need to create a ‘settings’ table before we can use the ODM APIs in PL/SQL. The purpose of this table is to store all the configuration parameters needed for the algorithm to work. In our case we only need to set two parameters.
delete from pisa_2_settings;
INSERT INTO PISA_2_settings (setting_name, setting_value) VALUES
INSERT INTO PISA_2_settings (setting_name, setting_value) VALUES
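The listing above is incomplete as it appears here: the statement that creates the settings table and the two VALUES clauses are missing. The following is only a sketch of what the setup might look like. It assumes the two parameters were the algorithm name (GLM) and automatic data preparation, and that the build view used in the next step (pisa_2_build_v) is simply the rows of PISA_2 with a known tilt. Both the choice of settings and the view definition are assumptions and are not taken from the original listing.

```sql
-- Assumed definition of the settings table (the usual two-column layout).
CREATE TABLE pisa_2_settings (
   setting_name  VARCHAR2(30),
   setting_value VARCHAR2(4000)
);

-- Assumed settings: use the GLM algorithm and switch on automatic data preparation.
-- The package constants can only be referenced from PL/SQL, hence the anonymous block.
BEGIN
   INSERT INTO pisa_2_settings (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_generalized_linear_model);

   INSERT INTO pisa_2_settings (setting_name, setting_value)
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

   COMMIT;
END;
/

-- Assumed build view: only the years where the tilt is known.
CREATE OR REPLACE VIEW pisa_2_build_v AS
SELECT year_measured, tilt
FROM   pisa_2
WHERE  tilt IS NOT NULL;
```

If you would rather build an SVM model, the algorithm setting would be dbms_data_mining.algo_support_vector_machines instead.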
Step 3 – Build the Regression Model
To build the regression model we need to use the CREATE_MODEL procedure that is part of the DBMS_DATA_MINING package. When calling this procedure we need to pass in the name of the model, the mining function to use (regression in our case), the source data, the settings table and the target column we are interested in.
BEGIN
   -- Build the regression model from the build view using the settings table
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'PISA_REG_2',
      mining_function     => dbms_data_mining.regression,
      data_table_name     => 'pisa_2_build_v',
      case_id_column_name => null,
      target_column_name  => 'tilt',
      settings_table_name => 'pisa_2_settings');
END;
/
After this we should have our regression model.
Step 4 – Query the Regression Model details
To find out what was produced in the previous step we can query the data dictionary.
where model_name like 'P%';
where model_name like 'P%';
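Only the WHERE clauses of the two queries survive above. As a hedged sketch, two data dictionary views that would fit here are USER_MINING_MODELS and USER_MINING_MODEL_SETTINGS. Which views and columns the original queries used is an assumption.

```sql
-- Models in the current schema whose names start with 'P'.
SELECT model_name, mining_function, algorithm, creation_date, build_duration
FROM   user_mining_models
WHERE  model_name LIKE 'P%';

-- The settings that each of those models was built with.
SELECT model_name, setting_name, setting_value, setting_type
FROM   user_mining_model_settings
WHERE  model_name LIKE 'P%';
```

The second query is a handy way to confirm that the values in the settings table were actually picked up when the model was built.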
Step 5 – Apply the Regression Model to new data
Our final step is to apply the model to our new data, i.e. the years for which we want to know what the lean/tilt would be.
SELECT year_measured, prediction(pisa_reg_2 using *)
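The query above is cut off after the select list. A minimal sketch of a complete version follows. It assumes the apply data is the set of rows in PISA_2 where the tilt is not yet known; the FROM and WHERE clauses and the column alias are assumptions.

```sql
-- Score the future years with the PISA_REG_2 model.
SELECT   year_measured,
         PREDICTION(pisa_reg_2 USING *) AS predicted_tilt
FROM     pisa_2
WHERE    tilt IS NULL
ORDER BY year_measured;
```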
This blog post will look at how you can use the Regression feature in Oracle Data Miner (ODM) to predict the lean/tilt of the Leaning Tower of Pisa in the future.
This is a well-known regression exercise, and it typically comes with a set of known values and the years for these values. There are lots of websites that contain the details of the problem. A summary of it is:
The following table gives measurements for the years 1975-1987 of the “lean” of the Leaning Tower of Pisa. The variable “lean” represents the difference between where a point on the tower would be if the tower were straight and where it actually is. The data is coded as tenths of a millimetre in excess of 2.9 metres, so that the 1975 lean, which was 2.9642 metres, appears in the table as 642. In this post the values are stored in full, for example 2.9642.
Given the lean for the years 1975 to 1987, can you calculate the lean for a future date like 2000, 2009 or 2012?
Step 1 – Create the table
Connect to a schema that you have set up for use with Oracle Data Miner. Create a table (PISA) with 2 attributes, YEAR_MEASURED and TILT. Both of these attributes need to have the NUMBER datatype. If they are VARCHAR, ODM may ignore them or you might get an error.
CREATE TABLE PISA (YEAR_MEASURED NUMBER, TILT NUMBER);
Step 2 – Insert the data
There are 2 sets of data that need to be inserted into this table. The first is the data from 1975 to 1987 with the known values of the lean/tilt of the tower. The second set of data is the future years where we do not know the lean/tilt and we want ODM to calculate the value based on the Regression model we want to create.
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1975,2.9642);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1976,2.9644);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1977,2.9656);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1978,2.9667);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1979,2.9673);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1980,2.9688);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1981,2.9696);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1982,2.9698);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1983,2.9713);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1984,2.9717);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1985,2.9725);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1986,2.9742);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1987,2.9757);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1988,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1989,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1990,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (1995,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2000,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2005,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2010,null);
Insert into DMUSER.PISA (YEAR_MEASURED,TILT) values (2009,null);
Step 3 – Start ODM and Prepare the data
Open SQL Developer and open the ODM Connections tab. Connect to the schema that you have created the PISA table in. Create a new Project or use an existing one and create a new Workflow for your PISA ODM work.
Create a Data Source node in the workspace and assign the PISA table to it. You can select all the attributes.
The table contains the data that we need to build our regression model (our training data set) and the data that we will use for predicting the future lean/tilt (our apply data set).
We need to apply a filter to the PISA data source to only look at the training data set. Select the Filter Rows node and drag it to the workspace. Connect the PISA data source to the Filter Rows node. Double click on the Filter Rows node and select the Expression Builder icon. Create the where clause to select only the rows where we know the lean/tilt.
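The filter is an ordinary SQL condition, so it can be tried out in a SQL worksheet before it goes into the Expression Builder. A minimal sketch, assuming the PISA table from Step 1:

```sql
-- Training data set: only the rows where the lean/tilt is known.
-- In the Expression Builder only the condition itself (TILT IS NOT NULL) is entered.
SELECT   year_measured, tilt
FROM     pisa
WHERE    tilt IS NOT NULL
ORDER BY year_measured;
```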
Step 4 – Create the Regression model
Select the Regression Node from the Models component palette and drop it onto your workspace. Connect the Filter Rows node to the Regression Build Node.
Double click on the Regression Build node and set the Target to the TILT variable. You can leave the Case ID unset. You can also select whether you want to build a GLM or SVM regression model, or both of them. Uncheck the AUTO check box. By doing this Oracle will not try to do any automatic data processing or attribute elimination.
You are now ready to create your regression models.
To do this right click the Regression Build node and select Run. When everything is finished you will get a little green tick on the top right hand corner of each node.
Step 5 – Predict the Lean/Tilt for future years
The PISA table that we used above also contains our apply data set.
We need to create a new Filter Rows node on our workspace. This will be used to only look at the rows in PISA where TILT is null. Connect the PISA data source node to the new filter node and edit the expression builder.
Next we need to create the Apply Node. This allows us to run the Regression model(s) against our Apply data set. Connect the second Filter Rows node to the Apply Node and the Regression Build node to the Apply Node.
Double click on the Apply Node. Under the Apply Columns we can see that we will have 4 attributes created in the output. 3 of these attributes will be for the GLM model and 1 will be for the SVM model.
Click on the Data Columns tab and edit the data columns so that we get the YEAR_MEASURED attribute to appear in the final output.
Now run the Apply node by right clicking on it and selecting Run.
Step 6 – Viewing the results
When we get the little green tick on the Apply node we know that everything has run and completed successfully.
To view the predictions right click on the Apply Node and select View Data from the menu.
We can see that the GLM model gives the results we would expect but the SVM does not.
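One way to judge what results "we would expect" is to fit a straight line to the known values with ordinary least squares and extend it to the future years. The sketch below does this with Oracle's built-in REGR_SLOPE and REGR_INTERCEPT aggregate functions. It is an independent cross-check and not part of the ODM workflow, and the column alias and rounding are my own choices.

```sql
-- Fit tilt = intercept + slope * year on the rows with a known tilt,
-- then evaluate that line for the years where the tilt is unknown.
WITH fit AS (
   SELECT REGR_SLOPE(tilt, year_measured)     AS slope,
          REGR_INTERCEPT(tilt, year_measured) AS intercept
   FROM   pisa
   WHERE  tilt IS NOT NULL
)
SELECT   p.year_measured,
         ROUND(f.intercept + f.slope * p.year_measured, 4) AS predicted_tilt
FROM     pisa p
CROSS JOIN fit f
WHERE    p.tilt IS NULL
ORDER BY p.year_measured;
```

If the GLM predictions from the Apply node track these values closely, the model is behaving like a simple linear fit, which is what this data set calls for.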
A few weeks ago I had a blog post called Domain Knowledge + Data Skills = Data Miner.
In that blog post I was saying that to be a Data Scientist all you needed was Domain Knowledge and some Data Skills, which included Data Mining.
The reality is that the skill set of a Data Scientist will be much larger. There is a saying ‘A jack of all trades and a master of none’. When it comes to being a data scientist you need to be a bit like this but perhaps a better saying would be ‘A jack of all trades and a master of some’.
I’ve put together the following diagram, which includes most of the skills with an outer circle of more fundamental skills. It is this outer ring of skills that is fundamental in becoming a data scientist. The skills in the inner part of the diagram are skills in which most people will have some experience of one or more. The other skills can be developed and learned over time, all depending on the type of person you are.
Can we train someone to become a data scientist, or are they born to be a data scientist? It is a little bit of both really, but you need to have some of the fundamental skills and the right type of personality. The learning of the other skills should be easy(ish).
What do you think? Are there skills that I’m missing?
Over the past few weeks I have been talking to a lot of people who are looking at how data mining can be used in their organisation and for their projects, and to people who have been doing data mining for a long time.
What comes across from talking to the experienced people, and these people are not tied to a particular product, is that you need to concentrate on the business problem. Once you have this well defined then you can drill down to the deeper levels of the project. Some of these levels will include what data is needed (not what data you have), tools, algorithms, etc.
Statistics is only a very small part of a data mining project. Some people who have PhDs in statistics and who work in data mining say they do not use, or very rarely use, their statistics skills.
Some quotes that I like are:
“Focus hard on Business Question and the relevant target variable that captures the essence of the question.” Dean Abbott PAW Conf April 2012
“Find me something interesting in my data is a question from hell. Analysis should be guided by business goals.” Colin Shearer PAW Conf Oct 2011
There have been a lot of blog posts and articles on what the key skills are for a Data Miner and the more popular Data Scientist. What is very clear from all of these is that you will spend most of your time looking at, examining, integrating, manipulating, preparing, standardising and formatting the data. It has been quoted that all of these tasks can take up 70% to 85% of a Data Miner's/Data Scientist's time. All of these tasks are commonly performed by database developers, and in particular by the developers and architects involved in Data Warehousing projects. The rest of the time is spent running the data mining algorithms, examining the results, and yes, some stats too.
Very little time is spent developing algorithms! Why is this? Could it be that the algorithms are already developed (and have been for a long time, so they are well tuned) and available in all the data mining tools? We can almost treat these algorithms as a black box. So one of the key abilities of a data miner/data scientist would be to know what the algorithms can do, what kind of problems they can be used for, what kind of outputs they produce, etc.
Domain knowledge is important, no matter how little it is, in preparing for and being involved in a data mining project. As we define our business problem the domain expert can bring their knowledge to the problem, which allows us to separate the domain related problems from the data related problems. So the domain expertise is critical at the start of a project, but it is also critical when we have the outputs from the data mining algorithms. We can use the domain knowledge to tie the outputs from the data mining algorithms back to the original business problem we are working on and give them real meaning.
So what is the formula of skill sets for a data miner or data scientist? Well, it is a little like the title of this blog post:
Domain Knowledge + Data Skills + Data Mining Skills + a little bit of Machine Learning + a little bit of Stats = a Data Miner / Data Scientist