Month: November 2016
Yes you read the title of this blog post correctly!
Recently I was doing a bit of a clear-out and I came across a CD of the UKOUG Conference proceedings from 1999. That was my second UKOUG conference, and how times have changed.
The CD contained all the conference proceedings consisting of slides and papers.
Here are some familiar names from back in 1999. Some you may find presenting at this year's conference, some you might remember as regular presenters, and some are still presenting but not at this year's conference.
- Jonathan Lewis
- Carl Dudley
- Fiona Martin
- Peter Robson
- Duncan Mills
- Kent Graziano
- John King
- Toby Price
- Doug Burns
- Dan Hotka
- Joel Goodman
In 1999 Ralph Fiennes did the Keynote speech. I queued up afterwards to get a signed book but they ran out with three people ahead of me 😦
The agenda grid was a bit smaller back then compared to now.
I’ll see you again in Birmingham this year, in a few days’ time 🙂
If you are a user of the Oracle Data Miner tool (the workflow data mining tool that is part of SQL Developer), then you will have noticed that for many of the algorithms you can specify a Case Id attribute along with, say, the target attribute.
The idea is that you have one attribute that is a unique identifier for each case record. This may or may not be the case in your data model, and you may have a multi-attribute primary key or case record identifier.
But what is the Case Id field used for in Oracle Data Miner?
Based on the documentation this field does not need to have a value. But it is recommended that you do identify an attribute for the Case Id, as this will allow for reproducible results. What this means is that if we run our workflow today and again in a few days’ time, on the exact same data, we should get the same results. So the Case Id allows this to happen. But how? Well, it looks like the attribute specified for the Case Id is used as part of the hashing algorithm that partitions the data into train and test data sets for classification problems.
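The idea of hashing the Case Id to get a repeatable split can be sketched in plain SQL. This is only an illustration of the principle, not what Oracle Data Miner actually runs internally. It assumes a table called case_table with a numeric id_column (both names are hypothetical here) and uses the ORA_HASH function to place each row into one of 100 buckets, sending roughly 60% of the rows to training.

```sql
-- Sketch only: a repeatable train/test split driven by a hash of the case identifier.
-- case_table and id_column are assumed names; the 60/40 ratio is an arbitrary choice.
SELECT id_column,
       CASE
         WHEN ORA_HASH(id_column, 99, 0) < 60 THEN 'TRAIN'   -- buckets 0..59
         ELSE 'TEST'                                         -- buckets 60..99
       END AS data_set
FROM   case_table;
```

Because the bucket depends only on the value of id_column, the same record lands in the same data set every time the query is run, which is what makes the results reproducible.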
So if you don’t have a single attribute case identifier in your data set, then you need to create one. There are a few options open to you to do this.
- Create one: write some code that will generate a unique identifier for each of your case records based on some defined rule.
- Use a sequence: and update the records to use this sequence.
- Use ROWID: use the unique row identifier value. You can write some code to populate this value into an attribute, or create a view on the table containing the case records and add a new attribute that uses the ROWID. But if you move the data, the next time you use the view you will get different ROWIDs, and that in turn means we may have different case records going into our test and training data sets. So our workflows will generate different results. Not what we want.
- Use ROWNUM: this is kind of like using the ROWID. Again we can have a view that selects ROWNUM for each record. We may have the same issues, but if our data is ordered in a way that ensures the records are returned in the same order every time, then this approach is OK to use.
- Use Identity Column: in Oracle 12c we have a new feature called Identity Column. This acts like a sequence, but we can define an attribute in a table to be an Identity Column, and as records are inserted into the table (in our scenario our case table) this column will automatically generate a unique number for each record. Again, if you need to repopulate the case table, you will need to drop and recreate the table to get the Identity Column to reset; otherwise the newly inserted records will start with the next number of the Identity Column.
Here is an example of using the Identity Column in a case table.
    CREATE TABLE case_table (
      id_column     NUMBER GENERATED ALWAYS AS IDENTITY,
      affinity_card NUMBER,
      age           NUMBER,
      cust_gender   VARCHAR2(5),
      country_name  VARCHAR2(20)
      ...
    );
You can now use this Identity Column as the Case Id in your Oracle Data Miner workflows.
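For the ROWNUM option mentioned earlier, a view with an explicit ordering might look something like the following. This is a sketch under assumptions: case_data, branch_id and account_no are hypothetical names standing in for a table with a two-attribute key, and it uses the ROW_NUMBER analytic function rather than ROWNUM so that the ordering is stated in the view itself.

```sql
-- Sketch only: derive a single-attribute case identifier from a multi-attribute key.
-- case_data, branch_id and account_no are assumed names for illustration.
CREATE OR REPLACE VIEW case_data_v AS
SELECT ROW_NUMBER() OVER (ORDER BY branch_id, account_no) AS case_id,
       t.*
FROM   case_data t;
```

The numbering stays the same between runs only while the underlying rows are unchanged. If records are inserted or deleted, the case_id values of later rows will shift, so this suits static case tables best.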
The Oracle 12.2 Database has been released and is currently available as a Cloud Service. The on-premises version should be with us soon.
A few weeks ago I listed some of the new features that you will find in the Oracle Data Miner GUI tool (check out that blog post). I’ll have another blog post soon that looks a bit closer at how the new OAA features are exposed in this tool.
In this blog post I will list most of the new database related features in Oracle 12.2. There are a lot of new features and a lot of updated features. Over the next few months (yes it will take that long) I’ll have blog posts on most of these.
The Oracle Advanced Analytics Option new features include:
- The first new feature is one that you cannot see. Yes, that sounds a bit odd. But the underlying architecture of OAA has been rebuilt to allow the algorithms to scale significantly. This is also future proofing OAA for new features coming in future releases of the database.
- Explicit Semantic Analysis. This is a new algorithm that allows us to perform text similarity comparison. This is a great new addition and is much, much easier now compared to what we may have had to do previously.
- Using R models using SQL. Although we have been able to do this in the previous version of the database, the framework and support have been extended to allow for greater and easier usage of user defined R scripts and R models with the in-database environment.
- Partitioned Models. We can now build partitioned mining models. This is where you can specify an attribute and a separate model will be created based on each value in the attribute.
- Partitioned scoring. Similarly we can now dynamically score the data based on a partition attribute.
- Extensions to Association Rules. Over the past few releases of the database, additional insights into the workings and decision making of the algorithms have been included. In 12.2 we now have some additional insights for the Association Rules algorithm, where we can now see the calculation of values associated with rules.
- DBMS_DATA_MINING package extended. This PL/SQL package has been extended to include the functionality for the new features listed above. Additionally, it can now process R algorithms and models.
- SQL Function changes: changes to the following ODM related SQL functions to allow for partitioned models. CLUSTER_DETAILS, CLUSTER_DISTANCE, CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET, FEATURE_COMPARE, FEATURE_DETAILS, FEATURE_ID, FEATURE_SET, FEATURE_VALUE, ORA_DM_PARTITION_NAME, PREDICTION, PREDICTION_BOUNDS, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_PROBABILITY, PREDICTION_SET
- New SQL Hint for ODM models. We have had hints in SQL for many, many versions now, but with 12.2 we now have a hint for partitioned models, called the GROUPING hint.
- New CREATE_MODEL function. With the existing CREATE_MODEL function the input data set for the function needed to be defined in a table or accessed using a view. Basically the data needed to reside somewhere. With CREATE_MODEL2 you can now define the input data set based on a SELECT statement.
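To give a feel for the last item in the list, here is a rough sketch of a CREATE_MODEL2 call that builds a classification model directly from a query. The model name DEMO_DT_MODEL, the table case_table and its columns are assumptions made for illustration, and the parameter names are as I recall them from the DBMS_DATA_MINING documentation, so do check them against the documentation for your release.

```sql
-- Sketch only: build a classification model from a SELECT statement with CREATE_MODEL2.
-- DEMO_DT_MODEL, case_table and its columns are assumed names for illustration.
DECLARE
  v_settings DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  -- Algorithm choice and automatic data preparation
  v_settings('ALGO_NAME') := 'ALGO_DECISION_TREE';
  v_settings('PREP_AUTO') := 'ON';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'DEMO_DT_MODEL',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT id_column, affinity_card, age, cust_gender, country_name
                            FROM   case_table
                            WHERE  age >= 18',
    set_list            => v_settings,
    case_id_column_name => 'ID_COLUMN',
    target_column_name  => 'AFFINITY_CARD');
END;
/
```

The settings are passed in a PL/SQL collection rather than a separate settings table, and the WHERE clause shows the kind of filtering that previously would have needed its own view.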
In addition to all of these changes there are also some interesting new DB, SQL and PL/SQL features that are of particular interest for your data science, machine learning, advanced analytics (or whatever the current favourite marketing term is today) projects.
It is going to be a busy few months ahead, working through all of these new features and writing blog posts on how to use each of them.
Have you ever been faced with having to generate a data set in the format that is needed by another analytics tool? Or having to generate a data set in a particular format when you don’t have the software that generates that format? For example, if you are submitting data to the FDA and other bodies, you may need to submit the data in a SAS formatted file. There are a few ways you can go about this.
One option is to use the Haven R package to generate your data set in SAS and SPSS formats. You can also use it to read in SAS and SPSS formatted files. I have to deal with these formatted data files all the time, and it can be a challenge, but I’ve recently come across the Haven R package and it has made my life a little bit (or lots) easier. Now I can easily generate SAS and SPSS formatted data sets from the data in my Oracle Database, using R and ORE. With ORE we can also use the embedded R execution feature to build the generation of these data sets into some of our end-user applications.
Let us have a look at Haven and what it can do.
Firstly, there is very little if any documentation online for it. That is OK, so we will have to rely on the documentation that comes with the R package. Again there isn’t much there, and that is because the package mainly consists of functions to read in these data sets, functions to write these data sets, and some additional functions for preparing data.
For reading in data sets we have the following functions:
    # SAS
    read_sas("mtcars.sas7bdat")
    # Stata
    read_dta("mtcars.dta")
    # SPSS
    read_sav("mtcars.sav")
For writing data sets we have the following functions:
    # SAS
    write_sas(mtcars, "mtcars.sas7bdat")
    # Stata
    write_dta(mtcars, "mtcars.dta")
    # SPSS
    write_sav(mtcars, "mtcars.sav")
Let us now work through an example of creating a SAS data set. We can use some of the sample data sets that come with the Oracle Database in the SH schema. I’m going to use the data in the CUSTOMERS table to create a SAS data set. In the following code I’m using ORE to connect to the database but you can use your preferred method.
    > library(ORE)
    > # Create your connection to the schema in the DB
    > ore.connect(user="sh", password="sh", host="localhost", service_name="PDB12C",
                  port=1521, all=TRUE)
    > dim(CUSTOMERS)
    [1] 55500    23
    > names(CUSTOMERS)
     [1] "CUST_ID"                "CUST_FIRST_NAME"        "CUST_LAST_NAME"
     [4] "CUST_GENDER"            "CUST_YEAR_OF_BIRTH"     "CUST_MARITAL_STATUS"
     [7] "CUST_STREET_ADDRESS"    "CUST_POSTAL_CODE"       "CUST_CITY"
    [10] "CUST_CITY_ID"           "CUST_STATE_PROVINCE"    "CUST_STATE_PROVINCE_ID"
    [13] "COUNTRY_ID"             "CUST_MAIN_PHONE_NUMBER" "CUST_INCOME_LEVEL"
    [16] "CUST_CREDIT_LIMIT"      "CUST_EMAIL"             "CUST_TOTAL"
    [19] "CUST_TOTAL_ID"          "CUST_SRC_ID"            "CUST_EFF_FROM"
    [22] "CUST_EFF_TO"            "CUST_VALID"
Next we can prepare the data, take a subset of the data, reformat the data, etc. For me I just want to use the data as it is. All I need to do now is to pull the data from the database to my local R environment.
dat <- ore.pull(CUSTOMERS)
Then I need to load the Haven library and then create the SAS formatted file.
    library(haven)
    write_sas(dat, "c:/app/my_customers.sas7bdat")
That’s it. Nice and simple.
But has it worked? Has it created the file correctly? Will it load into my SAS tool?
There is only one way to test this and that is to open it in SAS. I have an account on SAS OnDemand with access to several SAS products. I’m going to use SAS Studio.
Well it works! The following image shows SAS Studio after I had loaded the data set with the variables and data shown.
WARNING: When you load the data set into SAS you may get a warning message saying that it isn’t a SAS data set. What this means is that it is not a data set generated by SAS. But as you can see in the image above all the data got loaded OK and you can work away with it as normal in your SAS tools.
The next step is to test the loading of a SAS data set into R. I’m going to use one of the standard SAS data sets called PVA97NK.SAS7BDAT. If you have worked with SAS products then you will have come across this data set.
When you use Haven to load in your SAS data set, it will create the data in tibble format. This is a slight variant of a data.frame. So if you want the typical format of a data.frame then you will need to convert the loaded data, as shown in the following code.
    > data_read <- read_sas("PVA97NK.SAS7BDAT")
    > dim(data_read)
    [1] 9686   28
    > d <- as.data.frame(data_read)
    > class(data_read)
    [1] "tbl_df"     "tbl"        "data.frame"
    > class(d)
    [1] "data.frame"
    > head(d)
      TARGET_B       ID TARGET_D GiftCnt36 GiftCntAll GiftCntCard36 GiftCntCardAll
    1        0 00014974       NA         2          4             1              3
    2        0 00006294       NA         1          8             0              3
    3        1 00046110        4         6         41             3             20
    ...
I think this package is going to make my life a little bit easier, and if you work with SPSS and SAS data sets then hopefully some of your tasks have become a little bit easier too.