Oracle Machine Learning
In my previous blog post I went through the steps of setting up OCI to allow you to access OCI Data Science. Those steps showed the setup and configuration for your Data Science Team.
In this post I will walk through the steps necessary to create an OCI Data Science Project and Notebook, and then explore the basic Notebook environment.
1 – Create a Project
From the main menu on the Oracle Cloud home page select Data Science -> Projects.
Select the appropriate Compartment in the drop-down list on the left-hand side of the screen. In my previous blog post I created a separate Compartment for my Data Science work and team. Then click on the Create Projects button.
Enter a name for your project. I called this project ‘DS-Demo-Project’. Click the Create button.
That’s the Project created.
2 – Create a Notebook
After creating a project (see above) you can now create one or many Notebook Sessions.
To create a Notebook Session click on the Create Notebook Session button (see the above image). This will create a VM to contain your notebook and associated work. Like all VMs in Oracle Cloud, these come in various shapes. The shape can be adjusted at a later time to scale up and then back down based on the work you will be performing.
The following example creates a Notebook Session using the basic VM shape. I call the Notebook ‘DS-Demo-Notebook’. I also set the Block Storage size to 50 GB, which is the minimum value. The VCN details have been defaulted to those assigned to the Compartment. Click the Create button at the bottom of the page.
The Notebook Session VM will be created. This might take a few minutes. When created you will see a screen like the following.
3 – Open the Notebook
After completing the above steps you can now open the Notebook Session in your browser. Either click on the Open button (see above image), or copy the link and share with your data science team.
Important: There are a few important considerations when using the Notebooks. While the session is running you will be paying for it, even if the session was terminated at the browser or you lost your connection. To manage costs, you may need to stop the Notebook session. More details on this in a later post.
After clicking on the Open button, a new browser tab will open and will ask you to log in.
After logging in you will see your Notebook.
4 – Explore the Notebook Environment
The Notebook comes pre-loaded with lots of goodies.
The menu on the left-hand side provides a directory with lots of sample Notebooks, access to the block storage and a sample getting started Notebook.
When you are ready to create your own Notebook you can click on the icon for that.
Or if you already have a Notebook, created elsewhere, you can load that into your OCI Data Science environment.
The uploaded Notebook will appear in the list on the left-hand side of the screen.
After a very, very, very long wait (18+ months) the Oracle OCI Data Science platform is now available.
But before you jump straight into using OCI Data Science, there is a little bit of setup required for your Cloud Tenancy. There is the simple approach and then there is the slightly more involved approach. These are:
- Simple approach. Assuming you are just going to use the root tenancy and compartment, you just need to set up a new policy to enable the use of the OCI Data Science services. This assumes you have your VCN configuration complete with NAT etc. This can be done by creating a policy with the following policy statement. After creating this you can proceed with creating your first notebook in OCI Data Science.
allow service datascience to use virtual-network-family in tenancy
- Slightly more complicated approach. When you move to a team-based approach you will need to create some additional Oracle Cloud components to manage the team and the resources allocated to it. This involves creating Compartments, allocating users, VCNs, Policies etc. The following instructions take you through these steps.
IMPORTANT: If, after creating a Compartment or one of the other items listed below, it is not displayed in the expected drop-down lists, then either refresh your screen or log out and log back in again!
1. Create a Group for your Data Science Team & Add Users
The first step involves creating a Group to ‘group’ the various users who will be using the OCI Data Science services.
Go to Governance and Administration -> Identity and click on Groups.
Enter some basic descriptive information. I called my Group, ‘my-data-scientists’.
Now click on your Group in the list of Groups and add the users to the group.
You may need to create the accounts for the various users.
2. Create a Compartment for your Data Science work
Now create a new Compartment to own the network resources and the Data Science resources.
Go to Governance and Administration -> Identity and click on Compartments.
Enter some basic descriptive information. I’ve called my compartment, ‘My-DS-Compartment’.
3. Create Network for your Data Science work
Creating and setting up the VCN (Virtual Cloud Network) can be a little bit of fun. You can do it the manual way, whereby you set up and configure everything, or you can use the wizard. I’m going to show the wizard approach below.
But the first thing you need to do is to select the Compartment the VCN will belong to. Select this from the drop-down list on the left-hand side of the Virtual Cloud Network page. If your compartment is not listed, then log out and log back in!
To use the wizard approach click the Networking QuickStart button.
Select the option ‘VCN with Internet Connectivity’ and click Start Workflow, as you will want to connect to it and to allow the service to connect to other cloud services.
I called my VCN ‘My-DS-vnc’ and took the default settings. Then click the Next button.
The next screen shows a summary of what will be done. Click the Create button, and all of these networking components will be created.
All done with creating the VCN.
4. Create required Policies to enable OCI Data Science for your Compartment
There are three policies needed to allocate the necessary resources to the various components we have just created. To create these go to Governance and Administration -> Identity and click on Policies.
Select your Compartment from the drop-down list. This should be ‘My-DS-Compartment’, then click on Create Policy.
The first policy allocates a group to a compartment for the Data Science services. I called this policy, ‘DS-Manage-Access’.
allow group my-data-scientists to manage data-science-family in compartment My-DS-Compartment
The next policy is to give the Data Science users access to the network resources. I called this policy, ‘DS-Manage-Network’.
allow group my-data-scientists to use virtual-network-family in compartment My-DS-Compartment
And the third policy is to give Data Science service access to the network resources. I called this policy, ‘DS-Network-Access’.
allow service datascience to use virtual-network-family in compartment My-DS-Compartment
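The three statements above differ only in the group and compartment names, so they can be generated rather than retyped. The following is a minimal sketch in Python, used here purely for illustration. The function name `data_science_policies` is hypothetical, and it only builds the statement text; it does not call any Oracle Cloud API.

```python
def data_science_policies(group, compartment):
    """Return the three policy statements from this post, keyed by policy name.

    `group` and `compartment` are the names created in steps 1 and 2.
    This helper is hypothetical and only formats text.
    """
    return {
        "DS-Manage-Access": (
            f"allow group {group} to manage data-science-family "
            f"in compartment {compartment}"
        ),
        "DS-Manage-Network": (
            f"allow group {group} to use virtual-network-family "
            f"in compartment {compartment}"
        ),
        "DS-Network-Access": (
            "allow service datascience to use virtual-network-family "
            f"in compartment {compartment}"
        ),
    }


policies = data_science_policies("my-data-scientists", "My-DS-Compartment")
for name, statement in policies.items():
    print(f"{name}: {statement}")
```

Each printed statement can then be pasted into the Create Policy screen for the matching policy name.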
Job Done 🙂
You are now set up to run the OCI Data Science service. Check out my Blog Post on creating your first OCI Data Science Notebook and exploring what is available in this Notebook.
When it comes to working with R to access and process your data there are a number of little features and behaviors you need to look out for.
One of these is the DATE datatype.
The main issue that you have to look out for is the TIMEZONE conversion that happens when you extract the data from the database into your R environment.
There is a datatype conversion from the Oracle DATE into the POSIXct format. The POSIXct datatype also includes the timezone. But the Oracle DATE datatype does not have a timezone part.
When you look into this a bit more you will see that the main issue is what timezone your R session has. By default your R session will inherit the OS session timezone. For me here in Ireland we have the same timezone as the UK. You would think that the timezone would therefore be GMT. But this is not the case. What we have for timezone is BST (British Summer Time), which takes daylight saving time into account. So on the 26th May, BST is one hour ahead of GMT.
OK. Let’s have a look at a sample scenario.
As mentioned above, when I select data of type DATE from Oracle into R, using ROracle, I end up getting a different date value than what was in the database. The same happens when I process and store the data.
The following outlines the data setup and some of the R code that was used to generate the issue/problem.
Create a table that contains a DATE field and insert some records.
CREATE TABLE STAFF (
  STAFF_NUMBER VARCHAR2(20),
  FIRST_NAME   VARCHAR2(20),
  SURNAME      VARCHAR2(20),
  DOB          DATE,
  PROG_CODE    VARCHAR2(6 BYTE),
  PRIMARY KEY (STAFF_NUMBER));

insert into staff values (123456789, 'Brendan', 'Tierney', to_date('01/06/1975', 'DD/MM/YYYY'), 'DEPT_1');
insert into staff values (234567890, 'Sean', 'Reilly', to_date('21/10/1980', 'DD/MM/YYYY'), 'DEPT_2');
insert into staff values (345678901, 'John', 'Smith', to_date('12/03/1973', 'DD/MM/YYYY'), 'DEPT_3');
insert into staff values (456789012, 'Barry', 'Connolly', to_date('25/01/1970', 'DD/MM/YYYY'), 'DEPT_4');
You can query this data in SQL without any problems. As you can see there is no timezone element to these dates.
Selecting the data
I now establish my connection to my schema in my 12c database using ROracle. I won’t bore you with the details here of how to do it but check out point 3 on this post for some details.
When I select the data I get the following.
> res <- dbSendQuery(con, "select * from staff")
> data <- fetch(res)
> data$DOB
"1975-06-01 01:00:00 BST" "1980-10-21 01:00:00 BST" "1973-03-12 00:00:00 BST"
"1970-01-25 01:00:00 BST"
As you can see, two things have happened to my date data when it was extracted from Oracle. Firstly, a timezone has been assigned to the data, even though there was no timezone part in the original data. Secondly, some sort of timezone conversion from GMT to BST has been performed. The difference between GMT and BST is the daylight saving time, hence the 01:00:00 being added to the time element that was extracted. This time should have been 00:00:00. You can see we have a mixture of times!
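The shift described above can be reproduced outside of R. The following is a minimal sketch in Python, used here purely for illustration, and it is not a description of ROracle internals. It assumes the date-only value is treated as midnight GMT and then displayed in the session timezone. `GMT` and `BST` are fixed-offset stand-ins defined for this example, so unlike a real timezone they never switch offset during the year, and the function name `as_seen_by_session` is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed-offset stand-ins for the two zones discussed in the post (assumption:
# real zones change offset during the year, these do not).
GMT = timezone(timedelta(hours=0), "GMT")
BST = timezone(timedelta(hours=1), "BST")


def as_seen_by_session(day, stored_tz, session_tz):
    """Treat a date-only value as midnight in stored_tz, then show it in session_tz."""
    midnight = datetime(day.year, day.month, day.day, tzinfo=stored_tz)
    return midnight.astimezone(session_tz).strftime("%Y-%m-%d %H:%M:%S %Z")


dob = date(1975, 6, 1)

# Session timezone differs from the one the value was stored under.
mismatched = as_seen_by_session(dob, GMT, BST)
# Session timezone matches, so the time element stays at midnight.
matched = as_seen_by_session(dob, GMT, GMT)

print(mismatched)  # 1975-06-01 01:00:00 BST
print(matched)     # 1975-06-01 00:00:00 GMT
```

The one hour difference comes entirely from the display step: the stored instant is the same in both cases.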
So there appears to be some difference between the R date or timezone to what is being used in Oracle.
To add to this problem I was playing around with some dates and different records. I kept on getting this scenario but I also got the following, where we have a mixture of GMT and BST times and timezones. I’m not sure why we would get this mixture.
> data$DOB
"1995-01-19 00:00:00 GMT" "1965-06-20 01:00:00 BST" "1973-10-20 01:00:00 BST"
"2000-12-28 00:00:00 GMT"
This is all a bit confusing and annoying. So let us look at how you can now fix this.
Fixing the problem : Setting Session variables
To fix this, and to ensure consistency between what is in Oracle and what is read out and converted into R (POSIXct) format, you need to define two R session variables. These session variables are used to ensure consistency in the date and time conversions.
These session variables are TZ, for the R session timezone setting, and the Oracle ORA_SDTZ setting, for specifying the timezone to be used for your Oracle connections.
The trick here is that these session variables need to be set before you create your ROracle connection. The following is the R code to set these session variables.
> Sys.setenv(TZ = "GMT")
> Sys.setenv(ORA_SDTZ = "GMT")
So you really need to have some knowledge of what kind of dates you are working with in the database, and whether a timezone is part of them or is important. Alternatively you could set the above variables to UTC.
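The same ordering rule (set both variables first, connect second) can be sketched for a Python client. This is only an illustration under assumptions: the helper name `set_session_timezones` is hypothetical, and whether a given Python database driver honours ORA_SDTZ depends on it using the Oracle client libraries, which this post does not cover.

```python
import os
import time


def set_session_timezones(tz="GMT"):
    """Set the process timezone and the Oracle session timezone to the same value.

    Hypothetical helper: it must be called before any database connection
    is opened, mirroring the Sys.setenv() calls in the R code.
    """
    os.environ["TZ"] = tz
    os.environ["ORA_SDTZ"] = tz
    # time.tzset() is only available on Unix-like systems.
    if hasattr(time, "tzset"):
        time.tzset()


set_session_timezones("GMT")
# Only after this point would the database connection be created.
print(os.environ["TZ"], os.environ["ORA_SDTZ"])
```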
Selecting the data (correctly this time)
Now when we select our data from our table in our schema we now get the following, after reconnecting or creating a new connection to your Oracle schema.
> data$DOB
"1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"
Now you can see we do not have any time element to the dates and this is correct in this example. So all is good.
We can now update the data and do whatever processing we want with the data in our R script.
But what happens when we save the data back to our Oracle schema? In the following R code we will add 2 days to the DOB attribute and then create a new table in our schema to save the updated data.
> data$DOB
"1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"
> data$DOB <- data$DOB + days(2)
> data$DOB
"1975-06-03 GMT" "1980-10-23 GMT" "1973-03-14 GMT" "1970-01-27 GMT"
> dbWriteTable(con, "STAFF_2", data, overwrite = TRUE, row.names = FALSE)
TRUE
I’ve used the R package lubridate to do the date and time processing.
When we look at this newly created table in our Oracle schema we will see that we don’t have a DATE datatype for DOB, but instead it is created using a TIMESTAMP datatype.
If you are working with TIMESTAMP and similar data types (i.e. data types that have a timezone element as part of them) then that is a slightly different problem.
When working with Oracle Machine Learning (OML) you are creating notebooks which focus on a particular data exploration and possibly some machine learning. Despite its name, OML is used extensively for data discovery and data exploration.
One of the aims of using OML, or notebooks in general, is that these can be easily shared with other people either within the same team or beyond. Something to consider when sharing notebooks is what you are allowing other people to do with your notebook. Without any permissions you are allowing people to inspect, run and modify the notebooks. This can be a problem because the people you are sharing with may or may not be allowed to make modifications. Some people should be able to just view the notebook, and others should be able to perform more advanced tasks.
With OML Notebooks there are four primary types of people who can access Notebooks and these can have different privileges. These are defined as:
- Developer : Can create new notebooks within a project and workspace but cannot create a workspace or a project. Can create and run a notebook as a scheduled job.
- Viewer : They can just view projects, Workspaces and notebooks. They are not allowed to create or run anything.
- Manager : Can create new notebooks and projects, but can only view Workspaces. Additionally they can schedule notebook jobs.
- Administrators : Administrators of the OML environment do not have any edit capabilities on notebooks. But they can view them.
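The four roles above can be summarised as a small lookup table. The following is a sketch in Python, used here purely for illustration. The action names (`view`, `create_notebook`, `create_project`, `create_workspace`, `schedule_job`) are labels invented for this example and are not OML identifiers; only permissions stated in the list above are encoded.

```python
# Role -> set of allowed actions, taken from the role descriptions above.
# The action names are hypothetical labels, not OML identifiers.
PERMISSIONS = {
    "Developer": {"view", "create_notebook", "schedule_job"},
    "Viewer": {"view"},
    "Manager": {"view", "create_notebook", "create_project", "schedule_job"},
    "Administrator": {"view"},
}


def can(role, action):
    """Return True if the role is listed as allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())


for role in PERMISSIONS:
    allowed = ", ".join(sorted(PERMISSIONS[role]))
    print(f"{role}: {allowed}")
```

For example, `can("Developer", "create_project")` is False while `can("Manager", "create_project")` is True, matching the descriptions above. No role in the list is described as being able to create a workspace, so `create_workspace` is False for all four.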
When using Oracle Machine Learning notebooks, you can export and import these between different projects and different environments (from ADW to ATP).
But something to watch out for when you import a notebook into your ADW or ATP environment is to reset the Interpreter Bindings.
When you create a new OML Notebook and build it up, the various Interpreter Bindings are automatically set or turned on. But for Imported OML Notebooks they are not turned on.
I’m assuming this will be fixed at some future point.
If you import an OML Notebook and do not turn on the Interpreter Bindings you may find the code in your notebook cells running very slowly.
To turn on these bindings, click on the options icon as indicated by the red box in the following image.
You will get something like the following being displayed. None of the bindings are highlighted.
To enable the Interpreter Bindings just click on each of these boxes. When you do this each one will be highlighted and will turn a blue color.
All done! You can now run your OML Notebooks without any problems or delays.
Oracle Autonomous Data Warehouse (ADW) has been out a while now and has had several behind-the-scenes improvements and new features added.
If you have used the Oracle Machine Learning (OML) component of ADW you will have seen the various sample OML Notebooks that come pre-loaded. These are easy to open, use and to try out the various OML features.
The above image shows the top part of the login screen for OML. To see the available sample notebooks click on the Examples icon. When you do, you will get the following sample OML Notebooks.
But what if you have a notebook you have used elsewhere? These can be exported in JSON format and loaded as a new notebook in OML.
To load a new notebook into OML, select the icon (three horizontal lines) in the top left-hand corner of the screen. Then select Notebooks from the menu.
Then select the Import button located at the top of the Notebooks screen. This will open a File window, where you can select the JSON file from your file system.
A couple of seconds later the notebook will be available and listed alongside any other notebooks you may have created.
You have now imported a new notebook into OML and can now use it to process your data and perform machine learning using the in-database features.
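Before importing, it can be worth checking that the exported file really is well-formed JSON. The following is a minimal sketch in Python, used here purely for illustration. The function name `check_notebook_export` is hypothetical, and the `name` and `paragraphs` keys are an assumption about a Zeppelin-style export; the post does not describe the file layout.

```python
import json


def check_notebook_export(text):
    """Parse an exported notebook and return a small summary, or raise ValueError."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    # "name" and "paragraphs" are assumed keys, not confirmed by the post.
    paragraphs = doc.get("paragraphs")
    if not isinstance(paragraphs, list):
        paragraphs = []
    return {"name": doc.get("name"), "paragraph_count": len(paragraphs)}


# A made-up, minimal export used only to exercise the check.
sample = '{"name": "My Notebook", "paragraphs": [{"text": "%sql select 1"}]}'
summary = check_notebook_export(sample)
print(summary)
```

In practice the text would be read from the exported file on your file system before passing it to the function.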