A very popular tool for data scientists is RStudio. This tool allows you to interactively work with your R code, view the R console, the graphs and charts you create, manage the various objects and data frames you create, as well shaving easy access to the R help documentation. Basically it is a core everyday tool.
The typical approach is to have RStudio installed on your desktop or laptop. What this really means is that the data is pulled to your desktop or laptop and all analytics is performed there. In most cases this is fine but as your data volumes goes does does the limitations of using R on your local machine.
An alternative is to install a version called RStudio Server on an analytics server or on the database server. You can now use the computing capabilities of this server to overcome some of the limitations of using R or RStudio locally. Now you will use your web browser to access RStudio Server on your database server.
In this blog post I will walk you through how to install and get connected to RStudio Server on the Oracle BigDataLite VM.
After starting up the Oracle BigDataLite VM and logging into the Oracle user (password=welcome1) you will see the Start Here icon on the desktop. You will need to double click on this.
This will open a webpage on the VM that contains details of all the various tools that are installed on the VM or are ready for you to install and configure. This information contains all the http addresses and ports you need to access each of these tools via a web browser or some other way, along with the usernames and passwords you need to use them.
One of the tools lists is for RStudio Server. This product is not installed on the VM but Oracle has provided a script that you can run to perform the install in an automated way. This script is located in:
[oracle@bigdatalite ~]$ cd /home/oracle/scripts/
Use the following command to run the RStudio Server install script.
[oracle@bigdatalite scripts]$ ./install_rstudio.sh
The following is the output from running this script and it will be displayed in your terminal window. You can use this to monitor the progress of the installation.
Retrieving RStudio --2016-03-12 02:06:15-- https://download2.rstudio.org/rstudio-server-rhel-0.99.489-x86_64.rpm Resolving download2.rstudio.org... 18.104.22.168, 22.214.171.124, 126.96.36.199, ... Connecting to download2.rstudio.org|188.8.131.52|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 34993428 (33M) [application/x-redhat-package-manager] Saving to: `rstudio-server-rhel-0.99.489-x86_64.rpm' 100%[======================================>] 34,993,428 5.24M/s in 10s 2016-03-12 02:06:26 (3.35 MB/s) - `rstudio-server-rhel-0.99.489-x86_64.rpm' saved [34993428/34993428] Installing RStudio Loaded plugins: refresh-packagekit, security, ulninfo Setting up Install Process Examining rstudio-server-rhel-0.99.489-x86_64.rpm: rstudio-server-0.99.489-1.x86_64 Marking rstudio-server-rhel-0.99.489-x86_64.rpm to be installed public_ol6_UEKR3_latest | 1.2 kB 00:00 public_ol6_UEKR3_latest/primary | 22 MB 00:03 public_ol6_UEKR3_latest 568/568 public_ol6_latest | 1.4 kB 00:00 public_ol6_latest/primary | 55 MB 00:12 public_ol6_latest 33328/33328 Resolving Dependencies --> Running transaction check ---> Package rstudio-server.x86_64 0:0.99.489-1 will be installed --> Finished Dependency Resolution Dependencies Resolved ================================================================================ Package Arch Version Repository Size ================================================================================ Installing: rstudio-server x86_64 0.99.489-1 /rstudio-server-rhel-0.99.489-x86_64 251 M Transaction Summary ================================================================================ Install 1 Package(s) Total size: 251 M Installed size: 251 M Downloading Packages: Running rpm_check_debug Running Transaction Test Transaction Test Succeeded Running Transaction Installing : rstudio-server-0.99.489-1.x86_64 1/1 useradd: user 'rstudio-server' already exists groupadd: group 'rstudio-server' already exists rsession: no process killed rstudio-server start/running, process 5037 Verifying : rstudio-server-0.99.489-1.x86_64 1/1 Installed: rstudio-server.x86_64 0:0.99.489-1 Complete! Restarting RStudio rstudio-server stop/waiting rsession: no process killed rstudio-server start/running, process 5066
When the installation is finished you are now ready to connect to the RStudio Server. So open your web browser and enter the following into the address bar.
The initial screen you are presented with is a login screen. Enter your Linux username and password. In the case of the BigDataLite VM this will be oracle/welcome1.
Then you will be presented with the RStudio Server application in your web browser, as shown below. As you can see it is very similar to using RStudio on your desktop. Happy Days! You are now setup and able to run RStudio on the database server.
Make sure to log out of RStudio Server before closing down the window.
If you don’t log out of RStudio Server then the next time you open RStudio Server your session will automatically open. Perhaps this is not the best for security, so try to remember to log out each time.
By now using RStudio Server on the Oracle Database server I can not get some of the benefits of computing capabilities of this server. Although there are still the typical limitations with of using R. But now I access RStudio on the database server and process the data on the database server, all from my local PC or laptop.
Everything is nicely setup and ready for you to install on the BigDataLite VM (thank you Oracle). But what about when we want to install RStudion Server on a different server. What are the steps necessary to install, configure and log in. Yes they should be similar but I will give a complete list of steps in my next blog post.
When you comes to working with R to access and process your data there are a number of little features and behaviours you need to look out for.
One of these is the DATE datatype.
The main issue that you have to look for is the TIMEZONE conversion that happens then you extract the data from the database into your R environment.
There is a datatype conversions from the Oracle DATE into the POSIXct format. The POSIXct datatype also includes the timezone. But the Oracle DATE datatype does not have a Timezone part of it.
When you look into this a bit more you will see that the main issue is what Timezone your R session has. By default your R session will inherit the OS session timezone. For me here in Ireland we have the time timezone as the UK. You would time that the timezone would therefore be GMT. But this is not the case. What we have for timezone is BST (or British Standard Time) and this takes into account the day light savings time. So on the 26th May, BST is one hour ahead of GMT.
This issue is not just between R and Oracle, but there are some inconsistencies in some of Oracle’s other tools list SQL Developer, SQLcl and SQL*Plus.
SQLcl & SQL Developer
OK. Let’s get back to the issue with we are using R to analyse our data.
As mentioned above, when I select date of type DATE from Oracle into R, using ROracle, I end up getting a different date value than what was in the database. Similarly when I process and store the data.
The following outlines the data setup and some of the R code that was used to generate the issue/problem.
Create a table that contains a DATE field and insert some records.
CREATE TABLE STAFF (STAFF_NUMBER VARCHAR2(20), FIRST_NAME VARCHAR2(20), SURNAME VARCHAR2(20), DOB DATE, PROG_CODE VARCHAR2(6 BYTE), PRIMARY KEY (STAFF_NUMBER)); insert into staff values (123456789, 'Brendan', 'Tierney', to_date('01/06/1975', 'DD/MM/YYYY'), 'DEPT_1'); insert into staff values (234567890, 'Sean', 'Reilly', to_date('21/10/1980', 'DD/MM/YYYY'), 'DEPT_2'); insert into staff values (345678901, 'John', 'Smith', to_date('12/03/1973', 'DD/MM/YYYY'), 'DEPT_3'); insert into staff values (456789012, 'Barry', 'Connolly', to_date('25/01/1970', 'DD/MM/YYYY'), 'DEPT_4');
You can query this data in SQL without any problems. As you can see there is no timezone element to these dates.
Selecting the data
I now establish my connection to my schema in my 12c database using ROracle. I won’t bore you with the details here of how to do it but check out point 3 on this post for some details.
When I select the data I get the following.
> res<-dbSendQuery(con, “select * from staff”) > data data$DOB
 “1975-06-01 01:00:00 BST” “1980-10-21 01:00:00 BST” “1973-03-12 00:00:00 BST”
 “1970-01-25 01:00:00 BST”
As you can see two things have happened to my date data when it has been extracted from Oracle. Firstly it has assigned a timezone to the data, even though there was no timezone part of the original data. Secondly it has performed some sort of timezone conversion to from GMT to BST. The difference between GMT and BTS is the day light savings time. Hence the 01:00:00 being added to the time element that was extract. This time should have been 00:00:00. You can see we have a mixture of times!
So there appears to be some difference between the R date or timezone to what is being used in Oracle.
To add to this problem I was playing around with some dates and different records. I kept on getting this scenario but I also got the following, where we have a mixture of GMT and BST times and timezones. I’m not sure why we would get this mixture.
 “1995-01-19 00:00:00 GMT” “1965-06-20 01:00:00 BST” “1973-10-20 01:00:00 BST”
 “2000-12-28 00:00:00 GMT”
This is all a bit confusing and annoying. So let us look at how you can now fix this.
Fixing the problem : Setting Session variables
What you have to do to fix this and to ensure that there is consistency between that is in Oracle and what is read out and converted into R (POSIXct) format, you need to define two R session variables. These session variables are used to ensure the consistency in the date and time conversions.
These session variables are TZ for the R session timezone setting and Oracle ORA_SDTZ setting for specifying the timezone to be used for your Oracle connections.
The trick there is that these session variables need to be set before you create your ROracle connection. The following is the R code to set these session variables.
> Sys.setenv(TZ = “GMT”)
> Sys.setenv(ORA_SDTZ = “GMT”)
So you really need to have some knowledge of what kind of Dates you are working with in the database and if a timezone if part of it or is important. Alternatively you could set the above variables to UDT.
Selecting the data (correctly this time)
Now when we select our data from our table in our schema we now get the following, after reconnecting or creating a new connection to your Oracle schema.
 “1975-06-01 GMT” “1980-10-21 GMT” “1973-03-12 GMT” “1970-01-25 GMT”
Now you can see we do not have any time element to the dates and this is correct in this example. So all is good.
We can now update the data and do whatever processing we want with the data in our R script.
But what happens when we save the data back to our Oracle schema. In the following R code we will add 2 days to the DOB attribute and then create a new table in our schema to save the updated data.
I’ve used the R package Libridate to do the date and time processing.
 “1975-06-01 GMT” “1980-10-21 GMT” “1973-03-12 GMT” “1970-01-25 GMT”
> data$DOB data$DOB
 “1975-06-03 GMT” “1980-10-23 GMT” “1973-03-14 GMT” “1970-01-27 GMT”
> dbWriteTable(con, “STAFF_2”, data, overwrite = TRUE, row.names = FALSE)
When we look at this newly created table in our Oracle schema we will see that we don’t have DATA datatype for DOB, but instead it is created using a TIMESTAMP data type.
If you are working with TIMESTAMP etc type of data types (i.e. data types that have a timezone element that is part of it) then that is a slightly different problem. Perhaps one that I’ll look at soonish.