OCI Data Science – Create a Project & Notebook, and Explore the Interface


In my previous blog post I went through the steps of setting up OCI to allow you to access OCI Data Science. Those steps showed the setup and configuration for your Data Science Team.

Screenshot 2020-02-11 20.46.42

In this post I will walk through the steps necessary to create an OCI Data Science Project and Notebook, and will then Explore the basic Notebook environment.

1 – Create a Project

From the main menu on the Oracle Cloud home page select Data Science -> Projects from the menu.

Screenshot 2020-02-12 12.07.19

Select the appropriate Compartment in the drop-down list on the left hand side of the screen. In my previous blog post I created a separate Compartment for my Data Science work and team. Then click on the Create Projects button.

Screenshot 2020-02-12 12.09.11

Enter a name for your project. I called this project, ‘DS-Demo-Project’. Then click the Create button.

Screenshot 2020-02-12 12.13.44

Screenshot 2020-02-12 12.14.44

That’s the Project created.
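If you prefer to script this kind of setup, the same project can also be created with the OCI Python SDK. The following is only a sketch, assuming the oci Python library is installed and configured (using the default ~/.oci/config file); the compartment OCID shown is a placeholder for your own.

import oci

# load the default OCI config (~/.oci/config); assumes the SDK/CLI has already been configured
config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

# placeholder OCID - replace with the OCID of your Data Science compartment
compartment_id = "ocid1.compartment.oc1..<your_compartment_ocid>"

project_details = oci.data_science.models.CreateProjectDetails(
    compartment_id=compartment_id,
    display_name="DS-Demo-Project",
    description="Demo Data Science project"
)

project = ds_client.create_project(project_details)
print(project.data.id)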

2 – Create a Notebook

After creating a project (see above) you can now create one or more Notebook Sessions.

To create a Notebook Session click on the Create Notebook Session button (see the above image). This will create a VM to contain your notebook and associated work. Just like all VMs in Oracle Cloud, they come in various shapes. The shape can be adjusted at a later time to scale up and then back down based on the work you will be performing.

The following example creates a Notebook Session using the basic VM shape. I call the Notebook ‘DS-Demo-Notebook’. I also set the Block Storage size to 50GB, which is the minimum value. The VCN details default to those assigned to the Compartment. Click the Create button at the bottom of the page.

Screenshot 2020-02-12 12.22.24

The Notebook Session VM will be created. This might take a few minutes. When created you will see a screen like the following.

Screenshot 2020-02-12 12.31.21
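For reference, the notebook session can also be created through the OCI Python SDK rather than the console. Again this is only a sketch, assuming the SDK is configured; the OCIDs and shape name below are placeholders, and the model class names should be checked against the SDK documentation for your version.

import oci

config = oci.config.from_file()
ds_client = oci.data_science.DataScienceClient(config)

# all OCIDs below are placeholders - replace with your own values
notebook_details = oci.data_science.models.CreateNotebookSessionDetails(
    compartment_id="ocid1.compartment.oc1..<your_compartment_ocid>",
    project_id="ocid1.datascienceproject.oc1..<your_project_ocid>",
    display_name="DS-Demo-Notebook",
    notebook_session_configuration_details=oci.data_science.models.NotebookSessionConfigurationDetails(
        shape="VM.Standard2.1",              # basic VM shape, as used in the example above
        block_storage_size_in_gbs=50,        # minimum block storage size
        subnet_id="ocid1.subnet.oc1..<your_subnet_ocid>"
    )
)

notebook = ds_client.create_notebook_session(notebook_details)
print(notebook.data.lifecycle_state)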

3 – Open the Notebook

After completing the above steps you can now open the Notebook Session in your browser. Either click on the Open button (see above image), or copy the link and share it with your data science team.

Important: There are a few important considerations when using the Notebooks. While the session is running you will be paying for it, even if you close the browser tab or lose your connection. To manage costs, you may need to stop the Notebook Session. More details on this in a later post.

After clicking on the Open button, a new browser tab will open and will ask you to log-in.

Screenshot 2020-02-12 12.35.26

After logging in you will see your Notebook.

Screenshot 2020-02-12 12.37.42

4 – Explore the Notebook Environment

The Notebook comes pre-loaded with lots of goodies.

The menu on the left-hand side provides a directory with lots of sample Notebooks, access to the block storage and a sample getting started Notebook.

Screenshot 2020-02-12 12.41.09

When you are ready to create your own Notebook you can click on the icon for that.

Screenshot 2020-02-12 12.42.50

Or if you already have a Notebook, created elsewhere, you can upload it into your OCI Data Science environment.

Screenshot 2020-02-12 12.44.50

The uploaded Notebook will appear in the list on the left-hand side of the screen.

OCI Data Science – Initial Setup and Configuration


After a very, very, very long wait (18+ months) the Oracle OCI Data Science platform is now available.

But before you jump straight into using OCI Data Science, there is a little bit of setup required for your Cloud Tenancy. There is a simple approach and a slightly more involved approach. These are:

  • Simple approach. Assuming you are just going to use the root tenancy and compartment, you just need to set up a new policy to enable the use of the OCI Data Science services. This assumes you have your VCN configuration complete, with NAT etc. This can be done by creating a policy with the following policy statement. After creating this you can proceed with creating your first notebook in OCI Data Science.
allow service datascience to use virtual-network-family in tenancy

Screenshot 2020-02-11 19.46.38

  • Slightly more complicated approach. When you move to a team-based approach you will need to create some additional Oracle Cloud components to manage the team and the resources allocated to them. This involves creating Compartments, allocating users, VCNs, Policies, etc. The following instructions bring you through these steps.

IMPORTANT: If, after creating a Compartment or some of the other components listed below, they are not displayed in the expected drop-down lists, then either refresh your screen or log out and log back in again!

1. Create a Group for your Data Science Team & Add Users

The first step involves creating a Group to ‘group’ the various users who will be using the OCI Data Science services.

Go to Governance and Administration ->Identity and click on Groups.

Enter some basic descriptive information. I called my Group, ‘my-data-scientists’.

Now click on your Group in the list of Groups and add the users to the group.

You may need to create the accounts for the various users.

Screenshot 2020-02-11 12.03.58

2. Create a Compartment for your Data Science work

Now create a new Compartment to own the network resources and the Data Science resources.

Go to Governance and Administration ->Identity and click on Compartments.

Enter some basic descriptive information. I’ve called my compartment, ‘My-DS-Compartment’.

3. Create Network for your Data Science work

Creating and setting up the VCN can be a little bit of fun. You can do it the manual way, whereby you set up and configure everything yourself, or you can use the wizard. I’m going to show the wizard approach below.

But the first thing you need to do is select the Compartment the VCN will belong to. Select this from the drop-down list on the left-hand side of the Virtual Cloud Network page. If your compartment is not listed, then log out and log back in!

To use the wizard approach click the Networking QuickStart button.

Screenshot 2020-02-11 20.15.28

Select the option ‘VCN with Internet Connectivity’ and click Start Workflow, as you will want to connect to it and to allow the service to connect to other cloud services.

Screenshot 2020-02-11 20.17.22

I called my VCN ‘My-DS-vnc’ and took the default settings. Then click the Next button.

Screenshot 2020-02-11 20.19.31

The next screen shows a summary of what will be done. Click the Create button, and all of these networking components will be created.

Screenshot 2020-02-11 20.22.39

All done with creating the VCN.

4. Create the required Policies to enable OCI Data Science for your Compartment

There are three policies needed to allocate the necessary resources to the various components we have just created. To create these go to Governance and Administration -> Identity and click on Policies.

Select your Compartment from the drop-down list. This should be ‘My-DS-Compartment’, then click on Create Policy.

The first policy gives the group permission to manage the Data Science services in the compartment. I called this policy, ‘DS-Manage-Access’.

allow group My-data-scientists to manage data-science-family in compartment My-DS-Compartment

Screenshot 2020-02-11 20.30.10

The next policy is to give the Data Science users access to the network resources. I called this policy, ‘DS-Manage-Network’.

allow group My-data-scientists to use virtual-network-family in compartment My-DS-Compartment

Screenshot 2020-02-11 20.37.47

And the third policy is to give Data Science service access to the network resources. I called this policy, ‘DS-Network-Access’.

allow service datascience to use virtual-network-family in compartment My-DS-Compartment

Screenshot 2020-02-11 20.41.01
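If you would rather script the policy creation than use the console, the three statements above can also be created with the OCI Python SDK Identity client. This is just a sketch, assuming the SDK is configured; the compartment OCID is a placeholder, and for brevity the three statements are combined into a single policy here.

import oci

config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)

# placeholder OCID for the compartment that will own the policy
compartment_id = "ocid1.compartment.oc1..<your_compartment_ocid>"

statements = [
    "allow group My-data-scientists to manage data-science-family in compartment My-DS-Compartment",
    "allow group My-data-scientists to use virtual-network-family in compartment My-DS-Compartment",
    "allow service datascience to use virtual-network-family in compartment My-DS-Compartment"
]

policy_details = oci.identity.models.CreatePolicyDetails(
    compartment_id=compartment_id,
    name="DS-Policies",
    description="Policies for the Data Science team",
    statements=statements
)

policy = identity.create_policy(policy_details)
print(policy.data.id)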

Job Done 🙂

You are now set up to run the OCI Data Science service. Check out my Blog Post on creating your first OCI Data Science Notebook and exploring what is available in this Notebook.

Data Science (The MIT Press Essential Knowledge series) – available in English, Korean and Chinese


Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. The book was published as part of their Essential Knowledge series.

During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!

2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.

The Japanese and Turkish translations will be available in a few months!

Go get the English version of the book on Amazon in print, Kindle and Audio formats.

https://amzn.to/2qC84KN

This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenges. The goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.

Go check it out.

Amazon.com.          Amazon.co.uk

Screenshot 2020-02-05 11.46.03

Scottish Whisky Data Set – Updated


The Scottish Whisky data set consists of tasting notes and evaluations from 86 distilleries around Scotland. This data set has been around a long time and was originally published on a promotional site for a book, Whisky Classified: Choosing Single Malts by Flavour. Written by David Wishart of the University of Saint Andrews, the book had its most recent printing in February 2012.

I’ve been using this data set in one of my conference presentations (Planning my Summer Vacation), but to use it I needed to add two new attributes/features to the data set. All of the attributes are listed below; the last two are the ones I added. These contain the latitude and longitude converted into a format comparable with Google Maps and other similar mapping technology.

Attributes include:

  • RowID
  • Distillery
  • Body
  • Sweetness
  • Smoky
  • Medicinal
  • Tobacco
  • Honey
  • Spicy
  • Winey
  • Nutty
  • Malty
  • Fruity
  • Floral
  • Postcode
  • Latitude
  • Longitude
  • lat  — newly added
  • long  — newly added

Here is the link to download and use this updated Scottish Whisky data set.
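Once downloaded, a quick way to check the data set and the newly added columns is with pandas. A minimal sketch, assuming the file has been saved locally as whisky.csv and the column names are as listed above:

import pandas as pd

# load the downloaded data set (the filename is an assumption)
whisky = pd.read_csv('whisky.csv')

# check the original location attributes and the newly added lat/long columns
print(whisky[['Distillery', 'Latitude', 'Longitude', 'lat', 'long']].head())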

The original website is no longer available, but if you have a look at the Internet Archive you will find it.

Screenshot 2020-01-23 14.44.53

Python-Connecting to multiple Oracle Autonomous DBs in one program


More and more people are using the FREE Oracle Autonomous Database for building new applications, or are migrating to it.

I’ve previously written about connecting to an Oracle Database using Python. Check out that post for details of how to setup Oracle Client and the Oracle Python library cx_Oracle.

In that blog post I gave examples of connecting to an Oracle Database using the HostName (or IP address), the Service Name or the SID.

But with the Oracle Autonomous Database (ADW or ATP) things are a little bit different: you will need to use an Oracle Wallet file. This file contains some of the connection details, but you don’t have access to the ServiceName/SID, HostName, etc. Instead you have the name of the Autonomous Database. The Wallet is used to create a secure connection to the Autonomous Database.

You can download the Wallet file from the Database console on Oracle Cloud.

Screenshot 2020-01-10 12.24.10

Most people end up working with multiple databases. Sometimes these can be combined into one TNSNAMES file, which can make things simple and easy. To use the downloaded TNSNAMES file you will need to set the TNS_ADMIN environment variable. This allows Python and the cx_Oracle library to automatically pick up this file so you can connect to the ATP/ADW Database.

But most people don’t work with just a single database or use a single TNSNAMES file. In most cases you need to switch between different database connections and hence need to use multiple TNSNAMES files.

The question is how can you switch between ATP/ADW Database using different TNSNAMES files while inside one Python program?

Use the os.environ setting in Python. This allows you to reassign the TNS_ADMIN environment variable to point to a new directory containing the TNSNAMES file. This assignment is temporary and only overrides the TNS_ADMIN environment variable for the current Python process.

For example,

import cx_Oracle
import os

os.environ['TNS_ADMIN'] = "/Users/brendan.tierney/Dropbox/wallet_ATP"

p_username = ''
p_password = ''
p_service = 'atp_high'
con = cx_Oracle.connect(p_username, p_password, p_service)

print(con)
print(con.version)
con.close()

I can now easily switch to another ATP/ADW Database, in the same Python program, by changing the value of os.environ and opening a new connection.

import cx_Oracle
import os

os.environ['TNS_ADMIN'] = "/Users/brendan.tierney/Dropbox/wallet_ATP"
p_username = ''
p_password = ''
p_service = 'atp_high'
con1 = cx_Oracle.connect(p_username, p_password, p_service)
...
con1.close()

...
os.environ['TNS_ADMIN'] = "/Users/brendan.tierney/Dropbox/wallet_ADW2"
p_username = ''
p_password = ''
p_service = 'ADW2_high'
con2 = cx_Oracle.connect(p_username, p_password, p_service)
...
con2.close()

As mentioned previously, the setting and resetting of TNS_ADMIN using os.environ is only temporary; when your Python program exits or completes, the original value of the environment variable remains in place.
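If you find yourself switching between wallets a lot, the same idea can be wrapped up in a small helper function. This is just a sketch based on the approach above; the wallet directories, credentials and service names are placeholders for your own.

import os
import cx_Oracle

def connect_with_wallet(wallet_dir, username, password, service):
    # point TNS_ADMIN at the directory containing the wallet/TNSNAMES file
    # before opening the connection; this only affects the current process
    os.environ['TNS_ADMIN'] = wallet_dir
    return cx_Oracle.connect(username, password, service)

# example usage (paths, credentials and service names are placeholders)
con1 = connect_with_wallet('/Users/brendan.tierney/Dropbox/wallet_ATP', 'user1', 'pwd1', 'atp_high')
print(con1.version)
con1.close()

con2 = connect_with_wallet('/Users/brendan.tierney/Dropbox/wallet_ADW2', 'user2', 'pwd2', 'ADW2_high')
print(con2.version)
con2.close()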

#GE2020 Comparing Party Manifestos to 2016


A few days ago I wrote a blog post about using Python to analyze the 2016 general election manifestos of the four main political parties in Ireland.

Today the two (traditional) largest parties released their #GE2020 manifestos. You can get them by following these links

The following images show the WordClouds generated for the #GE2020 Manifestos. I used the same Python code used in my previous post. If you want to try this out yourself, all the Python code is there.

First let us look at the WordClouds from Fine Gael.

FG2020
2020 Manifesto
FG_2016
2016 Manifesto

Now for the Fianna Fail WordClouds.

FF2020
2020 Manifesto
FF_2016
2016 Manifesto

When you look closely you will notice some common themes carried over from the 2016 manifestos to the 2020 manifestos. It is also interesting to see some new words appearing and other words disappearing in the 2020 manifestos. Some of these are a little surprising and interesting.

#GE2020 Analysing Party Manifestos using Python


The general election is underway here in Ireland with polling day set for Saturday 8th February. All the politicians are out campaigning and every day the various parties are looking for publicity on whatever the popular topic is for that day. Each day it is a different topic.

Most of the political parties have not released their manifestos for the #GE2020 election (as of the date of this post). I want to use some simple Python code to perform some analysis of their manifestos. As their new manifestos weren’t available (yet) I went looking for their manifestos from the previous general election. Michael Pidgeon has a website with party manifestos dating back to the early 1970s, and also has some from earlier elections. Check out his website.

I decided to look at the manifestos from the 4 main political parties from the 2016 general election. Yes, there are other manifestos available, and you can use the Python code given below to analyse those, with only some minor edits required.

The end result of this simple analysis is a WordCloud showing the most commonly used words in each manifesto. This is a graphical way to see what some of the main themes and emphasis are for each party, and also allows us to see some commonality between the parties.

Let’s begin with the Python code.

1 – Initial Setup

There are a number of Python libraries available for processing PDF files. Not all of them worked on all of the Party Manifesto PDFs! It kind of depends on how these files were generated. In my case I used the pdfminer library, as it worked with all four manifestos. The common library PyPDF2 didn’t work with the Fine Gael manifesto document.

import io
import pdfminer
from pprint import pprint
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#directory where the manifestos are located
wkDir = '.../General_Election_Ire/'

#define the names of the Manifesto PDF files & setup party flag
pdfFile = wkDir+'FGManifesto16_2.pdf'
party = 'FG'
#pdfFile = wkDir+'Fianna_Fail_GE_2016.pdf'
#party = 'FF'
#pdfFile = wkDir+'Labour_GE_2016.pdf'
#party = 'LB'
#pdfFile = wkDir+'Sinn_Fein_GE_2016.pdf'
#party = 'SF'

All of the following code will run for a given manifesto. Just comment in or out the manifesto you are interested in. The WordClouds for each are given below.

2 – Load the PDF File into Python

The following code loops through each page in the PDF file and extracts the text from that page.

I added some additional code to ignore pages containing the Irish language. The Sinn Fein Manifesto contained a number of pages which were the Irish equivalent of the preceding pages in English. I didn’t want to have a mixture of languages in the final output.

SF_IrishPages = [14,15,16,17,18,19,20,21,22,23,24]
text = ""

pageCounter = 0
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

for page in PDFPage.get_pages(open(pdfFile,'rb'), caching=True, check_extractable=True):
    if (party == 'SF') and (pageCounter in SF_IrishPages):
        print(party+' - Not extracting page - Irish page', pageCounter)
    else:
        print(party+' - Extracting Page text', pageCounter)
        page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    pageCounter += 1

print('Finished processing PDF document')
converter.close()
fake_file_handle.close()
FG - Extracting Page text 0
FG - Extracting Page text 1
FG - Extracting Page text 2
FG - Extracting Page text 3
FG - Extracting Page text 4
FG - Extracting Page text 5
...

3 – Tokenize the Words

The next step is to Tokenize the text. This breaks the text into individual words.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = []

tokens = word_tokenize(text)

print('Number of Pages =', pageCounter)
print('Number of Tokens =',len(tokens))
Number of Pages = 140
Number of Tokens = 66975

4 – Filter words, Remove Numbers & Punctuation

There will be a lot of things in the text that we don’t want included in the analysis. We want the text to only contain words. The following extracts the words and ignores numbers, punctuation, etc.

#converts to lower case, and removes punctuation and numbers
wordsFiltered = [token.lower() for token in tokens if token.isalpha()]
print(len(wordsFiltered))
print(wordsFiltered)
58198
['fine', 'gael', 'general', 'election', 'manifesto', 's', 'keep', 'the', 'recovery', 'going', 'gaelgeneral', 'election', 'manifesto', 'foreward', 'from', 'an', 'taoiseach', 'the', 'long', 'term', 'economic', 'three', 'steps', 'to', 'keep', 'the', 'recovery', 'going', 'agriculture', 'and', 'food', 'generational',
...

As you can see the number of tokens has reduced from 66,975 to 58,198.

5 – Setup Stop Words

Stop words are common words in a language that don’t carry much meaning, and these can be removed from the data set. Python NLTK comes with a set of stop words defined for most languages.

#We initialize the stopwords variable which is a list of words like 
#"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
....

Additional stop words can be added to this list. I added the words listed below. Some of these you might expect to be in the stop word list, others are to remove certain words that appeared in the various manifestos but don’t have a lot of meaning. I also added the names of the parties and some Irish words to the stop words list.

#some extra stop words are needed after examining the data and word cloud
#these are added
extra_stop_words = ['ireland','irish','ł','need', 'also', 'set', 'within', 'use', 'order', 'would', 'year', 'per', 'time', 'place', 'must', 'years', 'much', 'take','make','making','manifesto','ð','u','part','needs','next','keep','election', 'fine','gael', 'gaelgeneral', 'fianna', 'fáil','fail','labour', 'sinn', 'fein','féin','atá','go','le','ar','agus','na','ár','ag','haghaidh','téarnamh','bplean','page','two','number','cothromfor']
stop_words.extend(extra_stop_words)
print(stop_words)

Now remove these stop words from the list of tokens.

# remove stop words from tokenised data set
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(len(filtered_words))
print(filtered_words)
31038
['general', 'recovery', 'going', 'foreward', 'taoiseach', 'long', 'term', 'economic', 'three', 'steps', 'recovery', 'going', 'agriculture', 'food',

The number of tokens is reduced to 31,038.

6 – Word Frequency Counts

Now calculate how frequently these words occur in the list of tokens.

#get the frequency of each word
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
    cnt[word] += 1

print(cnt)
Counter({'new': 340, 'support': 249, 'work': 190, 'public': 186, 'government': 177, 'ensure': 177, 'plan': 176, 'continue': 168, 'local': 150, 
...
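If you just want a quick look at the most frequent words before generating the WordCloud, the Counter object has a most_common method:

# show the 20 most frequent words and their counts
for word, count in cnt.most_common(20):
    print(word, count)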

7 – WordCloud

We can use the word frequency counts to add emphasis to the WordCloud. The more frequently a word occurs, the larger it will appear in the WordCloud.

#create a word cloud using frequencies for emphasis 
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=100, margin=9, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20,10))
plt.imshow(wc)
#plt.axis("off")
plt.show()

#Save the image in the img folder:
wc.to_file(wkDir+party+"_2016.png")

The last line of code saves the WordCloud image as a file in the directory where the manifestos are located.

8 – WordClouds for Each Party

Screenshot 2020-01-21 11.10.25

Remember these WordClouds are for the manifestos from the 2016 general election.

When the parties have released their manifestos for the 2020 general election, I’ll run them through this code and produce the WordClouds for 2020. It will be interesting to see the differences between the 2016 and 2020 manifesto WordClouds.

Responsible AI: Principles & Standards around the World


During 2019 there was an increased awareness of AI and the need for Responsible AI. During 2020 (and beyond) we will see more and more on this topic. To get you started on some of the details and some background reading, here are links to various Principles and Standards for Responsible AI from around the world.

  • EU AI Ethics Guidelines – The Ethics Guidelines for Trustworthy Artificial Intelligence, developed by the EU High-Level Expert Group on AI, highlight that trustworthy AI should be lawful, ethical and robust. They put forward seven key requirements that AI systems should meet in order to be deemed trustworthy, including among others diversity, non-discrimination, societal and environmental well-being, transparency and accountability.
  • OECD Principles on Artificial Intelligence – OECD member countries, along with partner countries, adopted the first ever set of intergovernmental policy guidelines on AI, agreeing to uphold international standards that aim to ensure AI systems are designed in a way that respects the rule of law, human rights, democratic values and diversity. They emphasise that AI should benefit people and the planet by driving inclusive growth, sustainable development and well-being.
  • CoE: Human Rights Impacts of Algorithms – A Council of Europe draft recommendation on the human rights impacts of algorithmic AI systems, released for consultation in August 2019 and to be adopted in early 2020. The document explicitly refers to the UN Guiding Principles on Business and Human Rights as guidance for due diligence processes and Human Rights Impact Assessments.
  • IEEE Global Initiative: Ethically Aligned Design – The Ethically Aligned Design (EAD) document was created to educate a broader public and to inspire academics, engineers, policy makers and manufacturers of autonomous and intelligent systems to take action on prioritising ethical considerations. The general principles for AI design, manufacturing and use include: human rights, well-being, data agency, effectiveness, transparency, accountability, awareness of misuse, and competence. The IEEE P7000 Standards series addresses specific issues at the intersection of technology and ethics, aiming to empower innovation across borders and enable societal benefit.
  • UN Sustainable Development Goals – The annual AI for Good Global Summit is the leading UN platform for global and inclusive dialogue on how artificial intelligence could help accelerate progress towards the Global Goals.
  • UN Business and Human Rights – The UN Guiding Principles on Business and Human Rights (UNGPs) provide a framework and a roadmap for navigating responsibility-related challenges. With rapid technological disruption and rising inequality, business has a unique opportunity to implement human-centred innovation by taking into account the social, ethical and human rights implications of AI.
  • EU Collaborative Platforms and Social Learning – Several EU countries have articulated their ambitions related to artificial intelligence; it is of paramount importance to find your unique voice, track and join essential conversations, strategically engage in collective efforts and leave a meaningful digital footprint.

Machine Learning Evaluation Measures


When developing machine learning models there is a long list of possible evaluation measures. On one hand this can be good, as it gives us lots of insight into the models and allows us to select the best model that meets the requirements. (BTW this is different to choosing the best model based on the evaluation measures!) On the other hand it can be very confusing to know what all of these mean, as there can appear to be so many of them. In this post I’ll look at some of these evaluation measures.

I’m not going to go into the basic set of evaluation measures that come from the typical use of the Confusion Matrix, including True/False Positives, True/False Negatives, Accuracy, Misclassification Rate, Precision, Recall, Sensitivity and F1 score.

The following evaluation measures will be discussed:

  • R-Squared (R2)
  • Mean Squared Error (MSE)
  • Sum of Squared Error (SSE)
  • Root Mean Square Error (RMSE)

R-Squared (R²)

R-squared measures how well your data fits a regression line. It measures the variation of the predicted values from the model against the actual values. It is typically given as a percentage, or as a value in the range of zero to one (although negative values are possible). It is also known as the Coefficient of Determination. The higher the value for R² the better.

R² is always between 0 and 100%:

  • 0% indicates that the model explains none of the variability of the response data around its mean.
  • 100% indicates that the model explains all the variability of the response data around its mean.

R² = 1 − (SS_res / SS_tot), where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.

But R² cannot determine whether the coefficient estimates and predictions are biased.

Mean Squared Error (MSE)

MSE measures the average squared error of our predictions. For each point, it calculates the squared difference between the prediction and the target, and then averages those values over the data set. The higher this value, the worse the model is.

MSE = (1/n) Σ (y_i − ŷ_i)²

The larger the number, the larger the error. Error in this case means the difference between the observed values and the predicted values. Squaring each difference ensures that negative and positive values do not cancel each other out.

Sum of Squared Error (SSE)

SSE is the sum of the squared differences between each observation and its group’s mean.  It measures the overall difference between your data and the values predicted by your estimation model.

SSE = Σ (y_i − ŷ_i)²

Root Mean Square Error (RMSE) 

RMSE is just the square root of MSE. The square root is introduced to make the scale of the errors the same as the scale of the targets. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response.
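For completeness, here is a small sketch of how these measures can be calculated in Python using numpy and the scikit-learn metrics functions, with some made-up actual and predicted values purely for illustration.

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# made-up actual and predicted values, purely for illustration
y_actual = np.array([3.0, 5.0, 7.5, 10.0])
y_pred   = np.array([2.8, 5.4, 7.0, 10.5])

r2   = r2_score(y_actual, y_pred)             # R-Squared
mse  = mean_squared_error(y_actual, y_pred)   # Mean Squared Error
sse  = np.sum((y_actual - y_pred) ** 2)       # Sum of Squared Error
rmse = np.sqrt(mse)                           # Root Mean Square Error

print('R2   =', r2)
print('MSE  =', mse)
print('SSE  =', sse)
print('RMSE =', rmse)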

 

 

Applying a Machine Learning Model in OAC


There are a number of different tools and languages available for machine learning projects. One such tool is Oracle Analytics Cloud (OAC).  Check out my article for Oracle Magazine that takes you through the steps of using OAC to create a Machine Learning workflow/dataflow.

Screenshot 2019-12-19 14.31.24

Oracle Analytics Cloud provides a single unified solution for analyzing data and delivering analytics solutions to businesses. Additionally, it provides functionality for processing data, allowing for data transformations, data cleaning, and data integration. Oracle Analytics Cloud also enables you to build a machine learning workflow, from loading, cleaning, and transforming data and creating a machine learning model to evaluating the model and applying it to new data—without the need to write a line of code. My Oracle Magazine article takes you through the various tasks for using Oracle Analytics Cloud to build a machine learning workflow.

That article covers the various steps for creating a machine learning model. This post will bring you through the steps of using that model to score/label new data.

In the Data Flows screen (accessed via Data->Data Flows) click on Create. We are going to create a new Data Flow to process the scoring/labeling of new data.

Screenshot 2019-12-19 15.08.39

Select Data Flow from the pop-up menu. The ‘Add Data Set’ window will open listing your available data sets. In my example, I’m going to use the same data set that I used in the Oracle Magazine article to build the model.  Click on the data set and then click on the Add button.

Screenshot 2019-12-19 15.14.44

The initial Data Flow will be created with the node for the Data Set. The screen will display all the attributes of the data set, and from this you can select which attributes to include or remove. For example, if you want a subset of the attributes to be used as input to the machine learning model, you can select these attributes at this stage. These can be adjusted at a later stage, but the data flow will need to be re-run to pick up the changes.

Screenshot 2019-12-19 15.17.48

The next step is to create the Apply Model node. To add this to the data flow click on the small plus symbol to the right of the Data node. This will pop open a window from which you will need to select Apply Model.

Screenshot 2019-12-19 15.22.40

A pop-up window will appear listing the various machine learning models that exist in your OAC environment. Select the model you want to use and click the Ok button.

Screenshot 2019-12-19 15.24.42

Screenshot 2019-12-19 15.25.22

The next node to add to the data flow is to save the results/outputs from the Apply Model node. Click on the small plus icon to the right of the Apply Model node and select Save Results from the popup window.

Screenshot 2019-12-19 15.27.50.png

We now have a completed data flow. But before you finish, edit the Save Data node to give a name to the saved Data Set, and to choose which attributes/features you want in the result set.

Screenshot 2019-12-19 15.30.25.png

You can now save and run the Data Flow, and view the outputs from applying the machine learning model. The saved data set results can be viewed in the Data menu.

Screenshot 2019-12-19 15.35.11

 

R (ROracle) and Oracle DATE formats


When it comes to working with R to access and process your data there are a number of little features and behaviors you need to look out for.

One of these is the DATE datatype.

The main issue that you have to look out for is the TIMEZONE conversion that happens when you extract the data from the database into your R environment.

There is a datatype conversion from the Oracle DATE into the POSIXct format. The POSIXct datatype also includes a timezone, but the Oracle DATE datatype does not have a timezone component.

When you look into this a bit more you will see that the main issue is what timezone your R session has. By default your R session will inherit the OS session timezone. For me here in Ireland we have the same timezone as the UK. You would think that the timezone would therefore be GMT, but this is not the case. What we have is BST (British Summer Time), which takes daylight saving time into account. So on the 26th May, BST is one hour ahead of GMT.

OK. Let’s have a look at a sample scenario.

The Problem

As mentioned above, when I select data of type DATE from Oracle into R, using ROracle, I end up getting a different date value than what was in the database. Similarly when I process and store the data.

The following outlines the data setup and some of the R code that was used to generate the issue/problem.

Data Set-up
Create a table that contains a DATE field and insert some records.

CREATE TABLE STAFF
(STAFF_NUMBER VARCHAR2(20),
FIRST_NAME VARCHAR2(20),
SURNAME VARCHAR2(20),
DOB DATE,
PROG_CODE VARCHAR2(6 BYTE),
PRIMARY KEY (STAFF_NUMBER));

insert into staff values (123456789, 'Brendan', 'Tierney', to_date('01/06/1975', 'DD/MM/YYYY'), 'DEPT_1');
insert into staff values (234567890, 'Sean', 'Reilly', to_date('21/10/1980', 'DD/MM/YYYY'), 'DEPT_2');
insert into staff values (345678901, 'John', 'Smith', to_date('12/03/1973', 'DD/MM/YYYY'), 'DEPT_3');
insert into staff values (456789012, 'Barry', 'Connolly', to_date('25/01/1970', 'DD/MM/YYYY'), 'DEPT_4');

You can query this data in SQL without any problems. As you can see there is no timezone element to these dates.

Selecting the data
I now establish my connection to my schema in my 12c database using ROracle. I won’t bore you with the details here of how to do it but check out point 3 on this post for some details.

When I select the data I get the following.

> res<-dbSendQuery(con, "select * from staff")
> data <- fetch(res)
> data$DOB
[1] "1975-06-01 01:00:00 BST" "1980-10-21 01:00:00 BST" "1973-03-12 00:00:00 BST"
[4] "1970-01-25 01:00:00 BST"

As you can see, two things have happened to my date data when it has been extracted from Oracle. Firstly it has assigned a timezone to the data, even though there was no timezone part of the original data. Secondly it has performed some sort of timezone conversion from GMT to BST. The difference between GMT and BST is daylight saving time. Hence the 01:00:00 being added to the time element that was extracted. This time should have been 00:00:00. You can see we have a mixture of times!

So there appears to be some difference between the R session timezone and what is being used in Oracle.

To add to this problem I was playing around with some dates and different records. I kept on getting this scenario but I also got the following, where we have a mixture of GMT and BST times and timezones. I’m not sure why we get this mixture, though it seems to depend on whether the date falls inside the daylight saving period.

> data$DOB
[1] "1995-01-19 00:00:00 GMT" "1965-06-20 01:00:00 BST" "1973-10-20 01:00:00 BST"
[4] "2000-12-28 00:00:00 GMT"

This is all a bit confusing and annoying. So let us look at how you can now fix this.

The Solution

Fixing the problem : Setting Session variables
To fix this, and to ensure that there is consistency between what is in Oracle and what is read out and converted into the R (POSIXct) format, you need to define two session variables. These session variables are used to ensure consistency in the date and time conversions.

These session variables are TZ, for the R session timezone setting, and the Oracle ORA_SDTZ setting, which specifies the timezone to be used for your Oracle connections.

The trick there is that these session variables need to be set before you create your ROracle connection. The following is the R code to set these session variables.

> Sys.setenv(TZ = "GMT")

> Sys.setenv(ORA_SDTZ = "GMT")

So you really need to have some knowledge of what kind of dates you are working with in the database and whether a timezone is part of them or is important. Alternatively you could set the above variables to UTC.

Selecting the data (correctly this time)
Now when we select our data from the table in our schema, after reconnecting or creating a new connection to the Oracle schema, we get the following.

> data$DOB

[1] "1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"

Now you can see we do not have any time element to the dates and this is correct in this example. So all is good.

We can now update the data and do whatever processing we want with the data in our R script.

But what happens when we save the data back to our Oracle schema? In the following R code we will add 2 days to the DOB attribute and then create a new table in our schema to save the updated data.

> data$DOB

[1] "1975-06-01 GMT" "1980-10-21 GMT" "1973-03-12 GMT" "1970-01-25 GMT"

> data$DOB <- data$DOB + days(2)
> data$DOB
[1] "1975-06-03 GMT" "1980-10-23 GMT" "1973-03-14 GMT" "1970-01-27 GMT"

 

> dbWriteTable(con, "STAFF_2", data, overwrite = TRUE, row.names = FALSE)
[1] TRUE

I’ve used the R package lubridate to do the date and time processing.

When we look at this newly created table in our Oracle schema we will see that we don’t have a DATE datatype for DOB; instead it has been created using a TIMESTAMP data type.

If you are working with TIMESTAMP and similar data types (i.e. data types that have a timezone element) then that is a slightly different problem.

Data Profiling in Python


With every data analytics and data science project, one of the first tasks everyone needs to do is to profile the data sets. Data profiling allows you to get an initial picture of the data set, and to see data distributions and relationships. Additionally it allows us to see what kind of data cleaning and data transformations are necessary.

Most data analytics tools and languages have some functionality available to help you. In particular, the various data science/machine learning products have this functionality built into them and can do a lot of the data profiling automatically for you. But if you don’t use these tools/products, then you are probably using R and/or Python to profile your data.

With Python you will be working with the data set loaded into a Pandas data frame. From there you will be using various statistical and graphing functions (and libraries) to create a data profile, and from that you will probably create a data profile report.

But one of the challenges with doing this in Python is needing different code for handling numeric and character-based attributes/features. The describe function in Python (similar to the summary function in R) gives some statistical summaries for numeric attributes/features; a different set of functions is needed for character-based attributes. The Python library repository (https://pypi.org/) contains over 200K projects. But which ones are really useful and will help with your data science projects, especially with new projects and libraries being released on a continual basis? Knowing what is new and useful is a major challenge.

For example the followings shows loading the titanic data set into a Pandas data frame, creating a subset and using the describe function in Python.

import pandas as pd

df = pd.read_csv("/Users/brendan.tierney/Dropbox/4-Datasets/titanic/train.csv")

df.head(5)

Screenshot 2019-11-22 16.58.39

df2 = df.iloc[:,[1,2,4,5,6,7,8,10,11]]
df2.head(5)

Screenshot 2019-11-22 16.59.30

df2.describe()

Screenshot 2019-11-22 17.00.17

You will notice the describe function has only looked at the numeric attributes.

One of those 200K+ Python libraries is one called pandas_profiling. This will create a data audit report for both numeric and character-based attributes. This must be good, right? Let’s take a look at what it does.

For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables, with Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values

The first step is to install the pandas_profiling library.

pip3 install pandas_profiling

Now run the pandas_profiling report for the same data frame created and used above.

import pandas_profiling as pp

df2.profile_report()
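The profile_report() call renders the report inline in a notebook. If you want a standalone HTML file instead, for example to share with others, pandas_profiling can also write the report to disk. A small sketch, using the pp import and df2 data frame from above; the output filename is just an example.

# build the report object and write it out as a standalone HTML file
profile = pp.ProfileReport(df2)
profile.to_file('titanic_profile.html')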

The following images show screen shots of each part of the report. Click and zoom into these to see more details.

Screenshot 2019-11-22 17.29.00Screenshot 2019-11-22 17.29.46

Screenshot 2019-11-22 17.30.57Screenshot 2019-11-22 17.31.32

Screenshot 2019-11-22 17.31.57Screenshot 2019-11-22 17.32.31

Screenshot 2019-11-22 17.33.02