Setting up Julia to work with Oracle Database


For Data Science projects the top three languages every data scientist and machine learning practitioner knows are Python, R and SQL. The ranking or order of importance of these is open to some debate, and the real answer is, ‘It Depends’. But one thing is for sure: no matter what your environment, SQL skills will be needed, because that’s where the data lives, in the various databases of the organization. No matter what the database is, SQL is the way to access and analyze it efficiently. As for Python and R, the popularity of these languages really depends on the project team and their background. Deciding between the two can come down to flipping a coin. But everyone has their favorite!

A new (or not so new) language for data science and machine learning is Julia. It has actually been around for a while now, with work on it beginning in 2009, whereas R (and S) and Python have their beginnings back in the 1980s and early 1990s. Does that make them legacy programming languages? Or did Julia just take a bit of time to mature and gain popularity?

There are lots of advantages to Julia, just like there are lots of advantages with the other languages. The following diagram illustrates one of the core advantages of Julia: it isn’t an interpreted language like R and Python, which means Julia will be significantly faster, yet it still allows interactive development using Notebooks, just like R and Python. Julia was designed and built for data science and machine learning, and is designed for scale, which makes it a good fit for MLOps. The list of advantages and differences goes on a bit, but those are not the point of this post.

The remainder of this post will step through what is needed to get Julia working with an Oracle Database, assuming you have already installed Julia and set up an IDE. Check out the Julia website for excellent installation instructions and help selecting an IDE. If you are coming from an R and/or Python background, using Jupyter Notebooks is a good option, and as you become more experienced there are a number of more advanced IDEs available for you to use.

If you have done a new install of Julia, make sure to add the install directory to the search PATH.

First, download and install Oracle Instant Client. This is needed by the Julia packages to communicate with the Oracle Database. After installing, make sure to set up the following in your environment (environment variables and PATH):

  • ORACLE_HOME : points to where you installed Oracle Instant Client
  • TNS_ADMIN : points to the directory containing the wallet/tnsnames files. This will be a sub-directory in the Oracle Instant Client directory, for example …/instantclient_19_8/network/admin
  • PATH : include the Oracle Instant Client install directory in the PATH.

Next step is to set up the Oracle Client network files. Ask your DBA for the tnsnames.ora file or for the Wallet Zip file for your database. The Wallet Zip file is the most common approach. Unzip this Wallet file and copy the unzipped files to the TNS_ADMIN directory. See the second bullet point above for this (…/instantclient_19_8/network/admin).

That’s all you need to do on the Oracle setup.  I’m assuming you have a username and password for the Oracle Database you will be using.

Now we can set up Julia to use the Oracle Instant Client software. It is important you have set up the environment variables I’ve listed above.

There is an Oracle.jl package, developed by Felipe Noronha, which runs on top of Oracle Instant Client. To install it, load the Pkg package and then add the Oracle package. The following shows these commands and part of the output from the installation.

julia> using Pkg

julia> Pkg.add("Oracle")
Updating registry at `~/.julia/registries/General`
######################################################################## 100.0%
Resolving package versions...
Installed Reexport ──────────────────── v1.0.0
Installed libsodium_jll ─────────────── v1.0.18+1
Installed Compat ────────────────────── v3.25.0
Installed OrderedCollections ────────── v1.3.3
Installed WebSockets ────────────────── v1.5.9
Installed JuliaInterpreter ──────────── v0.8.8
Installed DataStructures ────────────── v0.18.9
Installed DataAPI ───────────────────── v1.5.1
Installed Requires ──────────────────── v1.1.2
Installed DataValueInterfaces ───────── v1.0.0
Installed Parsers ───────────────────── v1.0.15
Installed FlameGraphs ───────────────── v0.2.5
Installed URIs ──────────────────────── v1.2.0
Installed Colors ────────────────────── v0.12.6
Installed Oracle ────────────────────── v0.2.0
...
...
...
[7240a794] + Oracle v0.2.0
[bac558e1] ↑ OrderedCollections v1.3.2 ⇒ v1.3.3
[69de0a69] ↑ Parsers v1.0.12 ⇒ v1.0.15
[189a3867] ↑ Reexport v0.2.0 ⇒ v1.0.0
[ae029012] ↑ Requires v1.1.1 ⇒ v1.1.2
[3783bdb8] + TableTraits v1.0.0
[bd369af6] + Tables v1.3.2
[0796e94c] ↑ Tokenize v0.5.8 ⇒ v0.5.13
[5c2747f8] + URIs v1.2.0
[104b5d7c] ↑ WebSockets v1.5.2 ⇒ v1.5.9
[8f1865be] ↑ ZeroMQ_jll v4.3.2+5 ⇒ v4.3.2+6
[a9144af2] + libsodium_jll v1.0.18+1
Building Oracle → `~/.julia/packages/Oracle/CEOWz/deps/build.log`

julia>

You are now ready to load this Oracle package and use it to connect to an Oracle Database. Setting up a connection is really simple and in the following example I’m connecting to an ATP Database on Oracle Free Tier. The following sets up some variables, creates a connection, prints a statement and connection information and then closes the connection.

import Oracle

username="oml_user"
password="xxxxxxxxxxx"
dbname="yyyyyyyyyyyy"

conn = Oracle.Connection(username, password, dbname)

println("Connected")
println(conn)

Oracle.close(conn)

Job done 🙂

There is little additional connection information available. To test the connection a bit more, let’s list the tables I have in my test/demo schema/user.

import Oracle

username="oml_user"
password="xxxxxxxxxxx"
dbname="yyyyyyyyyyyy"

conn = Oracle.Connection(username, password, dbname)

println("Tables")
println("--------------------")

Oracle.query(conn, "SELECT table_name FROM user_tables") do cursor
    for row in cursor
        # row values can be accessed using column name or position
        println( row["TABLE_NAME"] ) # same as row[1]
    end
end

println("")
println("...the end...")

Oracle.close(conn)

If you come from a Python background the syntax is familiar, which makes the move over to Julia an easier task.
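
For comparison, here is a rough Python equivalent of the table-listing example above, sketched using the cx_Oracle package (this is an assumption on my part, as it isn’t covered in this post, and it presumes the same Instant Client and wallet setup):

import cx_Oracle

username = "oml_user"
password = "xxxxxxxxxxx"
dbname = "yyyyyyyyyyyy"

conn = cx_Oracle.connect(username, password, dbname)
cursor = conn.cursor()

# list the tables in the schema, accessing values by position (0-based in Python)
for row in cursor.execute("SELECT table_name FROM user_tables"):
    print(row[0])

cursor.close()
conn.close()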

One other difference is that the above Julia code does seem to run a lot quicker. I haven’t measured it and the difference is less than a second, but it is noticeable. For me, the above code generates the following output:

Tables
--------------------
WINE
BANK_ADDITIONAL_FULL
MINING_DATA_BUILD_V

...the end...

I’ll have additional posts looking at different aspects and commands for working with and processing data in an Oracle Database.

Collection of Oracle 21c posts on new Machine Learning and Statistical functions


Oracle 21c was officially released a few days ago and this post contains links to some blog posts I’ve written on new machine learning and statistical functions in the new Oracle 21c.

I also have posts on the new OML4Py and AutoML, and I’ll have a different set of posts for those, so look out for them.

Also check out my previous blog post that covers the new machine learning features introduced in Oracle 19c.

Measuring Kurtosis of Data in Oracle (21c)


Kurtosis is a new analytics function in Oracle 21c (20c) and is one of a set of commonly used statistical functions used to evaluate data, to see and understand its behavior.

[See my previous post where I give examples of the new Skewness functions]

Kurtosis is a measurement of the tails of the data distribution and how they compare with those of the normal distribution. The kurtosis of the normal distribution is said to be 3. To make interpreting results easier, a zero kurtosis measure for the gaussian/normal distribution is obtained by subtracting 3 from its value; this is called Excess Kurtosis. Kurtosis can be used to describe the height or the breadth of the distribution when compared to a normal distribution. Although this is not theoretically correct, it gives a simpler explanation and visualization of it. The following diagram gives an example of a normal distribution, a plot of Positive Kurtosis and Negative Kurtosis.
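
For reference, the population kurtosis and the excess kurtosis described above can be written as:

\mathrm{Kurt}[X] \;=\; \mathrm{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{4}\right], \qquad \text{Excess Kurtosis} \;=\; \mathrm{Kurt}[X] - 3

The manual SQL below computes the first of these expressions directly.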

Prior to the new Kurtosis SQL functions (KURTOSIS_POP and KURTOSIS_SAMP), you had to calculate the Kurtosis value manually, using something like the following SQL. These use the same data and attributes used for the Skewness examples.

select avg(KV) K_value
from (select power((age - avg(age) over ())/stddev(age) over (), 4) KV 
      from cust_data)
union all
select avg(KV) K_value
from (select power((duration - avg(duration) over ())/stddev(duration) over (), 4) KV 
      from cust_data);

K_value
------------------------------------------
3.79088571963003808388287765230733611415
23.24420570926391173498028369605428048285

These don’t include the subtraction of 3 to give a zero kurtosis, and these values can be compared to the data distribution charts shown in the Skewness post.

Now with the new Kurtosis functions it simplifies the tasks of getting these values.

SELECT kurtosis_pop(age), kurtosis_samp(age) 
FROM bank_additional
union all
SELECT kurtosis_pop(duration), kurtosis_samp(duration) 
FROM bank_additional;

KURTOSIS_POP         KURTOSIS_SAMP
------------------   -----------------------------------------
 0.791069803527387    0.79131153115443467194451597661213420763
20.245334438614832   20.24793801497878942299945619307526969226

As you can see, the Kurtosis functions have the subtraction of 3 included, giving the excess kurtosis.

As with the Skewness functions, the SAMP version works on a sample of the data values, and as the number of inputs increases, the differences between the POP and SAMP values will reduce.

Measuring Skewness of Data in Oracle (21c)


When analyzing data you will use a variety of different statistical functions to explore the data and gain insights into the variables.

One of these is the Skewness of the data.

Skewness is a measure of the asymmetry of the probability distribution about its mean. This looks at the tail of the data, with a positive value indicating the tail is on the right side of the distribution, and a negative value when the tail is on the left hand side. A zero value indicates the tails on both sides balance out, as shown in the following image.

Depiction of positive skewness, zero skewness, and negative skewness.
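
For reference, the (population) skewness being described here can be written as:

\mathrm{Skew}[X] \;=\; \mathrm{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right]

which is exactly what the manual SQL below calculates.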

Most SQL dialects support Skewness with an inbuilt function. But if yours doesn’t, then you would need to write your own version of the calculation, for example using the following.

SELECT avg(SV) S_value
FROM   (SELECT  power((age - avg(age) over ())/stddev(age) over (), 3) SV
        FROM    cust_data)

Here are charts illustrating the data in my table. These include the distributions for the AGE and DURATION attributes.

We can see the data is skewed. When we run the above code we get the following values.

Age = 0.78

Duration = 3.26

We can see the skewness of Duration is significantly larger, giving a positive value as the skewness is to the right.

In Oracle 21c we now have new Skewness functions called SKEWNESS_POP and SKEWNESS_SAMP. The POP version of the function considers all records, whereas the SAMP function considers a sample of the records. When your data set grows into many millions of records, SKEWNESS_SAMP will give a quicker response as it works with a sample of the data set.

Both functions will give similar values, and as the number of input records increases, the returned values will converge.

SELECT skewness_pop(age), skewness_samp(age) 
FROM cust_data;
SELECT skewness_pop(duration), skewness_samp(duration) 
FROM cust_data;

2020 Books on Data Science and Machine Learning


2020 has been an interesting year. Not for the obvious topic, but for new books on Data Science and Machine Learning. The list below contains some of my favorite books from 2020. Making the selection was difficult. Some months had a large number of releases and some were a bit quieter. The books below are listed based on their release date and are not ranked in any way. I’ve included links to these books on Amazon (.com, .uk and .de).

January

Everyone wants to work in Data Science, but where and how do you start? Aimed at beginners, with guidance but without the technical detail. High level, not for everyone.

amazon.com amazon.co.uk amazon.de

February

Taking ML to the next stage by creating AI applications. How to do it, with examples across a number of areas.

amazon.com amazon.co.uk amazon.de

March

A guide for those wary of the impact of technology and for those who are enthusiastic about where AI is taking us.

amazon.com amazon.co.uk amazon.de

April

AI Ethics was one of the top topics for 2020. Covers the philosophical aspects along with the technical ones.

amazon.com amazon.co.uk amazon.de

May

Covering the life-cycle of building ML applications, showing all that it entails and how ML plays a small part in the overall solution.

amazon.com amazon.co.uk amazon.de

June

Starting with the basics of NLP, it builds on this to cover applications, how to use it in different industries and within project teams.

amazon.com amazon.co.uk amazon.de

July

Written by Thomas Davenport and others, this is a good addition to his other books. Consisting of interviews, research and analysis on how to win with ML & AI.

amazon.com amazon.co.uk amazon.de

August

I was invited to contribute a couple of chapters to this book, along with well-known names in the areas of DS, ML & AI.

amazon.com amazon.co.uk amazon.de

September

Building upon the success of their 1st edition, the 2nd edition comes with more examples and extra chapters.

amazon.com amazon.co.uk amazon.de

October

ML & AI is not perfect. Lots can go wrong. Not just with the project but also with the implementation of the applications. Lots to think about and consider.

amazon.com amazon.co.uk amazon.de

November

No one really builds ML algorithms. We build ML solutions and applications. But what’s the best way to do this, from technical, organizational and ethical aspects?

amazon.com amazon.co.uk amazon.de

December

It was difficult to pick a book for this month. Lots of new releases, and I haven’t received all my orders at the time of this post.

Here is a book from July, which is related to an Automated Trading App I’ve been working on (and earning) for a couple of years.

amazon.com  amazon.co.uk  amazon.de

And to finish off the list I’m including this additional book. It wasn’t released this year; it was released in April 2018. It was a best seller on Amazon in 2018 and 2019! This was really exciting for us and we are still amazed at how it is still selling in 2020. It is currently, as of December 2020, listed in 8th place on the MIT Press Best Sellers list. It won’t be making any best seller list in 2020, but is still proving popular with many readers. To all of you who have bought this book, I’d like to say Thank You, and wishing you all the best for 2021 and beyond.

Adding Text Processing to Classification Machine Learning in Oracle Machine Learning


One of the typical machine learning functions is Classification. This is in widespread use across most domains and geographic regions. I’ve written several blog posts on this topic over many years (going back many, many years) on how to do this using Oracle Machine Learning (OML) (formerly known as Oracle Advanced Analytics, and in the Oracle Data Miner tool in SQL Developer). Just do a quick search of my blog to find some of these posts.

When it comes to Classification problems, typically the data set will contain your typical categorical and numerical variables/features. The Automatic Data Preparation (ADP) feature of OML automatically pre-processes and transforms these variables for input to the machine learning algorithm. This greatly reduces the boring work of the data scientist and increases their productivity.

But sometimes data sets come with text descriptions. These will contain product descriptions, free format text, and other descriptive data, for example product reviews. But how can this information be included as part of the input data set to the machine learning algorithms? Oracle allows this kind of input data, and a little bit of setup is needed to tell Oracle how to process it. This uses the in-database feature of Oracle Text.

The following walks through an example of the steps needed to pre-process the text and include the text processing as part of the machine learning algorithm.

The data set: The data used to illustrate this and to show the steps needed is a data set from the Kaggle website. This data set contains 130K Wine Reviews. It contains descriptive information about each wine, with attributes including country, region, number of points, price, etc., as well as a text description containing a review of the wine.

The following are two files containing the DDL (to create the table) and the data set import (a SQL script with insert statements). These can be run in your schema (in the order listed below).

  1. Create table WINEREVIEWS_130K_IMP
  2. Insert records into WINEREVIEWS_130K_IMP table

I’ll leave the Data Exploration to you to do and to discover some early insights.

The ML Question

I want to be able to predict if a wine is a good quality wine, based on the price and different characteristics of the wine.

Data Preparation

To be able to answer this question the first thing needed is to define a target variable to identify good and bad wines. To do this create a new attribute/feature called POINTS_BIN and populate it based on the number of points a wine has. If it has 90 or more points it is a good wine, if less than 90 points it is a bad wine.

ALTER TABLE WineReviews130K_bin ADD POINTS_BIN VARCHAR2(15);

UPDATE WineReviews130K_bin
SET POINTS_BIN = 'GT_90_Points'
WHERE winereviews130k_bin.POINTS >= 90;

UPDATE WineReviews130K_bin
SET POINTS_BIN = 'LT_90_Points'
WHERE winereviews130k_bin.POINTS < 90;

alter table WineReviews130K_bin DROP COLUMN POINTS;

The DESCRIPTION column data type needs to be changed to CLOB. This is to allow the Text Mining feature to work correctly.

-- add a new column of data type CLOB
ALTER TABLE WineReviews130K_bin ADD (DESCRIPTION_NEW CLOB);

-- update new column with data from the DESCRIPTION attribute
UPDATE WineReviews130K_bin SET DESCRIPTION_NEW = DESCRIPTION;

-- drop the DESCRIPTION attribute from table
ALTER TABLE WineReviews130K_bin DROP COLUMN DESCRIPTION;

-- rename the new attribute to replace DESCRIPTION
ALTER TABLE WineReviews130K_bin RENAME COLUMN DESCRIPTION_NEW TO DESCRIPTION;

 

Text Mining Configuration

There are a number of things we need to define for the Text Mining to work; these include a Lexer, a Stop Word list and preferences.

First define the Lexer to use. In this case we will use a basic one with basic settings:

BEGIN 
   ctx_ddl.create_preference('mylex', 'BASIC_LEXER'); 
   ctx_ddl.set_attribute('mylex', 'printjoins', '_-'); 
   ctx_ddl.set_attribute ( 'mylex', 'index_themes', 'NO'); 
   ctx_ddl.set_attribute ( 'mylex', 'index_text', 'YES'); 
END;

Next we can define a Stop Word List. Oracle Text comes with a predefined set of Stop Word lists for most of the common languages. You can add to one of those lists or create your own. Depending on the domain you are working in, it might be easier to create your own, and it is very straightforward to do. For example:

DECLARE
   v_stoplist_name varchar2(100);
BEGIN
   v_stoplist_name := 'mystop';
   ctx_ddl.create_stoplist(v_stoplist_name, 'BASIC_STOPLIST'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'nonetheless');
   ctx_ddl.add_stopword(v_stoplist_name, 'Mr'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'Mrs'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'Ms'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'a'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'all'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'almost'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'also'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'although'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'an'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'and'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'any'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'are'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'as'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'at'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'be'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'because'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'been'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'both'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'but'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'by'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'can'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'could'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'd'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'did'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'do'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'does'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'either'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'for'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'from'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'had'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'has'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'have'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'having'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'he'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'her'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'here'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'hers'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'him'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'his'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'how'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'however'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'i'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'if'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'in'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'into'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'is'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'it'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'its'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'just'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'll'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'me'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'might'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'my'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'no'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'non'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'nor'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'not'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'of'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'on'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'one'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'only'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'onto'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'or'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'our'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'ours'); 
   ctx_ddl.add_stopword(v_stoplist_name, 's'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'shall'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'she'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'should'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'since'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'so'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'some'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'still'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'such'); 
   ctx_ddl.add_stopword(v_stoplist_name, 't'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'than'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'that'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'the'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'their'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'them'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'then'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'there'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'therefore'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'these'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'they'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'this'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'those'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'though'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'through'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'thus'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'to'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'too'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'until'); 
   ctx_ddl.add_stopword(v_stoplist_name, 've'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'very'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'was'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'we'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'were'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'what'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'when'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'where'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'whether'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'which'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'while'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'who'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'whose'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'why'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'will'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'with'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'would'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'yet'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'you'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'your'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'yours'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'drink');
   ctx_ddl.add_stopword(v_stoplist_name, 'flavors'); 
   ctx_ddl.add_stopword(v_stoplist_name, '2020');
   ctx_ddl.add_stopword(v_stoplist_name, 'now'); 
END;

Next define the preferences for processing the text, for example what Stop Word list to use, whether Fuzzy matching is to be used and what language to use for it, the number of tokens/words to process, and whether stemming is to be used.

BEGIN 
   ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
   ctx_ddl.set_attribute('mywordlist','FUZZY_MATCH','ENGLISH'); 
   ctx_ddl.set_attribute('mywordlist','FUZZY_SCORE','1'); 
   ctx_ddl.set_attribute('mywordlist','FUZZY_NUMRESULTS','5000'); 
   ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX','TRUE'); 
   ctx_ddl.set_attribute('mywordlist','STEMMER','ENGLISH'); 
END;

And the final step is to piece it all together by defining a new Text policy:

BEGIN
   ctx_ddl.create_policy('my_policy', NULL, NULL, 'mylex', 'mystop', 'mywordlist');
END;

Define Settings for OML Model

We will create two models: an Attribute Importance model and a Classification model. The following defines the model parameters for each of these.

CREATE TABLE att_import_model_settings (setting_name varchar2(30), setting_value varchar2(30)); 
INSERT INTO att_import_model_settings (setting_name, setting_value)  
VALUES ('ALGO_NAME', 'ALGO_AI_MDL');
INSERT INTO att_import_model_settings (setting_name, setting_value) 
VALUES ('PREP_AUTO', 'ON');
INSERT INTO att_import_model_settings (setting_name, setting_value) 
VALUES ('ODMS_TEXT_POLICY_NAME', 'my_policy');
INSERT INTO att_import_model_settings (setting_name, setting_value) 
VALUES ('ODMS_TEXT_MAX_FEATURES', '3000');

CREATE TABLE wine_model_settings (setting_name varchar2(30), setting_value varchar2(30)); 
INSERT INTO wine_model_settings (setting_name, setting_value)  
VALUES ('ALGO_NAME', 'ALGO_RANDOM_FOREST');
INSERT INTO wine_model_settings (setting_name, setting_value) 
VALUES ('PREP_AUTO', 'ON');
INSERT INTO wine_model_settings (setting_name, setting_value) 
VALUES ('ODMS_TEXT_POLICY_NAME', 'my_policy');
INSERT INTO wine_model_settings (setting_name, setting_value) 
VALUES ('ODMS_TEXT_MAX_FEATURES', '3000');

Create the Training and Test data sets.

CREATE TABLE wine_train_data
AS SELECT id, country, description, designation, points_bin, price, province, region_1, region_2, taster_name, variety, title
FROM winereviews130k_bin 
SAMPLE (60) SEED (1);
CREATE TABLE wine_test_data
AS SELECT id, country, description, designation, points_bin, price, province, region_1, region_2, taster_name, variety, title
FROM winereviews130k_bin 
WHERE id NOT IN (SELECT id FROM wine_train_data);

All the setup is done, so we can move on to creating the machine learning models.

Create the OML Model (Attribute Importance & Classification)

We are going to create two models. The first is an Attribute Importance model. This will look at the data set and determine which attributes contribute most towards determining the target variable. As we are incorporating Text Mining, we will also see which words/tokens from the DESCRIPTION attribute contribute towards the target variable.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'GOOD_WINE_AI',
      mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
      data_table_name     => 'winereviews130k_bin',
      case_id_column_name => 'ID',
      target_column_name  => 'POINTS_BIN',
      settings_table_name => 'att_import_model_settings');
END;

We can query the Oracle ML system views to find out which variables are important.

SELECT * FROM dm$vagood_wine_ai 
ORDER BY attribute_rank;

Here is the listing of the top 15 most important attributes. Looking at the ATTRIBUTE_SUBNAME column in these rows, we can see the words from the DESCRIPTION attribute that seem to be important and contribute towards determining the value in the target attribute.

At this point you might determine, based on domain knowledge, that some of these words should be excluded as they are generic for the domain. In this case, go back to the Stop Word List and recreate it with any additional words. This can be repeated until you are happy with the list. In this example, WINE could be excluded by including it in the Stop Word List.

Run the following to create the Classification model. It is very similar to what we ran above with minor changes to the name of the model, the data mining function and the name of the settings table.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'GOOD_WINE_MODEL',
      mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
      data_table_name     => 'winereviews130k_bin',
      case_id_column_name => 'ID',
      target_column_name  => 'POINTS_BIN',
      settings_table_name => 'wine_model_settings');
END;

Apply OML Model

The model can be applied in similar ways to any other ML model created using OML. For example the following displays the wine details along with the predicted points bin values (good or bad) and the probability score (<=1) of the prediction.

SELECT id, price, country, designation, province, variety, points_bin, 
       PREDICTION(good_wine_model USING *) pred_points_bin,
       PREDICTION_PROBABILITY(good_wine_model USING *) prob_points_bin
FROM wine_test_data;

Enhanced Window Clause functionality in Oracle 21c (20c)


Updated: Changed 20c to Oracle 21c, as Oracle 20c Database never really existed 🙂

The Oracle Database has had advanced analytical functions for some time now and with each release we get to have some new additions or some enhancements to existing functionality.

One new enhancement, available and documented in 21c (not yet released at the time of writing this), is a change in the way the Window Clause can be defined for analytic functions. Oracle 21c is available on Oracle Cloud as a pre-release for evaluation purposes (but it won’t be available for much longer!). The examples shown below are based on using this 21c pre-release of the database.

NOTE: At this point, no one really knows when or if 20c will be released. I’m sure all the documented 20c new features will be rolled into 21c, whenever that will be released.

Before giving some examples of the new Window Clause functionality, let’s have a quick recap on how we could use it up to now (up to the 19c database). Here is a simple example of windowing the data by creating partitions based on the distinct values in the DEPTNO column.

select deptno,
       ename,
       job,
       salary,
       avg(salary) over (partition by DEPTNO) avg_sal
from employee
order by deptno;

Here we get to see the average salary being calculated for each window partition and being reset for the next window partition.

The SQL:2011 standard supports defining the Window clause in the query block, after the list of tables for the query. This allows us to define the window clause once and then reference it from each analytic function that needs it. The following example illustrates this. I’ve taken the above query and altered it to use the newer syntax. In the analytic function, w1 refers to the Window clause defined later, and this is more in keeping with how a query is logically processed.

select deptno, 
       ename,
       sal,
       sum(sal) over (w1) sum_sal
from emp
window w1 as (partition by deptno);


As you would expect we get the same results returned.

This newer syntax is particularly useful when we have many more analytic functions in our queries, and some of these use slightly different windowing. To me it makes the query easier to read and to edit, allowing an edit to be performed once instead of for each analytic function, and it avoids errors. But making it easier to read and understand is by far the greatest benefit. Here is another example which uses different window clauses using the previous syntax.

SELECT deptno, 
       ename,
       sal,
       AVG(sal) OVER (PARTITION BY deptno ORDER BY sal) AS avg_dept_sal,
       AVG(sal) OVER (PARTITION BY deptno) AS avg_dept_sal2,
       SUM(sal) OVER (PARTITION BY deptno ORDER BY sal desc) AS sum_dept_sal
FROM emp;

Using the newer syntax this gets transformed into the following.

SELECT deptno, 
       ename,
       sal,
       AVG(sal) OVER (w1) AS avg_dept_sal,
       AVG(sal) OVER (w2) AS avg_dept_sal2,
       SUM(sal) OVER (w3) AS sum_dept_sal
FROM emp
window w1 as (PARTITION BY deptno ORDER BY sal),
       w2 as (PARTITION BY deptno),
       w3 as (PARTITION BY deptno ORDER BY sal desc);

Exploring Database trends using Python pytrends (Google Trends)


A little word of warning before you read the rest of this post. The examples shown below are just examples of what is possible. They aren’t very scientific or rigorous, so don’t come complaining if what is shown doesn’t match your knowledge and other insights. This is just a little fun to see what is possible. Yes, a more rigorous scientific study is needed, and some attempts at this can be seen at DB-Engines.com. A less scientific example is the TOPDB Top Database index, which isn’t meant to be very scientific either.

After all of that, here we go 🙂

pytrends is a library providing an API to Google Trends using Python. The following examples show some ways you can use this library, and the focus area I’ll be using is Databases. Many of you are already familiar with using Google Trends, and if this isn’t something you have looked at before then I’d encourage you to go have a look at their website and to give it a try. You don’t need to run Python to use it. For example, here are a couple of screen shots from the Google Trends website, comparing Relational Database to NoSQL Database. The information presented is based on what searches have been performed over the past 12 months. Some of the information is kind of interesting when you look at the related queries and also the distribution of countries.

To install pytrends use the pip command

pip3 install pytrends

As usual it will check the various dependent libraries and update them where necessary. In my particular case, the only library it updated was pandas.

You do need to be careful of how many searches you perform, as you may be limited due to Google rate limits. You can get around this by using a proxy, and there is an example on the pytrends PyPi website of how to do this.
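
If you do hit these limits, a rough sketch of creating the TrendReq object with proxy, retry and back-off settings (the proxy address below is just a placeholder, following the pattern shown in the pytrends documentation) would be:

from pytrends.request import TrendReq

#placeholder proxy address - replace with your own working HTTPS proxies
pytrends = TrendReq(hl='en-US', tz=360, timeout=(10, 25),
                    proxies=['https://34.203.233.13:80'],
                    retries=2, backoff_factor=0.1)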

The following code illustrates how to import and set up an initial request. The pandas library is also loaded, as the data returned by the pytrends API is loaded into a pandas dataframe. This will make it easier to format and explore the data.

import pandas as pd 
from pytrends.request import TrendReq

pytrends = TrendReq()

The pytrends API has about nine methods. For my example I’ll be using the following:

  • Interest Over Time: returns historical, indexed data for when the keyword was searched most as shown on Google Trends’ Interest Over Time section.
  • Interest by Region: returns data for where the keyword is most searched as shown on Google Trends’ Interest by Region section.
  • Related Queries: returns data for the related keywords to a provided keyword shown on Google Trends’ Related Queries section.
  • Suggestions: returns a list of additional suggested keywords that can be used to refine a trend search.

Let’s now explore these APIs using Databases as the main topic of investigation and examining some of the different products. I’ve used the db-engines.com website to select the top 5 databases (as of the date of this blog post). These were:

  • Oracle
  • MySQL
  • SQL Server
  • PostgreSQL
  • MongoDB

I will use this list to look at the number of searches and other related information. The first thing is to import the necessary libraries and create the connection to Google Trends.

import pandas as pd 
from pytrends.request import TrendReq

pytrends = TrendReq()

Next setup the payload and keep the timeframe for searches to the past 12 months only.

search_list = ["Oracle", "MySQL", "SQL Server", "PostgreSQL", "MongoDB"] #max of 5 values allowed
pytrends.build_payload(search_list, timeframe='today 12-m')

We can now look at the interest over time method to see the number of searches, based on a ranking where 100 is the most popular.

df_ot = pd.DataFrame(pytrends.interest_over_time()).drop(columns='isPartial')
df_ot

and to see a breakdown of these numbers on an hourly basis you can use the get_historical_interest method.

pytrends.get_historical_interest(search_list)

Let’s move on to exploring the level of interest/searches by country. The following retrieves this information, ordered by Oracle (in descending order), and then selects the top 20 countries. Here we can see the relative number of searches per country. Note these do not necessarily relate to the countries with the largest number of searches.

df_ibr = pytrends.interest_by_region(resolution='COUNTRY') # CITY, COUNTRY or REGION
df_ibr.sort_values('Oracle', ascending=False).head(20)

Visualizing data is always a good thing to do, as we can see patterns and differences in the data in a clearer way. The following takes the above query and creates a stacked bar chart.

import matplotlib
from matplotlib import pyplot as plt

df2 = df_ibr.sort_values('Oracle', ascending=False).head(20)

df2.reset_index().plot(x='geoName', y=['Oracle', 'MySQL', 'SQL Server', 'PostgreSQL', 'MongoDB'], kind ='bar', stacked=True, title="Searches by Country")

plt.rcParams["figure.figsize"] = [20, 8]
plt.xlabel("Country")
plt.ylabel("Ranking")

We can delve into the data more by focusing on one particular country and examining the Google searches by city or region. The following looks at the data from the USA and gives the rankings for the various states.

pytrends.build_payload(search_list, geo='US')
df_ibr = pytrends.interest_by_region(resolution='COUNTRY', inc_low_vol=True)
df_ibr.sort_values('Oracle', ascending=False).head(20)

df2 = df_ibr.sort_values('Oracle', ascending=False).head(20)
df2.reset_index().plot(x='geoName', y=['Oracle', 'MySQL', 'SQL Server', 'PostgreSQL', 'MongoDB'], kind ='bar', stacked=True, title="Searches for USA")
plt.rcParams["figure.figsize"] = [20, 8]

plt.title("Searches for USA")
plt.xlabel("State")
plt.ylabel("Ranking")

 

We can find the top related queries and top queries including the names of each database.

search_list = ["Oracle", "MySQL", "SQL Server", "PostgreSQL", "MongoDB"] #max of 5 values allowed
pytrends.build_payload(search_list, timeframe='today 12-m')

rq = pytrends.related_queries()
rq.values()

#display rising terms
rq.get('Oracle').get('rising')

We can see the top related rising queries for Oracle are about TikTok. No real surprise there!

and the top queries for Oracle included:

rq.get('Oracle').get('top')
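
The Suggestions method, listed earlier, can also be used to find additional keywords for refining a search. A minimal sketch, using the Oracle keyword again:

#suggested keywords related to a search term
suggs = pytrends.suggestions(keyword='Oracle')
pd.DataFrame(suggs).drop(columns='mid')   #drop the internal topic id column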

This was an interesting exercise to do. I didn’t show all the results, but when you explore the other databases in the list and see the results from those, and then compare them across the five databases you get to see some interesting patterns.

 

Pre-built Machine Learning Models


Machine learning has seen widespread adoption over the past few years. In more recent times we have seen examples of how the models, created by the machine learning algorithms, can be shared. There have been various approaches to sharing these models using different model interchange languages. Some of these have become more or less popular over time; for example, a few years ago PMML was very popular, and in more recent times ONNX seems to be popular. Who knows what it will be next year or in a couple of years time.

With the increased use of machine learning models and the ability to share them, we are now seeing other uses of them. Typically the sharing of models involved a company transferring a model developed by the data scientists in their lab environment to the DevOps teams, who then deploy the model into the production environment. This has developed a new area of expertise of MLOps or AIOps.

The languages and tools used by the data scientists in the lab environment are different to the languages used to deploy applications in production. The model interchange languages can be used to take the model parameters, algorithm type and data transformations, etc. and map these into the interchange language. The production environment would read this interchange object and apply it in the production language. In such situations the models will use the algorithms already coded in the production language. For example, the lab environment could be using Python, but the production environment could be using C, Java, Go, etc. Python is an interpreted language and in a lot of cases is not suitable for real-time use in a production environment, due to speed and scalability issues. In this case the underlying algorithm of the production language will be used and not the algorithm used in the lab. In theory the algorithms should be the same. For example a decision tree algorithm using Gini Index in one language should function in the same way in another language. We all know there can be a small to a very large difference between what happens in theory and how it works in practice. Different languages and different developers will do things slightly differently. This means there will be differences between the accuracy of the models developed in the lab versus the accuracy of the (same) model used in production. As long as everyone is aware of this, then everything will be ok. But it will be an important task, for the data science team, to have some measurements of these differences.

Here’s One I Made Earlier (book cover)

Moving on a little, we are now seeing some other developments with the development and sharing of machine learning models, and the use of open model interchange languages, like ONNX, makes this possible.

We are now seeing people making their machine learning models available to the wider community, instead of keeping them within their own team or organization.

Why would someone do this? Why would they share their machine learning model? It’s a bit like the picture above, which comes from a very popular kids programme on the BBC called Blue Peter. They would regularly show some craft projects for kids to work on at home. They would never show all the steps needed to finish the project and would end up showing us “one I made earlier”. It always looked perfect and nothing like what they tried to make in the studio, and nothing like my attempt.

But having pre-made machine learning models is now a thing. There are lots of examples of these; for example, the ONNX website has several pre-trained models ready for you to use. These cover various examples for image classification, object detection, machine translation and comprehension, language modeling, speech and audio processing, etc. More are being added over time.

Most of these pre-trained models are based on defined data sets and problems, and they allow others to see what has been done and start building upon that work without the need to go through the training and validating phase.

Could we have something like this in the commercial world? Could we have pre-trained machine learning models being standardized and shared across different organizations? Again the in-theory versus in-practice terms apply. Many organizations within a domain use the same or similar applications for capturing, storing, processing and analyzing their data. In this case could the sharing of machine learning models help everyone be more competitive or have better insights and discoveries from their data? Again the difference between in-theory and in-practice applies.

Some might remember in the early days of Data Warehousing we used to have some industry (dimensional) models, and vendors and consulting companies would offer their custom developed industry models and how to populate these. In theory these were supposed to help companies speed up their time to data insights and save money. We have seen similar attempts at doing similar things over the decades. But the reality was most projects ended up being way more expensive and took way too long to deploy, due to lots of technical difficulties and lots of differences in the business understanding, interpretation and deployment of the underlying applications. The pre-built DW model was generic and didn’t really fit in with the business needs.

Although we are seeing more and more pre-trained machine learning models appearing on the market, and many vendors are offering pre-trained solutions, can these really work? Some of these pre-trained models are based on certain data preparation, using one particular machine learning model and using only one particular evaluation metric. As with the custom DW models of twenty years ago, pre-trained ML models are of limited use.

Everyone is different, data is different, behavior is different, etc.; the list goes on. Using the principle of the “No Free Lunch” theorem, although we might be using the same or similar applications for capturing, storing, processing and analysing our data, the underlying behavior of the data (and the transactions, customers etc. that influence that) will be different, the marketing campaigns will be different, business semantics may be different, general operating models will be different, etc. Based on “No Free Lunch” we need to explore the data using a variety of different algorithms, to determine what works for our data at this point in time. The behavior of the data (and the business influences on it) keeps on changing and evolving on a daily, weekly, monthly, etc. basis. A great example of this, but at a more extreme and rapid rate of change, happened during the COVID pandemic. Most of the machine learning models developed over the preceding period no longer worked, the models developed during the pandemic have a very short life span, and it will take some time before “normal” returns and newer models can be built to represent the “new normal”.

Principal Component Analysis (PCA) in Oracle


Principal Component Analysis (PCA) is a statistical process used for feature or dimensionality reduction in data science and machine learning projects. It summarizes the features of a large data set into a smaller set of features by projecting each data point onto only the first few principal components, to obtain lower-dimensional data while preserving as much of the data’s variation as possible. There are lots of resources that go into the mathematics behind this approach. I’m not going to go into that detail here and a quick internet search will get you what you need.

PCA can be used to discover important features from large data sets (large as in having a large number of features), while preserving as much information as possible.

Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that approximate the data as well as possible in the least squares sense. A line or plane that is the least squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible.
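
In more formal terms (using the usual notation, where \Sigma is the covariance matrix of the data X), the first principal component is the unit vector that maximizes the variance of the projected data:

\mathbf{w}_{1} \;=\; \underset{\lVert \mathbf{w} \rVert = 1}{\arg\max}\; \mathrm{Var}(X\mathbf{w}) \;=\; \underset{\lVert \mathbf{w} \rVert = 1}{\arg\max}\; \mathbf{w}^{\top}\Sigma\,\mathbf{w}

Subsequent components solve the same problem, subject to being orthogonal to the components already found.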

Oracle has implemented PCA using Singular Value Decomposition (SVD) on the covariance and correlations between variables, for feature extraction/reduction. PCA is closely related to SVD. PCA computes a set of orthonormal bases (principal components) that are ranked by their corresponding explained variance. The main difference between SVD and PCA is that the PCA projection is not scaled by the singular values. The extracted features are transformed features consisting of linear combinations of the original features.

When machine learning is performed on this reduced set of transformed features, it can be completed with fewer resources and in less time, while still maintaining accuracy.

Algorithm settings in Oracle:

Mining Model Function = FEATURE_EXTRACTION

Algorithm = ALGO_SINGULAR_VALUE_DECOMP

(Hyper)-Parameters for algorithms

  • SVDS_U_MATRIX_OUTPUT : SVDS_U_MATRIX_ENABLE or SVDS_U_MATRIX_DISABLE
  • SVDS_SCORING_MODE : SVDS_SCORING_SVD or SVDS_SCORING_PCA
  • SVDS_SOLVER : possible values include SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN, SVDS_SOLVER_SSVD, SVDS_SOLVER_STEIGEN
  • SVDS_TOLERANCE : range of 0…1
  • SVDS_RANDOM_SEED : range of 0…4294967296 (!)
  • SVDS_OVER_SAMPLING : range of 1…5000
  • SVDS_POWER_ITERATIONS : Default value 2, with possible range of 0…20

Let’s work through an example using the MINING_DATA_BUILD_V data set that comes with Oracle Data Miner.

First step is to define the parameter settings for the algorithm. No data preparation is needed as the algorithm takes care of this. This means you can disable the Automatic Data Preparation (ADP).

-- create the parameter table
CREATE TABLE svd_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(4000));

-- define the settings for SVD algorithm
BEGIN 
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_singular_value_decomp);

   -- turn OFF ADP
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off); 

   -- set PCA scoring mode
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.svds_scoring_mode, dbms_data_mining.svds_scoring_pca);

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_shift_2dnum, dbms_data_mining.prep_shift_mean); 

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_scale_2dnum, dbms_data_mining.prep_scale_stddev); 
END;
/

You are now ready to create the model.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'SVD_MODEL',
      mining_function     => dbms_data_mining.feature_extraction,
      data_table_name     => 'mining_data_build_v',
      case_id_column_name => 'CUST_ID',
      settings_table_name => 'svd_settings');
END;

Once created, you can use the mining model data dictionary views to explore the model, its specifics and the various matrices created, using the model-specific views. These include:

  • DM$VESVD_Model : Singular Value Decomposition S Matrix
  • DM$VGSVD_Model : Global Name-Value Pairs
  • DM$VNSVD_Model : Normalization and Missing Value Handling
  • DM$VSSVD_Model : Computed Settings
  • DM$VUSVD_Model : Singular Value Decomposition U Matrix
  • DM$VVSVD_Model : Singular Value Decomposition V Matrix
  • DM$VWSVD_Model : Model Build Alerts

Where the S, V and U matrix contain:

  • U matrix : consists of a set of ‘left’ orthonormal bases
  • S matrix : is a diagonal matrix
  • V matrix : consists of set of ‘right’ orthonormal bases

These can be explored using the following

-- S matrix
select feature_id, VALUE, variance, pct_cum_variance 
from DM$VESVD_MODEL;

-- V matrix
select feature_id, attribute_name, value
from DM$VVSVD_MODEL
order by feature_id, attribute_name;

-- U matrix
select feature_id, case_id, value
from DM$VUSVD_MODEL
order by feature_id, case_id;

To determine the projections to be used for visualizations we can use the FEATURE_VALUE function.

select FEATURE_VALUE(svd_model, 1 USING *) proj1, 
       FEATURE_VALUE(svd_model, 2 USING *) proj2
from   mining_data_build_v 
where  cust_id <= 101510
order by 1, 2;

Other algorithms available in Oracle for feature extraction and reduction include:

  • Non-Negative Matrix Factorization (NMF)
  • Explicit Semantic Analysis (ESA)
  • Minimum Description Length (MDL) – this is really feature selection rather than feature extraction

ONNX for exchanging Machine Learning Models


When working on building predictive applications using machine learning algorithms, you will probably be working with languages and frameworks such as Python, R, PyTorch, TensorFlow, and lots of others. One of the challenges we face is taking these machine learning models from our test/lab environment and putting them into production. By this I mean integrating them with our production systems to allow real-time use of these ML models. This is not a topic that is discussed very often. Many of the most common languages and frameworks are very easy to use for machine learning, but running them in production can be slow. This can lead to lots of problems and can regularly label machine learning projects as failures. None of us want that. Sometimes people look at re-coding all the machine learning models in other languages such as C or Java or Julia, as these are noted for their high speed and scalability in production environments. (Remember many of the common ML languages and frameworks are actually developed using C and Java.)

To remove the need to recode your models, many of the languages, frameworks and tools have opened up the ability to allow model interchange. This approach allows you to use the tools that work best for you, in your environment and your company, to develop, test and evaluate machine learning models. These can then be packaged up and shared with other languages, frameworks or tools suitable for production environments, eliminating or significantly reducing the need for large coding projects and allowing for quicker time to deployment.

There are many machine learning model interchange frameworks available. Historically PMML was popular, but with the rise of other machine learning and deep learning algorithms it seems to have lost the popularity contest. One of the more popular machine learning interchange frameworks is called ONNX. This has been growing in popularity with a wide body of languages, tools and vendors.

Overview of ONNX, Its Advantages and Capabilities

ONNX stands for the Open Neural Network eXchange and is designed to allow developers to easily move between different machine learning and deep learning frameworks. This allows the easy migration from research and model development environments to other environments more suited to deployment, allowing for faster scoring of data. ONNX allows for the migration of the model with the minimum of recoding. ONNX generates or provides for an extensible computation dataflow graph model, with built-in operators and data types focused on inferencing.

To use ONNX with Python install the ONNX libraries; the sample code below also uses the skl2onnx converter for scikit-learn models:

pip3 install onnx-mxnet
pip3 install skl2onnx

The following is an extract of sample code generating a model, converting it to ONNX format and saving it to file.

#train a model
#load sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

#load the IRIS sample data set 
iris = load_iris()
X, y = iris.data, iris.target
#create the train and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
#define and create Random Forest data set
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

#convert into ONNX format & save to file
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
initial_type = [('float_input', FloatTensorType([None, 4]))]
#convert to ONNX
onx = convert_sklearn(rf, initial_types=initial_type)
#save to file
with open("rf_iris.onnx", "wb") as f:
    f.write(onx.SerializeToString())

The above example illustrates converting a sklearn model. For other algorithms and models, converters exist and are available in the ONNX GitHub repositories.
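
To show the scoring side of the exchange, here is a rough sketch (assuming the rf_iris.onnx file and the X_test array created above, and that the onnxruntime package has been installed) of loading the saved model and scoring the hold-out data:

import numpy as np
import onnxruntime as rt

#load the ONNX model and score the hold-out data
sess = rt.InferenceSession("rf_iris.onnx")
input_name = sess.get_inputs()[0].name
pred_onnx = sess.run(None, {input_name: X_test.astype(np.float32)})[0]
print(pred_onnx[:10])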

k-Fold and Repeated k-Fold Cross Validation in Python


When it comes to evaluating the performance of a machine learning model there are a number of different approaches. Plus there are as many different viewpoints on what is the best or better evaluation metric to use.

One of the common approaches is to use k-Fold cross validation. This divides the data into ‘k’ non-overlapping parts (or Folds). One of these parts/Folds is used for hold-out testing and the remaining parts/Folds (k-1) are used to train and create a model. This model is then applied or fitted to the hold-out ‘k’ part/Fold. This process is repeated across all the ‘k’ parts/Folds until all the data has been used. The results from applying or fitting the model are aggregated and the mean performance is reported.
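
As a quick illustration of the mechanics (using a tiny made-up array, not the data set used later in this post), KFold simply hands back the row indices for each train/hold-out split:

import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)   #ten sample rows, purely for illustration
kf = KFold(n_splits=5)

#each iteration gives the indices used to train and the indices held out for testing
for train_idx, test_idx in kf.split(data):
    print("train:", train_idx, " hold-out:", test_idx)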

Traditionally, ‘k’ is set to 10 and this will be the default value in most/all languages, libraries, packages and applications. This number can be changed to anything you want. Most reports indicate a value of between 5 and 10, as these seem to give results that don’t suffer from excessive bias or variance.

Let’s take a look at an example of using k-Fold Cross Validation with the Scikit-Learn library. The first step is to prepare the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

bank_file = "/.../4-Datasets/bank-additional-full.csv"

# import dataset
df = pd.read_csv(bank_file, sep=';',)

# get basic details of df (num records, num features)
df.shape

print('Percentage per target class ')
df['y'].value_counts()/len(df) #calculate percentages

#Data Clean up
df = df.drop('duration', axis=1) #this is highly correlated to target variable
df['y'] = df['y'].map({'no':0, 'yes':1}) #binary encoding of class label (done before get_dummies so 'y' stays a single column)
df_new = pd.get_dummies(df) #simple and easy approach for categorical variables
df_new.describe()

#split data set into input variables and target variables
## create separate dataframes for Input features (X) and for Target feature (Y)
X_train = df_new.drop('y', axis=1)
Y_train = df_new['y']

Now we can perform k-fold cross validation.

#load scikit-learn k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#setup for k-Fold Cross Validation
cv = KFold(n_splits=10, shuffle=True, random_state=1)
#n_splits = number of k-folds
#shuffle = shuffles data set prior to split
#random_state = seed for (pseudo)random number generator
#define model
model = LogisticRegression()
#create model, perform cross validation and evaluate model
scores = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=cv, n_jobs=-1)
#performance result
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

 

We can see from the above example the model is evaluated across 10 folds, giving the accuracy score for each of these. The mean of these 10 accuracy scores is calculated along with the standard deviation, which in this example is very small. You may have slightly different results and this will vary from data set to data set.

The results from k-fold can be noisy, as each time the code is run a slightly different result may be achieved. This is due to having different splits of the data set into the k-folds. The model accuracy can vary between each execution and it can be difficult to determine which iteration of the model should be used.

One way to address this possible noise is to estimate the model accuracy/performance based on running k-fold a number of times and calculating the performance across all the repeats. This approach is called Repeated k-Fold Cross-Validation. Yes, there is a computational cost to this approach, and it is therefore suited to data sets of smaller scale. In most scenarios having data sets of up to 1M records/cases is possible, and depending on the hardware and memory, it can scale to many times that and still be relatively quick to run.

[a small data set for one person could be another person’s Big Data set!]

How many repeats should be performed? It kind of depends on how noisy the data is, but in a similar way to having ten as the default value for k, the default number of repeats is ten. This can be adjusted to, say, five, but some testing/experimentation is needed to determine a suitable value.

Building upon the k-fold example code given previously, the following shows an example of using Repeated k-Fold Cross Validation.

#Repeated k-Fold Cross Validation
#load the necessary libraries
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

#using the same data set created for k-Fold => X_train, Y_train

#Setup and configure settings for Repeated k-Fold CV (k-folds=10, repeats=10)
rcv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=1)

#define model
model = LogisticRegression()

#create model, perform Repeated CV and evaluate model
scores = cross_val_score(model, X_train, Y_train, scoring='accuracy', cv=rcv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))