SQL

Measuring Kurtosis of Data in Oracle (21c)

Posted on Updated on

Kurtosis is a new analytics function in Oracle 21c (20c) and is one of a set of commonly used statistical functions used to evaluate data to see and understand the behavior of the data.

[See my previous post where I give examples of the new Skewness functions]

Kurtosis is the measurement of the tails of the data distribution and its comparison with that of normal distribution. The Kurtosis of the normal distribution is said to be 3. To make interpenetrating results easier (a Zero) kurtosis measure for gaussian/normal distribution by subtracting 3 from its value, this is called Excess Kurtosis. Kurtosis can be used to describe the height or the breath of the distributions, when compared to a normal distributions, although this is not theoretically correct, it gives a simpler explanation and visualization of it. The following diagram gives an example of a normal distribution, a plot of Positive Kurtosis and Negative Kurtosis.

Prior to the new Kurtosis SQL functions (KURTOSIS_POP and KURTOSIS_SAMP), you had to calculate the Kurtosis value manually using something like the following SQL. These use the same data and attributes set used for the Skewness examples.

select avg(KV) K_value
from (select power((age - avg(age) over ())/stddev(age) over (), 4) KV 
      from cust_data)
union all
select avg(KV) K_value
from (select power((duration - avg(duration) over ())/stddev(duration) over (), 4) KV 
      from cust_data);

K_value
------------------------------------------
3.79088571963003808388287765230733611415
23.24420570926391173498028369605428048285

These don’t include the subtraction of 3 to give a zero kurtosis, and these values can be compared to the data distribution charts shown in the Skewness post.

Now with the new Kurtosis functions it simplifies the tasks of getting these values.

SELECT kurtosis_pop(age), kurtosis_samp(age) 
FROM bank_additional
union all
SELECT kurtosis_pop(duration), kurtosis_samp(duration) 
FROM bank_additional;

KURTOSIS_POP         KURTOSIS_SAMP
------------------   -----------------------------------------
 0.791069803527387    0.79131153115443467194451597661213420763
20.245334438614832   20.24793801497878942299945619307526969226

As you can see the Kurtosis function have the subtraction include.

As with the Skewness functions, the SAMP version works on a sample of the data values and as the number inputs increases, and differences between the POP and SAMP will reduce.

 

 

 

 

 

 

Measuring Skewness of Data in Oracle (21c)

Posted on Updated on

When analyzing data you will look at using a variety of different statistical functions to explore variable data insights.

One of these is the Skewness of the data.

Skewness is a measure of the asymmetry of the probability distribution about its mean. This looks a the tail of the data, with a positive value indicating the tail on the right side of the distribution, and a negative value when the tail is on the left hand side. A zero value indicates the tails on both side balance out, as shown in the following image.

Depiction of positive skewness, skewness, and negative skewness.

Most SQL dialects support Skewness using with an inbuilt function. But if it doesn’t then you would need to write your own version of the calculation, for example using the following.

SELECT avg(SV) S_value
FROM   (SELECT  power((age – avg(age) over ())/stddev(age) over (), 3) SV
        FROM    cust_data)

Here are charts illustrating the data in my table. These include the distributions for the AGE and DURATION attributes.

We can see the data is skewed. When we run the above code we get the following values.

Age = 0.78

Duration = 3.26

We can see the skewness of Duration is significantly longer, giving a positive value as the skewness is to the right.

In Oracle 21s we now have new Skewness functions called SKEWNESS_POP and SKEWNESS_SAMP.  The POP version of the function considers all records, where as the SAMP function considers a sample of the records. When your data set grows into many millions of records the SKEWNESS_SAMP will give a quicker response as it works with a sample of the data set

Both functions will give similar values but at the number of input records the returned values will returned will converge.

SELECT skewness_pop(age), skewness_samp(age) 
FROM cust_data;
SELECT skewness_pop(duration), skewness_samp(duration) 
FROM cust_data;

Adding Text Processing to Classification Machine Learning in Oracle Machine Learning

Posted on Updated on

One of the typical machine learning functions is Classification. This is in widespread use across most domains and geographic regions. I’ve written several blog posts on this topic over many years (and going back many, many year) on how to do this using Oracle Machine Learning (OML) (formally known as Oracle Advanced Analytic and in the Oracle Data Miner tool in SQL Developer). Just do a quick search of my blog to find some of these posts.

When it comes to Classification problems, typically the data set will be contain your typical categorical and numerical variables/features. The Automatic Data Preparation (ADP) feature of OML where it automatically pre-processes and transforms these variable for input to the machine learning algorithm. This greatly reduces the boring work of the data scientist and increases their productivity.

But sometimes data sets come with text descriptions. These will contain production descriptions, free format text, and other descriptive data, for example product reviews. But how can this information be included as part of the input data set to the machine learning algorithms. Oracle allows this kind of input data, and a letting bit of setup is needed to tell Oracle how to process the data set. This uses the in-database feature of Oracle Text.

The following example walks through an example of the steps needed to pre-process and include the text processing as part of the machine learning algorithm.

The data set: The data used to illustrate this and to show the steps needed, is a data set from Kaggle webiste. This data set contains 130K Wine Reviews. This data set contain descriptive information of the wine with attributes about each wine including country, region, number of points, price, etc as well as a text description contain a review of the wine.

The following are 2 files containing the DDL (to create the table) and then Import the data set (using sql script with insert statements). These can be run in your schema (in order listed below).

  1. Create table WINEREVIEWS_130K_IMP
  2. Insert records into WINEREVIEWS_130K_IMP table

I’ll leave the Data Exploration to you to do and to discover some early insights.

The ML Question

I want to be able to predict if a wine is a good quality wine, based on the prices and different characteristics of the wine?

Data Preparation

To be able to answer this question the first thing needed is to define a target variable to identify good and bad wines. To do this create a new attribute/feature called POINTS_BIN and populate it based on the number of points a wine has. If it has >90 points it is a good wine, if <90 points it is a bad wine.

ALTER TABLE WineReviews130K_bin ADD POINTS_BIN VARCHAR2(15);

UPDATE WineReviews130K_bin
SET POINTS_BIN = 'GT_90_Points'
WHERE winereviews130k_bin.POINTS >= 90;

UPDATE WineReviews130K_bin
SET POINTS_BIN = 'LT_90_Points'
WHERE winereviews130k_bin.POINTS < 90;

alter table WineReviews130K_bin DROP COLUMN POINTS;

The DESCRIPTION column data type needs to be changed to CLOB. This is to allow the Text Mining feature to work correctly.

-- add a new column of data type CLOB
ALTER TABLE WineReviews130K_bin ADD (DESCRIPTION_NEW CLOB);

-- update new column with data from the DESCRIPTION attribute
UPDATE WineReviews130K_bin SET DESCRIPTION_NEW = DESCRIPTION;

-- drop the DESCRIPTION attribute from table
ALTER TABLE WineReviews130K_bin DROP COLUMN DESCRIPTION;

-- rename the new attribute to replace DESCRIPTION
ALTER TABLE WineReviews130K_bin RENAME COLUMN DESCRIPTION_NEW TO DESCRIPTION;

 

Text Mining Configuration

There are a number of things we need to define for the Text Mining to work, these include a Lexer, Stop Word list and preferences.

First define the Lexer to use. In this case we will use a basic one and basic settings

BEGIN 
   ctx_ddl.create_preference('mylex', 'BASIC_LEXER'); 
   ctx_ddl.set_attribute('mylex', 'printjoins', '_-'); 
   ctx_ddl.set_attribute ( 'mylex', 'index_themes', 'NO'); 
   ctx_ddl.set_attribute ( 'mylex', 'index_text', 'YES'); 
END;

Next we can define a Stop Word List. Oracle Text comes with a predefined set of Stop Word lists for most of the common languages. You can add to one of those list or create your own. Depending on the domain you are working in it might be easier to create your own and it is very straight forward to do. For example:

DECLARE
   v_stoplist_name varchar2(100);
BEGIN
   v_stoplist_name := 'mystop';
   ctx_ddl.create_stoplist(v_stoplist_name, 'BASIC_STOPLIST'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'nonetheless');
   ctx_ddl.add_stopword(v_stoplist_name, 'Mr'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'Mrs'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'Ms'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'a'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'all'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'almost'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'also'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'although'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'an'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'and'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'any'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'are'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'as'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'at'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'be'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'because'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'been'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'both'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'but'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'by'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'can'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'could'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'd'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'did'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'do'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'does'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'either'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'for'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'from'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'had'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'has'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'have'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'having'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'he'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'her'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'here'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'hers'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'him'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'his'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'how'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'however'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'i'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'if'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'in'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'into'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'is'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'it'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'its'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'just'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'll'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'me'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'might'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'my'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'no'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'non'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'nor'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'not'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'of'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'on'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'one'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'only'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'onto'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'or'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'our'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'ours'); 
   ctx_ddl.add_stopword(v_stoplist_name, 's'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'shall'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'she'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'should'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'since'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'so'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'some'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'still'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'such'); 
   ctx_ddl.add_stopword(v_stoplist_name, 't'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'than'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'that'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'the'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'their'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'them'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'then'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'there'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'therefore'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'these'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'they'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'this'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'those'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'though'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'through'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'thus'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'to'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'too'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'until'); 
   ctx_ddl.add_stopword(v_stoplist_name, 've'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'very'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'was'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'we'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'were'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'what'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'when'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'where'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'whether'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'which'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'while'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'who'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'whose'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'why'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'will'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'with'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'would'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'yet'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'you'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'your'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'yours'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'drink');
   ctx_ddl.add_stopword(v_stoplist_name, 'flavors'); 
   ctx_ddl.add_stopword(v_stoplist_name, '2020');
   ctx_ddl.add_stopword(v_stoplist_name, 'now'); 
END;

Next define the preferences for processing the Text, for example what Stop Word list to use, if Fuzzy match is to be used and what language to use for this, number of tokens/words to process and if stemming is to be used.

BEGIN 
   ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
   ctx_ddl.set_attribute('mywordlist','FUZZY_MATCH','ENGLISH'); 
   ctx_ddl.set_attribute('mywordlist','FUZZY_SCORE','1'); 
   ctx_ddl.set_attribute('mywordlist','FUZZY_NUMRESULTS','5000'); 
   ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX','TRUE'); 
   ctx_ddl.set_attribute('mywordlist','STEMMER','ENGLISH'); 
END;

And the final step is to piece it all together by defining a new Text policy

BEGIN
   ctx_ddl.create_policy('my_policy', NULL, NULL, 'mylex', 'mystop', 'mywordlist');
END;

Define Settings for OML Model

We will create two models. An Attribute Importance model and a Classification model. The following defines the model parameters for each of these.

CREATE TABLE att_import_model_settings (setting_name varchar2(30), setting_value varchar2(30)); 
INSERT INTO att_import_model_settings (setting_name, setting_value)  
VALUES (''ALGO_NAME'', ''ALGO_AI_MDL'');
INSERT INTO att_import_model_settings (setting_name, setting_value) 
VALUES (''PREP_AUTO'', ''ON'');
INSERT INTO att_import_model_settings (setting_name, setting_value) 
VALUES (''ODMS_TEXT_POLICY_NAME'', ''my_policy'');
INSERT INTO att_import_model_settings (setting_name, setting_value) 
VALUES (''ODMS_TEXT_MAX_FEATURES'', ''3000'')';
CREATE TABLE wine_model_settings (setting_name varchar2(30), setting_value varchar2(30)); 
INSERT INTO wine_model_settings (setting_name, setting_value)  
VALUES (''ALGO_NAME'', ''ALGO_RANDOM_FOREST'');
INSERT INTO wine_model_settings (setting_name, setting_value) 
VALUES (''PREP_AUTO'', ''ON'');
INSERT INTO wine_model_settings (setting_name, setting_value) 
VALUES (''ODMS_TEXT_POLICY_NAME'', ''my_policy'');
INSERT INTO wine_model_settings (setting_name, setting_value) 
VALUES (''ODMS_TEXT_MAX_FEATURES'', ''3000'')';

Create the Training and Test data sets.

CREATE TABLE wine_train_data
AS SELECT id, country, description, designation, points_bin, price, province, region_1, region_2, taster_name, variety, title
FROM winereviews130k_bin 
SAMPLE (60) SEED (1);
CREATE TABLE wine_test_data
AS SELECT id, country, description, designation, points_bin, price, province, region_1, region_2, taster_name, variety, title
FROM winereviews130k_bin 
WHERE id NOT IN (SELECT id FROM wine_train_data);

All the set up is done, we can move onto the creating the machine learning models.

Create the OML Model (Attribute Importance & Classification)

We are going to create two models. The first is an Attribute Important model. This will look at the data set and will determine what attributes contribute most towards determining the target variable. As we are incorporting Texting Mining we will see what words/tokens from the DESCRIPTION attribute also contribute towards the target variable.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'GOOD_WINE_AI',
      mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
      data_table_name     => 'winereviews130k_bin',
      case_id_column_name => 'ID',
      target_column_name  => 'POINTS_BIN',
      settings_table_name => 'att_import_mode_settings');
END;

We can query the system views for Oracle ML to find out what are the important variables.

SELECT * FROM dm$vagood_wine_ai 
ORDER BY attribute_rank;

Here is the listing of the top 15 most important attributes. We can see from the first 15 rows and looking under column ATTRIBUTE_SUBNAME, the words from the DESCRIPTION attribute that seem to be important and contribute towards determining the value in the target attribute.

At this point you might determine, based on domain knowledge, some of these words should be excluded as they are generic for the domain. In this case, go back to the Stop Word List and recreate it with any additional words. This can be repeated until you are happy with the list. In this example, WINE could be excluded by including it in the Stop Word List.

Run the following to create the Classification model. It is very similar to what we ran above with minor changes to the name of the model, the data mining function and the name of the settings table.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'GOOD_WINE_MODEL',
      mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
      data_table_name     => 'winereviews130k_bin',
      case_id_column_name => 'ID',
      target_column_name  => 'POINTS_BIN',
      settings_table_name => 'wine_model_settings');
END;

Apply OML Model

The model can be applied in similar ways to any other ML model created using OML. For example the following displays the wine details along with the predicted points bin values (good or bad) and the probability score (<=1) of the prediction.

SELECT id, price, country, designation, province, variety, points_bin, 
       PREDICTION(good_wine_mode USING *) pred_points_bin,
       PREDICTION_PROBABILITY(good_wine_mode USING *) prob_points_bin
FROM wine_test_data;

 

 

Enhanced Window Clause functionality in Oracle 21c (20c)

Posted on Updated on

Updated: Changed 20c to Oracle 21c, as Oracle 20c Database never really existed 🙂

The Oracle Database has had advanced analytical functions for some time now and with each release we get to have some new additions or some enhancements to existing functionality.

One new enhancement, available and documented in 21c (not yet released at time of writing this), is changing in the way the Window Clause can be defined for analytic functions. Oracle 21c is available on Oracle Cloud as a pre-release for evaluation purposes (but it won’t be available for much longer!). The examples shown below are based on using this 21c pre-release of the database.

NOTE: At this point, no one really knows when or if 20c will be released. I’m sure all the documented 20c new features will be rolled into 21c, whenever that will be released.

Before giving some examples of the new Window Clause functionality, lets have a quick recap on how we could use it up to now (up to 19c database). Here is a simple example of windowing the data by creating partitions based on the distinct values in DEPTNO column

select deptno,
      ename,
       job,
       salary,
       avg (salary) over (partition by DEPTNO) avg_sal
from employee
order by deptno;

 

 

 

 

 

 

 

 

 

 

 

 

Here we get to see the average salary being calculated for each window partition and being reset for the next windwo partition.

The SQL:2011 standard support the defining of the Window clause in the query block, after defining the list tables for the query. This allows us to define the window clause one and then reference this for analytic function that need it. The following example illustrate this. I’ve take the able query and altered it to have the newer syntax. I’ve highlighted the new or changed code in blue. In the analytic function, the w1 refers to the Window clause defined later, and is more in keeping with how a query is logically processed.

select deptno, 
ename,
sal,
sum(sal) over (w1) sum_sal
from emp
window w1 as (partition by deptno);


As you would expect we get the same results returned.

This newer syntax is particularly useful when we have many more analytic function in our queries, and some of these are using slightly different windowing. To me it makes it easier to read and to make edits, allowing an edit to be preformed once instead of for each analytic function, and avoids any errors. But making it easier to read and understand is by far the greatest benefit. Here is another example which uses different window clauses using the previous syntax.

SELECT deptno, 
ename,
sal,
AVG(sal) OVER (PARTITION BY deptno ORDER BY sal) AS avg_dept_sal,
AVG(sal) OVER (PARTITION BY deptno ) AS avg_dept_sal2,
SUM(sal) OVER (PARTITION BY deptno ORDER BY sal desc) AS sum_dept_sal
FROM emp;

Using the newer syntax this gets transformed into the following.

SELECT deptno, 
      ename,
      sal,
      AVG(sal) OVER (w1) AS avg_dept_sal,
AVG(sal) OVER (w2) AS avg_dept_sal2,
SUM(sal) OVER (w2) AS avg_dept_sal
FROM emp
window w1 as (PARTITION BY deptno ORDER BY sal),
w2 as (PARTITION BY deptno),
w3 as (PARTITION BY deptno ORDER BY sal desc);

Adam Solver for Neural Networks (OML) in Oracle 21c

Posted on Updated on

Updated: Changed 20c to Oracle 21c, as Oracle 20c Database never really existed 🙂

The ability to create and use Neural Networks on business data has been available in Oracle Database since Oracle 18c (18c and 19c are just slightly extended versions of Oracle 12c). With each minor database release we get some small improvements and minor features added. I’ve written other blog posts about other 21c new machine learning features (see here, here and here).

With Oracle 21c they have added a new neural network solver. This is called Adam Solver and the original research was conducted by Diederik Kingma from OpenAI and Jimmy Ba from the University of Toronto and they presented they work at ICLR 2015. The name Adam is derived from ‘adaptive moment estimation‘. This algorithm, research and paper has gathered some attention in the research community over the past few years. Most of this has been focused on the benefits of using it.

Gentle Introduction to the Adam Optimization Algorithm for Deep ...

But care is needed. As with most machine learning (and deep learning) algorithms, they work up to a point. They may be good on certain problems and input data sets, and then for others they may not be as good or as efficient at producing an optimal outcome. Although using this solver may be beneficial to your problem, using the concept of ‘No Free Lunch’, you will need to prove the solver is beneficial for your problem.

With Oracle Machine Learning there are two Optimization Solver available for the Neural Network algorithm. The default solver is call L-BFGS (Limited memory Broyden-Fletch-Goldfarb-Shanno). This is one of the most popular solvers in use in most algorithms. The is a limited version of BFGS, using less memory (hence the L in the name) This solver finds the descent direction and line search is used to find the appropriate step size. The solver searches for the optimal solution of the loss function to find the extreme value (maximum or minimum) of the loss (cost) function

The Adam Solver uses an extension to stochastic gradient descent. It uses the squared gradients to scale the learning rate and it takes advantage of momentum by using moving average of the gradient instead of gradient. This allows the solver to work quickly by seeing less data and can work well with larger data sets.

With Oracle Data Mining the Adam Solver has the following parameters.

  • ADAM_ALPHA : Learning rate for solver. Default value is 0.001.
  • ADAM_BATCH_ROWS : Number of rows per batch. Default value is 10,000
  • ADAM_BETA1 : Exponential decay rate for 1st moment estimates. Default value is 0.9.
  • ADAM_BETA2 : Exponential decay rate for the 2nd moment estimates. Default value is 0.99.
  • ADAM_GRADIENT_TOLERANCE : Gradient infinity norm tolerance. Default value is 1E-9.

The parameters ADAM_ALPHA and ADAM_BATCH_ROWS can have an effect on the timing for the neural network algorithm to produce the model. Some exploration is needed to determine the optimal values for this parameters based on the size of the data set. For example having a larger value for ADAM_ALPHA results in a faster initial learning before the rates is updated. Small values than the default slows learning down during training.

To tell Oracle Machine Learning to use the Adam Solver the DMSSET_NN_SOLVER parameter needs to be set. The default setting for a neural network is DMSSET_NN_SOLVER_LGFGS.  But to use the Adam solver set it to DMSSET_NN_SOLVER_ADAM.

The following is an example of setting the parameters for the Adam solver and creating a neural network.

BEGIN
   DELETE FROM BANKING_NNET_SETTINGS;

   INSERT INTO BANKING_NNET_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_neural_network);

   INSERT INTO BANKING_NNET_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

   INSERT INTO BANKING_NNET_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.nnet_nodes_per_layer, '20,10,6');

   INSERT INTO BANKING_NNET_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.nnet_iterations, 10);

   INSERT INTO BANKING_NNET_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.NNET_SOLVER, 'NNET_SOLVER_ADAM');
END;

The addition of the last parameter overrides the default solver for building a neural network model.

To build the model we can use the following.

DECLARE
   v_start_time TIMESTAMP;
BEGIN
   begin DBMS_DATA_MINING.DROP_MODEL('BANKING_NNET_72K_1'); exception when others then null; end;

   v_start_time := current_timestamp;
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name.         => 'BANKING_NNET_72K_1',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'BANKING_72K',
      case_id_column_name => 'ID',
      target_column_name  => 'TARGET',
      settings_table_name => 'BANKING_NNET_SETTINGS');

   dbms_output.put_line('Time take to create model = ' || to_char(extract(second from (current_timestamp-v_start_time))) || ' seconds.');
END;

For me on my Oracle 20c Preview Database, this takes 1.8 seconds to run and create the neural network model ob a data set of 72,000 records.

Using the default solver, the model is created in 5.2 seconds. With using a small data set of 72,000 records, we can see the impact of using an Adam Solver  for creating a neural network model.

These timings and the timings shown below (in seconds) are based on the Oracle 20c Preview Database, using a minimum VM sizing and specification available.

RandomForest Machine Learning – Oracle Machine Learning (OML)

Posted on

Oracle Machine Learning has 30+ different machine learning algorithms built into the database. This means you can use SQL to create machine learning models and then use these models to score or label new data stored in the database or as the data is being created dynamically in the applications.

One of the most commonly used machine learning algorithms, over the past few years, is can RandomForest. This post will take a closer look at this algorithm and how you can build & use a RandomForest model.

Random Forest is known as an ensemble machine learning technique that involves the creation of hundreds of decision tree models. These hundreds of models are used to label or score new data by evaluating each of the decision trees and then determining the outcome based on the majority result from all the decision trees. Just like in the game show. The combining of a number of different ways of making a decision can result in a more accurate result or prediction.

Random Forest models can be used for classification and regression types of problems, which form the majority of machine learning systems and solutions. For classification problems, this is where the target variable has either a binary value or a small number of defined values. For classification problems the Random Forest model will evaluate the predicted value for each of the decision trees in the model. The final predicted outcome will be the majority vote for all the decision trees. For regression problems the predicted value is numeric and on some range or scale. For example, we might want to predict a customer’s lifetime value (LTV), or the potential value of an insurance claim, etc. With Random Forest, each decision tree will make a prediction of this numeric value. The algorithm will then average these values for the final predicted outcome.

 

Under the hood, Random Forest is a collection of decision trees. Although decision trees are a popular algorithm for machine learning, they can have a tendency to over fit the model. This can lead higher than expected errors when predicting unseen data. It also gives just one possible way of representing the data and being able to derive a possible predicted outcome.

Random Forest on the other hand relies of the predicted outcomes from many different decision trees, each of which is built in a slightly different way. It is an ensemble technique that combines the predicted outcomes from each decision tree to give one answer. Typically, the number of trees created by the Random Forest algorithm is defined by a parameter setting, and in most languages this can default to 100+ or 200+ trees.

The Random Forest algorithm has three main features:

  • It uses a method called bagging to create different subsets of the original training data
  • It will randomly section different subsets of the features/attributes and build the decision tree based on this subset
  • By creating many different decision trees, based on different subsets of the training data and different subsets of the features, it will increase the probability of capturing all possible ways of modeling the data

For each decision tree produced, the algorithm will use a measure, such as the Gini Index, to select the attributes to split on at each node of the decision tree.

To create a RandomForest model using Oracle Data Mining, you will follow the same process as with any of the other algorithms, the core of these are:

  1. define the parameter settings
  2. create the model
  3. score/label new data

Let’s start with the first step, defining the parameters. As with all the classification algorithms the same or similar parameters are set. With RandomForest we can set an additional parameter which tells the algorithm how many decision trees to create as part of the model. By default, 20 decision trees will be created. But if you want to change this number you can use the RFOR_NUM_TREES parameter. Remember the larger the value the longer it will take to create the model. But will have better accuracy. On the other hand with a small number of trees the quicker the model build will be, but might night be as accurate. This is something you will need to explore and determine. In the following example I change the number of trees to created to ten.

CREATE TABLE BANKING_RF_SETTINGS (
    SETTING_NAME VARCHAR2(50),
    SETTING_VALUE VARCHAR2(50)
);

BEGIN
  DELETE FROM BANKING_RF_SETTINGS;

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_random_forest);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.RFOR_NUM_TREES, 10);

  COMMIT;
END;

Other default parameters used include, for creating each decision tree, use random 50% selection of columns and 50% sample of training data.

Now for step 2, create the model.

DECLARE
   v_start_time  TIMESTAMP;
BEGIN
   DBMS_DATA_MINING.DROP_MODEL('BANKING_RF_72K_1');

   v_start_time := current_timestamp;

   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'BANKING_RF_72K_1',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'BANKING_72K',
      case_id_column_name => 'ID',
      target_column_name  => 'TARGET',
      settings_table_name => 'BANKING_RF_SETTINGS');

   dbms_output.put_line('Time take to create model = ' || to_char(extract(second from (current_timestamp-v_start_time))) || ' seconds.');
END;

The above code measures how long it takes to create the model.

I’ve run this same parameters and create models for different training data set sizes. I’ve also changed the number of decision trees to create. The following table shows the timings.

You can see it took 5.23 seconds to create a RandomForest model using the default settings for a data set of 72K records. This increase to just over one minute for a data set of 2 million records. Yo can also see the effect of reducing the number of decision trees on how long it takes the create model to run.

For step 3, on using the model on new data, this is just the same as with any of the classification models. Here is an example:

SELECT cust_id, target,
       prediction(BANKING_RF_72K_1 USING *)  predicted_value,
       prediction_probability(BANKING_RF_72K_1 USING *) probability
FROM   bank_test_v;

 

That’s it. That’s all there is to creating a RandomForest machine learning model using Oracle Machine Learning.

It’s quick and easy 🙂

 

MSET (Multivariate State Estimation Technique) in Oracle 20c

Posted on Updated on

Updated: Changed 20c to Oracle 21c, as Oracle 20c Database never really existed 🙂

Oracle 21c Database comes with some new in-database Machine Learning algorithms.

The short name for one of these is called MSET or Multivariate State Estimation Technique. That’s the simple short name. The more complete name is Multivariate State Estimation Technique – Sequential Probability Ratio Test.  That is a long name, and the reason is it consists of two algorithms. The first part looks at creating a model of the training data, and the second part looks at how new data is statistical different to the training data.

 

What are the use cases for this algorithm?  This algorithm can be used for anomaly detection.

Anomaly Detection, using algorithms, is able identifying unexpected items or events in data that differ to the norm. It can be easy to perform some simple calculations and graphics to examine and present data to see if there are any patterns in the data set. When the data sets grow it is difficult for humans to identify anomalies and we need the help of algorithms.

The images shown here are easy to analyze to spot the anomalies and it can be relatively easy to build some automated processing to identify these. Most of these solutions can be considered AI (Artificial Intelligence) solutions as they mimic human behaviors to identify the anomalies, and these example don’t need deep learning, neural networks or anything like that.

Other types of anomalies can be easily spotted in charts or graphics, such as the chart below.

There are many different algorithms available for anomaly detection, and the Oracle Database already has an algorithm called the One-Class Support Vector Machine. This is a variant of the main Support Vector Machine (SVD) algorithm, which maps or transforms the data, using a Kernel function, into space such that the data belonging to the class values are transformed by different amounts. This creates a Hyperplane between the mapped/transformed values and hopefully gives a large margin between the mapped/transformed points. This is what makes SVD very accurate, although it does have some scaling limitations. For a One-Class SVD, a similar process is followed. The aim is for anomalous data to be mapped differently to common or non-anomalous data, as shown in the following diagram.

 

Getting back to the MSET algorithm. Remember it is a 2-part algorithm abbreviated to MSET. The first part is a non-linear, nonparametric anomaly detection algorithm that calibrates the expected behavior of a system based on historical data from the normal sequence of monitored signals. Using data in time series format (DATE, Value) the training data set contains data consisting of “normal” behavior of the data. The algorithm creates a model to represent this “normal”/stationary data/behavior. The second part of the algorithm compares new or live data and calculates the differences between the estimated and actual signal values (residuals). It uses Sequential Probability Ratio Test (SPRT) calculations to determine whether any of the signals have become degraded. As you can imagine the creation of the training data set is vital and may consist of many iterations before determining the optimal training data set to use.

MSET has its origins in computer hardware failures monitoring. Sun Microsystems have been were using it back in the late 1990’s-early 2000’s to monitor and detect for component failures in their servers. Since then MSET has been widely used in power generation plants, airplanes, space travel, Disney uses it for equipment failures, and in more recent times has been extensively used in IOT environments with the anomaly detection focused on signal anomalies.

How does MSET work in Oracle 21c?

An important point to note before we start is, you can use MSET on your typical business data and other data stored in the database. It isn’t just for sensor, IOT, etc data mentioned above and can be used in many different business scenarios.

The first step you need to do is to create the time series data. This can be easily done using a view, but a Very important component is the Time attribute needs to be a DATE format. Additional attributes can be numeric data and these will be used as input to the algorithm for model creation.

-- Create training data set for MSET
CREATE OR REPLACE VIEW mset_train_data
AS SELECT time_id, 
          sum(quantity_sold) quantity,
          sum(amount_sold) amount 
FROM (SELECT * FROM sh.sales WHERE time_id <= '30-DEC-99’)
GROUP BY time_id 
ORDER BY time_id;

The example code above uses the SH schema data, and aggregates the data based on the TIME_ID attribute. This attribute is a DATE data type. The second import part of preparing and formatting the data is Ordering of the data. The ORDER BY is necessary to ensure the data is fed into or processed by the algorithm in the correct time series order.

The next step involves defining the parameters/hyper-parameters for the algorithm. All algorithms come with a set of default values, and in most cases these are suffice for your needs. In that case, you only need to define the Algorithm Name and to turn on Automatic Data Preparation. The following example illustrates this and also includes examples of setting some of the typical parameters for the algorithm.

BEGIN
  DELETE FROM mset_settings;

  -- Select MSET-SPRT as the algorithm
  INSERT  INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.algo_name, dbms_data_mining.algo_mset_sprt);

  -- Turn on automatic data preparation
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

  -- Set alert count
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALERT_COUNT, 3);

  -- Set alert window
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALERT_WINDOW, 5);

  -- Set alpha
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALPHA_PROB, 0.1);

  COMMIT;
END;

To create the MSET model using the MST_TRAIN_DATA view created above, we can run:

BEGIN
--   DBMS_DATA_MINING.DROP_MODEL(MSET_MODEL');

   DBMS_DATA_MINING.CREATE_MODEL (
      model_name          => 'MSET_MODEL',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'MSET_TRAIN_DATA',
      case_id_column_name => 'TIME_ID',
      target_column_name  => '',
      settings_table_name => 'MSET_SETTINGS');
END;

The SELECT statement below is an example of how to call and run the MSET model to label the data to find anomalies. The PREDICTION function will return a values of 0 (zero) or 1 (one) to indicate the predicted values. If the predicted values is 0 (zero) the MSET model has predicted the input record to be anomalous, where as a predicted values of 1 (one) indicates the value is typical. This can be used to filter out the records/data you will want to investigate in more detail.

-- display all dates with Anomalies
SELECT time_id, pred
FROM (SELECT time_id, prediction(mset_sh_model using *) over (ORDER BY time_id) pred 
      FROM mset_test_data)
WHERE pred = 0;

Storing and processing Unicode characters in Oracle

Posted on Updated on

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems (Wikipedia). The standard is maintained by the Unicode Consortium, and contains over 137,994 characters (137,766 graphic characters, 163 format characters and 65 control characters).

The NVARCHAR2 is Unicode data type that can store Unicode characters in an Oracle Database. The character set of the NVARCHAR2 is national character set specified at the database creation time. Use the following to determine the national character set for your database.

SELECT *
FROM nls_database_parameters
WHERE PARAMETER = 'NLS_NCHAR_CHARACTERSET';

For my database I’m using an Oracle Autonomous Database. This query returns the character set AL16UTF16. This character set encodes Unicode data in the UTF-16 encoding and uses 2 bytes to store a character.

When creating an attribute with this data type, the size value (max 4000) determines the number of characters allowed. The actual size of the attribute will be double.

Let’s setup some data to test this data type.

CREATE TABLE demo_nvarchar2 (
   attribute_name NVARCHAR2(100));

INSERT INTO demo_nvarchar2 
VALUES ('This is a test for nvarchar2');

The string is 28 characters long. We can use the DUMP function to see the details of what is actually stored.

SELECT attribute_name, DUMP(attribute_name,1016)
FROM demo_nvarchar2;

The DUMP function returns a VARCHAR2 value that contains the datatype code, the length in bytes, and the internal representation of a value.

 

You can see the difference in the storage size of the NVARCHAR2 and the VARCHAR2 attributes.

Valid values for the return_format are 8, 10, 16, 17, 1008, 1010, 1016 and 1017. These values are assigned the following meanings:


8 – octal notation
10 – decimal notation
16 – hexadecimal notation
17 – single characters
1008 – octal notation with the character set name
1010 – decimal notation with the character set name
1016 – hexadecimal notation with the character set name
1017 – single characters with the character set name

The returned value from the DUMP function gives the internal data type representation. The following table lists the various codes and their description.

Code Data Type
1 VARCHAR2(size [BYTE | CHAR])
1 NVARCHAR2(size)
2 NUMBER[(precision [, scale]])
8 LONG
12 DATE
21 BINARY_FLOAT
22 BINARY_DOUBLE
23 RAW(size)
24 LONG RAW
69 ROWID
96 CHAR [(size [BYTE | CHAR])]
96 NCHAR[(size)]
112 CLOB
112 NCLOB
113 BLOB
114 BFILE
180 TIMESTAMP [(fractional_seconds)]
181 TIMESTAMP [(fractional_seconds)] WITH TIME ZONE
182 INTERVAL YEAR [(year_precision)] TO MONTH
183 INTERVAL DAY [(day_precision)] TO SECOND[(fractional_seconds)]
208 UROWID [(size)]
231 TIMESTAMP [(fractional_seconds)] WITH LOCAL TIMEZONE

 

Reading Data from Oracle Table into Python Pandas – How long & Different arraysize

Posted on

Here are some results from a little testing I recent did on extracting data from an Oracle database and what effect the arraysize makes and which method might be the quickest.

The arraysize determines how many records will be retrieved in each each batch. When a query is issued to the database, the results are returned to the calling programme in batches of a certain size. Depending on the nature of the application and the number of records being retrieved, will determine the arraysize value. The value of this can have a dramatic effect on your query and application response times. Sometimes a small value works very well but sometimes you might need a larger value.

My test involved using an Oracle Database Cloud instance, using Python and the following values for the arraysize.

arraysize = (5, 50, 500, 1000, 2000, 3000, 4000, 5000) 

The first test was to see what effect these arraysizes have on retrieving all the data from a table. The in question has 73,668 records. So not a large table. The test loops through this list of values and fetches all the data, using the fetchall function (part of cx_Oracle), and then displays the time taken to retrieve the results.

# import the Oracle Python library
import cx_Oracle
import datetime
import pandas as pd
import numpy as np

# setting display width for outputs in PyCharm
desired_width = 280
pd.set_option('display.width', desired_width)
np.set_printoptions(linewidth=desired_width)
pd.set_option('display.max_columns',30)

# define the login details
p_username = "************"
p_password = "************"
p_host = "************"
p_service = "************"
p_port = "1521"

print('--------------------------------------------------------------------------')
print(' Testing the time to extract data from an Oracle Database.')
print('    using different approaches.')
print('---')
# create the connection
con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+"/"+p_service+":"+p_port)

print('')
print(' Test 1: Extracting data using Cursor for different Array sizes')
print('    Array Size = 5, 50, 500, 1000, 2000, 3000, 4000, 5000')
print('')
print('   Starting test at : ', datetime.datetime.now())

beginTime = datetime.datetime.now()
cur_array_size = (5, 50, 500, 1000, 2000, 3000, 4000, 5000)
sql = 'select * from banking_marketing_data_balance_v'

for size in cur_array_size:
    startTime = datetime.datetime.now()
    cur = con.cursor()
    cur.arraysize = size
    results = cur.execute(sql).fetchall()
    print('      Time taken : array size = ', size, ' = ', datetime.datetime.now()-startTime, ' seconds,  num of records = ', len(results))
    cur.close()

print('')
print('   Test 1: Time take = ', datetime.datetime.now()-beginTime)
print('')

And here are the results from this first test.

Starting test at :  2018-11-14 15:51:15.530002
      Time taken : array size =  5  =  0:36:31.855690  seconds,  num of records =  73668
      Time taken : array size =  50  =  0:05:32.444967  seconds,  num of records =  73668
      Time taken : array size =  500  =  0:00:40.757931  seconds,  num of records =  73668
      Time taken : array size =  1000  =  0:00:14.306910  seconds,  num of records =  73668
      Time taken : array size =  2000  =  0:00:10.182356  seconds,  num of records =  73668
      Time taken : array size =  3000  =  0:00:20.894687  seconds,  num of records =  73668
      Time taken : array size =  4000  =  0:00:07.843796  seconds,  num of records =  73668
      Time taken : array size =  5000  =  0:00:06.242697  seconds,  num of records =  73668

As you can see the variation in the results.

You may get different performance results based on your location, network connectivity and proximity of the database. I was at home (Ireland) using wifi and my database was located somewhere in USA. I ran the rest a number of times and the timings varied by +/- 15%, which is a lot!

When the data is retrieved in this manner you can process the data set in the returned results set. Or what is more traditional you will want to work with the data set as a panda. The next two test look at a couple of methods of querying the data and storing the result sets in a panda.

For these two test, I’ll set the arraysize = 3000. Let’s see what happens.

For the second test I’ll again use the fetchall() function to retrieve the data set. From that I extract the names of the columns and then create a panda combining the results data set and the column names.

startTime = datetime.datetime.now()
print('   Starting test at : ', startTime)
cur = con.cursor()
cur.arraysize = cur_array_size
results = cur.execute(sql).fetchall()
print('   Fetched ', len(results), ' in ', datetime.datetime.now()-startTime, ' seconds at ', datetime.datetime.now())
startTime2 = datetime.datetime.now()
col_names = []
for i in range(0, len(cur.description)):
    col_names.append(cur.description[i][0])
print(' Fetched data & Created the list of Column names in ', datetime.datetime.now()-startTime, ' seconds at ', datetime.datetime.now())

The results from this are.

      Fetched  73668  in  0:00:07.778850  seconds at  2018-11-14 16:35:07.840910
      Fetched data & Created the list of Column names in  0:00:07.779043  seconds at  2018-11-14 16:35:07.841093
      Finished creating Dataframe in  0:00:07.975074  seconds at  2018-11-14 16:35:08.037134

Test 2: Total Time take =  0:00:07.975614

Now that was quick. Fetching the data set in just over 7.7788 seconds. Creating the column names as fractions of a millisecond, and then the final creation of the panda took approx 0.13 seconds.

For the third these I used the pandas library function called read_sql(). This function takes two inputs. The first is the query to be processed and the second the name of the database connection.

print(' Test 3: Test timing for read_sql into a dataframe')
cur_array_size = 3000
print('   will use arraysize = ', cur_array_size)
print('')
startTime = datetime.datetime.now()
print('   Starting test at : ', startTime)

df2 = pd.read_sql(sql, con)

print('      Finished creating Dataframe in ', datetime.datetime.now()-startTime, ' seconds at ', datetime.datetime.now())
# close the connection at end of experiments
con.close()

and the results from this are.

   Test 3: Test timing for read_sql into a dataframe will use arraysize =  3000

   Starting test at :  2018-11-14 16:35:08.095189
      Finished creating Dataframe in  0:02:03.200411  seconds at  2018-11-14 16:37:11.295611

You can see that it took just over 2 minutes to create the panda data frame using the read_sql() function, compared to just under 8 seconds using the previous method.

It is important to test the various options for processing your data and find the one that works best in your environment. As with most languages there can be many ways to do the same thing. The challenge is to work out which one you should use.

OUG Ireland 2017 Presentation

Posted on

Here are the slides from my presentation at OUG Ireland 2017. All about running R using SQL.

Formatting results from ORE script in a SELECT statement

Posted on

This blog post looks at how to format the output or the returned returns from an Oracle R Enterprise (ORE), user defined R function, that is run using a SELECT statement in SQL.

Sometimes this can be a bit of a challenge to work out, but it can be relatively easy once you have figured out how to do it. The following examples works through some scenarios of different results sets from a user defined R function that is stored in the Oracle Database.

To run that user defined R function using a SELECT statement I can use one of the following ORE SQL functions.

  • rqEval
  • rqTableEval
  • rqGroupEval
  • rqRowEval

For simplicity we will just use the first of these ORE SQL functions to illustrate the problem and how to go about solving it. The rqEval ORE SQL function is a generate purpose function to call a user defined R script stored in the database. The function does not require any input data set and but it will return some data. You could use this to generate some dummy/test data or to find some information in the database. Here is noddy example that returns my name.

BEGIN
   --sys.rqScriptDrop('GET_NAME');
   sys.rqScriptCreate('GET_NAME',
      'function() {
         res<-data.frame("Brendan")
         res
         } ');
END;

To call this user defined R function I can use the following SQL.

select *
from table(rqEval(null,
                  'select cast(''a'' as varchar2(50))  from dual',
                  'GET_NAME') );  

For text strings returned you need to cast the returned value giving a size.

If we have a numeric value being returned we can don’t have to use the cast and instead use ‘1’ as shown in the following example. This second example extends our user defined R function to return my name and a number.

BEGIN
   sys.rqScriptDrop('GET_NAME');
   sys.rqScriptCreate('GET_NAME',
      'function() {
         res<-data.frame(NAME="Brendan", YEAR=2017)
         res
         } ');
END;

To call the updated GET_NAME function we now have to process two returned columns. The first is the character string and the second is a numeric.

select *
from table(rqEval(null,
                  'select cast(''a'' as varchar2(50)) as "NAME", 1 AS YEAR  from dual',
                  'GET_NAME') );                  

These example illustrate how you can process character strings and numerics being returned by the user defined R script.

The key to setting up the format of the returned values is knowing the structure of the data frame being returned by the user defined R script. Once you know that the rest is (in theory) easy.

Explicit Semantic Analysis setup using SQL and PL/SQL

Posted on Updated on

In my previous blog post I introduced the new Explicit Semantic Analysis (ESA) algorithm and gave an example of how you can build an ESA model and use it. Check out this link for that blog post.

In this blog post I will show you how you can manually create an ESA model. The reason that I’m showing you this way is that the workflow (in ODMr and it’s scheduler) may not be for everyone. You may want to automate the creation or recreation of the ESA model from time to time based on certain business requirements.

In my previous blog post I showed how you can setup a training data set. This comes with ODMr 4.2 but you may need to expand this data set or to use an alternative data set that is more in keeping with your domain.

Setup the ODM Settings table

As with all ODM algorithms we need to create a settings table. This settings table allows us to store the various parameters and their values, that will be used by the algorithm.

-- Create the settings table
CREATE TABLE ESA_settings (
    setting_name VARCHAR2(30),
    setting_value VARCHAR2(30));

-- Populate the settings table
-- Specify ESA. By default, Naive Bayes is used for classification.
-- Specify ADP. By default, ADP is not used. Need to turn this on.
BEGIN
    INSERT INTO ESA_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.algo_name,       
           dbms_data_mining.algo_explicit_semantic_analys);
   
    INSERT INTO ESA_settings (setting_name, setting_value)
    VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on);
  
    INSERT INTO ESA_settings (setting_name, setting_value)
    VALUES (odms_sampling,odms_sampling_disable);
  
    commit;
END; 

These are the minimum number of parameter setting needed to run the ESA algorithm. The other ESA algorithm setting include:

NewImage

Setup the Oracle Text Policy

You also need to setup an Oracle Text Policy and a lexer for the Stopwords.

DECLARE
   v_policy_name  varchar2(30);
   v_lexer_name   varchar2(3)
BEGIN
    v_policy_name  := 'ESA_TEXT_POLICY';
    v_lexer_name   := 'ESA_LEXER';
    ctx_ddl.create_preference(v_lexer_name, 'BASIC_LEXER');
    v_stoplist_name := 'CTXSYS.DEFAULT_STOPLIST';  -- default stop list
    ctx_ddl.create_policy(policy_name => v_policy_name, lexer => v_lexer_name, stoplist => v_stoplist_name);
END;

Create the ESA model

Once we have the settings table created with the parameter values set for the algorithm and the Oracle Text policy created, we can now create the model.

To ensure that the Oracle Text Policy is applied to the text we want to analyse we need to create a transformation list and add the Text Policy to it.

We can then pass the text transformation list as a parameter to the CREATE_MODEL, procedure.

DECLARE
   v_xlst              dbms_data_mining_transform.TRANSFORM_LIST;
   v_policy_name       VARCHAR2(130) := 'ESA_TEXT_POLICY';
   v_model_name        varchar2(50) := 'ESA_MODEL_DEMO_2';
BEGIN
   v_xlst := dbms_data_mining_transform.TRANSFORM_LIST();
   DBMS_DATA_MINING_TRANSFORM.SET_TRANSFORM(v_xlst, '"TEXT"', NULL, '"TEXT"', '"TEXT"', 'TEXT(POLICY_NAME:'||v_policy_name||')(MAX_FEATURES:3000)(MIN_DOCUMENTS:1)(TOKEN_TYPE:NORMAL)');

    DBMS_DATA_MINING.DROP_MODEL(v_model_name, TRUE);
    DBMS_DATA_MINING.CREATE_MODEL(
        model_name          => v_model_name,
        mining_function     => DBMS_DATA_MINING.FEATURE_EXTRACTION,
        data_table_name     => 'WIKISAMPLE',
        case_id_column_name => 'TITLE',
        target_column_name  => NULL,
        settings_table_name => 'ESA_SETTINGS',
        xform_list          => v_xlst);
END;

NOTE: Yes we could have merged all of the above code into one PL/SQL block.

Use the ESA model

We can now use the FEATURE_COMPARE function to use the model we just created, just like I did in my previous blog post.

SELECT FEATURE_COMPARE(ESA_MODEL_DEMO_2
               USING 'Oracle Database is the best available for managing your data' text 
               AND USING 'The SQL language is the one language that all databases have in common' text) similarity 
FROM DUAL;

Go give the ESA algorithm a go and see where you could apply it within your applications.