SQL

Partitioned Models – Oracle Machine Learning (OML)

Posted on Updated on

Building machine learning models can be a relatively trivial task. But getting to that point and understanding what to do next can be challenging. Yes the task of creating a model is simple and usually takes a few line of code. This is what is shown in most examples. But when you try to apply to real world problems we are faced with other challenges. Some of which include volume of data is larger, building efficient ML pipelines is challenging, time to create models gets longer, applying models to new data in real-time takes longer (not possible in real-time), etc. Yes these are typically challenges and most of these can be easily overcome.

When building ML solutions for real-world problem you will be faced with building (and deploying) many 10s or 100s of ML models. Why are so many models needed? Almost every example we see for ML takes the entire data set and build a model on that data. When you think about it, not everyone in the data set can be considered in the same grouping (similar characteristics). If we were to build a model on the data set and apply it to new data, we will get a generic prediction. A prediction comparing the new data item (new customer, purchase, etc) with everyone else in the data population. Maybe this is why so many ML project fail as they are building generic solution that performs badly when run on new (and evolving) data.

To overcome this we start to look at the different groups of data in the data set. Can the data set be divided into a number of different parts based on some characteristics. If we could do this and build a separate model on each group (or cluster), then we would have ML models that would be more accurate with their predictions. This is where we will end up creating 10s or 100s of models. As you can imagine the work involved in doing this with be LOTs. Then think about all the coding needed to manage all of this. What about the complexity of all the code needed for making the predictions on new data.

Yes all of this gets complex very, very quickly!
Ideally we want a separate model for each group

But how can you do that efficiently? is it possible?

When working with Oracle Machine Learning, you can use a feature called partitioned models. Partitioned Models are designed to handle this type of problem. They are designed to:

  • make the building of models simple
  • scales as the data and number of partitions increase
  • includes all the steps part of the ML pipeline (all the data prep, transformations, etc)
  • make predicting new data using the ML model simple
  • make the deployment of the ML model easy
  • make the MLOps process simple
  • make the use of ML model easy to use by all developers no matter the programming language
  • make the ML model build and ML model scoring quick and with better, more accurate predictions.

Screenshot 2020-06-15 11.11.42

Let us work through an example. In this example lets start by creating a Random Forest ML model using the entire data set. The following code shows setting up the Parameters settings table. The second code segment creates the Random Forest ML model. The training data set being used in this example contains 72,000 records.

BEGIN
  DELETE FROM BANKING_RF_SETTINGS;

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_random_forest);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

 COMMIT;
END;
/

-- Create the ML model
DECLARE
   v_start_time  TIMESTAMP;
BEGIN
   DBMS_DATA_MINING.DROP_MODEL('BANKING_RF_72K_1');

   v_start_time := current_timestamp;

   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'BANKING_RF_72K_1',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'BANKING_72K',
      case_id_column_name => 'ID',
      target_column_name  => 'TARGET',
      settings_table_name => 'BANKING_RF_SETTINGS');

   dbms_output.put_line('Time take to create model = ' || to_char(extract(second from (current_timestamp-v_start_time))) || ' seconds.');
END;
/

This is the basic setup and the following table illustrates how long the CREATE_MODEL function takes to run for different sizes of training datasets and with different number of trees per model. The default number of trees is 20.

Screenshot 2020-06-15 12.19.51

To run this model against new data we could use something like the following SQL query.

SELECT cust_id, target,
       prediction(BANKING_RF_72K_1 USING *)  predicted_value,
       prediction_probability(BANKING_RF_72K_1 USING *) probability
FROM   bank_test_v;

This is simple and straight forward to use.

For the 72,000 records it takes just approx 5.23 seconds to create the model, which includes creating 20 Decision Trees. As mentioned earlier, this will be a generic model covering the entire data set.

To create a partitioned model, we can add new parameter which lists the attributes to use to partition the data set. For example, if the partition attribute is MARITAL, we see it has four different values. This means when this attribute is used as the partition attribute, Oracle Machine Learning will create four separate sub Random Forest models all until the one umbrella model. This means the above SQL query to run the model, does not change and the correct sub model will be selected to run on the data based on the value of MARITAL attribute.

To create this partitioned model you need to add the following to the settings table.

BEGIN
  DELETE FROM BANKING_RF_SETTINGS;

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_random_forest);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.odms_partition_columns, 'MARITAL’);

COMMIT;
END;
/

The code to create the model remains the same!

The code to call and use the model remains the same!

This keeps everything very simple and very easy to use.

When I ran the CREATE_MODEL code for the partitioned model, it took approx 8.3 seconds to run. Yes it took slightly longer than the previous example, but this time it is creating four models instead of one. This is still very quick!

What if I wanted to add more attributes to the partition key? Yes you can do that. The more attributes you add, the more sub-models will be be created.

For example, if I was to add JOB attribute to the partition key list. I will now get 48 sub-models (with 20 Decision Trees each) being created. The JOB attribute has 12 distinct values, multiplied by the 4 values for MARITAL, gives us 48 models.

INSERT INTO banking_RF_settings (setting_name, setting_value)
VALUES (dbms_data_mining.odms_partition_columns, 'MARITAL,JOB');

How long does this take the CREATE_MODEL code to run? approx 37 seconds!

Again that is quick!

Again remember the code to create the model and to run the model to predict on new data does not change. This means our applications using this ML model does not change. This shows us we can very easily increase the predictive accuracy of our models with only adding one additional model, and by improving this accuracy by adding more attributes to the partition key.

But you do need to be careful with what attributes to include in the partition key. If the attributes have a very high number of distinct values, will result in 100s, or 1000s of sub models being created.

An important benefit of using partitioned models is when a new distinct value occurs in one of the partition key attributes. You code to create the parameters and models does not change. OML will automatically will pick this up and do all the work under the hood.

 

XGBoost in Oracle 20c

Posted on Updated on

Another of the new machine learning algorithms in Oracle 20c Database is called XGBoost. Most people will have come across this algorithm due to its recent popularity with winners of Kaggle competitions and other similar events.

XGBoost is an open source software library providing a gradient boosting framework in most of the commonly used data science, machine learning and software development languages. It has it’s origins back in 2014, but the first official academic publication on the algorithm was published in 2016 by Tianqi Chen and Carlos Guestrin, from the University of Washington.

The algorithm builds upon the previous work on Decision Trees, Bagging, Random Forest, Boosting and Gradient Boosting. The benefits of using these various approaches are well know, researched, developed and proven over many years. XGBoost can be used for the typical use cases of Classification including classification, regression and ranking problems. Check out the original research paper for more details of the inner workings of the algorithm.

Regular machine learning models, like Decision Trees, simply train a single model using a training data set, and only this model is used for predictions. Although a Decision Tree is very simple to create (and very very quick to do so) its predictive power may not be as good as most other algorithms, despite providing model explainability. To overcome this limitation ensemble approaches can be used to create multiple Decision Trees and combines these for predictive purposes. Bagging is an approach where the predictions from multiple DT models are combined using majority voting. Building upon the bagging approach Random Forest uses different subsets of features and subsets of the training data, combining these in different ways to create a collection of DT models and presented as one model to the user. Boosting takes a more iterative approach to refining the models by building sequential models with each subsequent model is focused on minimizing the errors of the previous model. Gradient Boosting uses gradient descent algorithm to minimize errors in subsequent models. Finally with XGBoost builds upon these previous steps enabling parallel processing, tree pruning, missing data treatment, regularization and better cache, memory and hardware optimization. It’s commonly referred to as gradient boosting on steroids.

The following three images illustrates the differences between Decision Trees, Random Forest and XGBoost.

The XGBoost algorithm in Oracle 20c has over 40 different parameter settings, and with most scenarios the default settings with be fine for most scenarios. Only after creating a baseline model with the details will you look to explore making changes to these. Some of the typical settings include:

  • Booster =  gbtree
  • #rounds for boosting = 10
  • max_depth = 6
  • num_parallel_tree = 1
  • eval_metric = Classification error rate  or  RMSE for regression

 

As with most of the Oracle in-database machine learning algorithms, the setup and defining the parameters is really simple. Here is an example of minimum of parameter settings that needs to be defined.

BEGIN
   -- delete previous setttings
   DELETE FROM banking_xgb_settings;

   INSERT INTO BANKING_XGB_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_xgboost);

   -- For 0/1 target, choose binary:logistic as the objective.
   INSERT INTO BANKING_XGB_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.xgboost_objective, 'binary:logistic);

   commit;
END;

 

To create an XGBoost model run the following.


BEGIN
   DBMS_DATA_MINING.CREATE_MODEL (
      model_name          => 'BANKING_XGB_MODEL',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'BANKING_72K',
      case_id_column_name => 'ID',
      target_column_name  => 'TARGET',
      settings_table_name => 'BANKING_XGB_SETTINGS');
END;

That’s all nice and simple, as it should be, and the new model can be called in the same manner as any of the other in-database machine learning models using functions like PREDICTION, PREDICTION_PROBABILITY, etc.

One of the interesting things I found when experimenting with XGBoost was the time it took to create the completed model. Using the default settings the following table gives the time taken, in seconds to create the model.

As you can see it is VERY quick even for large data sets and gives greater predictive accuracy.

 

MSET (Multivariate State Estimation Technique) in Oracle 20c

Posted on

Oracle 20c Database comes with some new in-database Machine Learning algorithms.

The short name for one of these is called MSET or Multivariate State Estimation Technique. That’s the simple short name. The more complete name is Multivariate State Estimation Technique – Sequential Probability Ratio Test.  That is a long name, and the reason is it consists of two algorithms. The first part looks at creating a model of the training data, and the second part looks at how new data is statistical different to the training data.

 

What are the use cases for this algorithm?  This algorithm can be used for anomaly detection.

Anomaly Detection, using algorithms, is able identifying unexpected items or events in data that differ to the norm. It can be easy to perform some simple calculations and graphics to examine and present data to see if there are any patterns in the data set. When the data sets grow it is difficult for humans to identify anomalies and we need the help of algorithms.

The images shown here are easy to analyze to spot the anomalies and it can be relatively easy to build some automated processing to identify these. Most of these solutions can be considered AI (Artificial Intelligence) solutions as they mimic human behaviors to identify the anomalies, and these example don’t need deep learning, neural networks or anything like that.

Other types of anomalies can be easily spotted in charts or graphics, such as the chart below.

There are many different algorithms available for anomaly detection, and the Oracle Database already has an algorithm called the One-Class Support Vector Machine. This is a variant of the main Support Vector Machine (SVD) algorithm, which maps or transforms the data, using a Kernel function, into space such that the data belonging to the class values are transformed by different amounts. This creates a Hyperplane between the mapped/transformed values and hopefully gives a large margin between the mapped/transformed points. This is what makes SVD very accurate, although it does have some scaling limitations. For a One-Class SVD, a similar process is followed. The aim is for anomalous data to be mapped differently to common or non-anomalous data, as shown in the following diagram.

 

Getting back to the MSET algorithm. Remember it is a 2-part algorithm abbreviated to MSET. The first part is a non-linear, nonparametric anomaly detection algorithm that calibrates the expected behavior of a system based on historical data from the normal sequence of monitored signals. Using data in time series format (DATE, Value) the training data set contains data consisting of “normal” behavior of the data. The algorithm creates a model to represent this “normal”/stationary data/behavior. The second part of the algorithm compares new or live data and calculates the differences between the estimated and actual signal values (residuals). It uses Sequential Probability Ratio Test (SPRT) calculations to determine whether any of the signals have become degraded. As you can imagine the creation of the training data set is vital and may consist of many iterations before determining the optimal training data set to use.

MSET has its origins in computer hardware failures monitoring. Sun Microsystems have been were using it back in the late 1990’s-early 2000’s to monitor and detect for component failures in their servers. Since then MSET has been widely used in power generation plants, airplanes, space travel, Disney uses it for equipment failures, and in more recent times has been extensively used in IOT environments with the anomaly detection focused on signal anomalies.

How does MSET work in Oracle 20c?

An important point to note before we start is, you can use MSET on your typical business data and other data stored in the database. It isn’t just for sensor, IOT, etc data mentioned above and can be used in many different business scenarios.

The first step you need to do is to create the time series data. This can be easily done using a view, but a Very important component is the Time attribute needs to be a DATE format. Additional attributes can be numeric data and these will be used as input to the algorithm for model creation.

-- Create training data set for MSET
CREATE OR REPLACE VIEW mset_train_data
AS SELECT time_id, 
          sum(quantity_sold) quantity,
          sum(amount_sold) amount 
FROM (SELECT * FROM sh.sales WHERE time_id <= '30-DEC-99’)
GROUP BY time_id 
ORDER BY time_id;

The example code above uses the SH schema data, and aggregates the data based on the TIME_ID attribute. This attribute is a DATE data type. The second import part of preparing and formatting the data is Ordering of the data. The ORDER BY is necessary to ensure the data is fed into or processed by the algorithm in the correct time series order.

The next step involves defining the parameters/hyper-parameters for the algorithm. All algorithms come with a set of default values, and in most cases these are suffice for your needs. In that case, you only need to define the Algorithm Name and to turn on Automatic Data Preparation. The following example illustrates this and also includes examples of setting some of the typical parameters for the algorithm.

BEGIN
  DELETE FROM mset_settings;

  -- Select MSET-SPRT as the algorithm
  INSERT  INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.algo_name, dbms_data_mining.algo_mset_sprt);

  -- Turn on automatic data preparation
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

  -- Set alert count
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALERT_COUNT, 3);

  -- Set alert window
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALERT_WINDOW, 5);

  -- Set alpha
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALPHA_PROB, 0.1);

  COMMIT;
END;

To create the MSET model using the MST_TRAIN_DATA view created above, we can run:

BEGIN
--   DBMS_DATA_MINING.DROP_MODEL(MSET_MODEL');

   DBMS_DATA_MINING.CREATE_MODEL (
      model_name          => 'MSET_MODEL',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'MSET_TRAIN_DATA',
      case_id_column_name => 'TIME_ID',
      target_column_name  => '',
      settings_table_name => 'MSET_SETTINGS');
END;

The SELECT statement below is an example of how to call and run the MSET model to label the data to find anomalies. The PREDICTION function will return a values of 0 (zero) or 1 (one) to indicate the predicted values. If the predicted values is 0 (zero) the MSET model has predicted the input record to be anomalous, where as a predicted values of 1 (one) indicates the value is typical. This can be used to filter out the records/data you will want to investigate in more detail.

-- display all dates with Anomalies
SELECT time_id, pred
FROM (SELECT time_id, prediction(mset_sh_model using *) over (ORDER BY time_id) pred 
      FROM mset_test_data)
WHERE pred = 0;

Storing and processing Unicode characters in Oracle

Posted on Updated on

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems (Wikipedia). The standard is maintained by the Unicode Consortium, and contains over 137,994 characters (137,766 graphic characters, 163 format characters and 65 control characters).

The NVARCHAR2 is Unicode data type that can store Unicode characters in an Oracle Database. The character set of the NVARCHAR2 is national character set specified at the database creation time. Use the following to determine the national character set for your database.

SELECT *
FROM nls_database_parameters
WHERE PARAMETER = 'NLS_NCHAR_CHARACTERSET';

For my database I’m using an Oracle Autonomous Database. This query returns the character set AL16UTF16. This character set encodes Unicode data in the UTF-16 encoding and uses 2 bytes to store a character.

When creating an attribute with this data type, the size value (max 4000) determines the number of characters allowed. The actual size of the attribute will be double.

Let’s setup some data to test this data type.

CREATE TABLE demo_nvarchar2 (
   attribute_name NVARCHAR2(100));

INSERT INTO demo_nvarchar2 
VALUES ('This is a test for nvarchar2');

The string is 28 characters long. We can use the DUMP function to see the details of what is actually stored.

SELECT attribute_name, DUMP(attribute_name,1016)
FROM demo_nvarchar2;

The DUMP function returns a VARCHAR2 value that contains the datatype code, the length in bytes, and the internal representation of a value.

 

You can see the difference in the storage size of the NVARCHAR2 and the VARCHAR2 attributes.

Valid values for the return_format are 8, 10, 16, 17, 1008, 1010, 1016 and 1017. These values are assigned the following meanings:


8 – octal notation
10 – decimal notation
16 – hexadecimal notation
17 – single characters
1008 – octal notation with the character set name
1010 – decimal notation with the character set name
1016 – hexadecimal notation with the character set name
1017 – single characters with the character set name

The returned value from the DUMP function gives the internal data type representation. The following table lists the various codes and their description.

Code Data Type
1 VARCHAR2(size [BYTE | CHAR])
1 NVARCHAR2(size)
2 NUMBER[(precision [, scale]])
8 LONG
12 DATE
21 BINARY_FLOAT
22 BINARY_DOUBLE
23 RAW(size)
24 LONG RAW
69 ROWID
96 CHAR [(size [BYTE | CHAR])]
96 NCHAR[(size)]
112 CLOB
112 NCLOB
113 BLOB
114 BFILE
180 TIMESTAMP [(fractional_seconds)]
181 TIMESTAMP [(fractional_seconds)] WITH TIME ZONE
182 INTERVAL YEAR [(year_precision)] TO MONTH
183 INTERVAL DAY [(day_precision)] TO SECOND[(fractional_seconds)]
208 UROWID [(size)]
231 TIMESTAMP [(fractional_seconds)] WITH LOCAL TIMEZONE

 

HiveMall: Transform Categorical features to Numerical

Posted on Updated on

HiveMall is a machine learning library that sits on top of Hive and provides SQL interface to wide range of data preparation and machine learning algorithms.

A common task faced for many machine learning exercises is to convert the data from the format it is captured in (raw data) into a format that is required by the machine learning algorithms. Most ML tools will either have functionality built into the algorithms to do this automatically or will provide functions to allow you to manage this process yourself.

In HiveMall we have the ‘quantified_features’ function and is used for transforming values of non-number columns to indexed numbers, but it does have some unusual but useful features.

In this example I’ll use the titanic data set to illustrate the usage of this feature.

Screenshot 2019-04-29 15.14.42

Here we have a mixture of features with categorical and numerical.

select 
  quantified_features(
    ${output_row}, PassengerId, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked) as features
from (
  select * from titanic
  order by Passengerid asc
) t
limit 5;

and we get the following output

[1.0,0.0,0.0,3.0,0.0,22.0,1.0,0.0,7.25,0.0,1.0]
[2.0,1.0,1.0,1.0,1.0,38.0,1.0,0.0,71.2833,1.0,2.0]
[3.0,1.0,1.0,3.0,1.0,26.0,0.0,0.0,7.9250,0.0,1.0]
[4.0,1.0,1.0,1.0,1.0,35.0,1.0,0.0,53.1,3.0,1.0]
[5.0,1.0,0.0,3.0,0.0,35.0,0.0,0.0,8.05,0.0,1.0]

The ordering within the attributes is important, and some thinking is needed if there is a defined order and you want this reflected in the outputs of the transformed features

If you are a numeric field that you want treated as a categorical, and transformed, you can cast it into a string

e.g.

cast(SibSp as string)

Time Series Forecasting in Oracle – Part 2

Posted on Updated on

This is the second part about time-series data modeling using Oracle. Check out the first part here.

In this post I will take a time-series data set and using the in-database time-series functions model the data, that in turn can be used for predicting future values and trends.

The data set used in these examples is the Rossmann Store Sales data set. It is available on Kaggle and was used in one of their competitions.

Let’s start by aggregating the data to monthly level. We get.

Screenshot 2019-04-16 12.37.59

Data Set-up

Although not strictly necessary, but it can be useful to create a subset of your time-series data to only contain the time related attribute and the attribute containing the data to model. When working with time-series data, the exponential smoothing function expects the time attribute to be of DATE data type. In most cases it does. When it is a DATE, the function will know how to process this and all you need to do is to tell the function the interval.

A view is created to contain the monthly aggregated data.

-- Create input time series
create or replace view demo_ts_data as 
select to_date(to_char(sales_date, 'MON-RRRR'),'MON-RRRR') sales_date,
sum(sales_amt) sales_amt
from demo_time_series
group by to_char(sales_date, 'MON-RRRR')
order by 1 asc;

Next a table is needed to contain the various settings for the exponential smoothing function.

CREATE TABLE demo_ts_settings(setting_name VARCHAR2(30), 
                              setting_value VARCHAR2(128));

Some care is needed with selecting the parameters and their settings as not all combinations can be used.

Example 1 – Holt-Winters

The first example is to create a Holt-Winters time-series model for hour data set. For this we need to set the parameter to include defining the algorithm name, the specific time-series model to use (exsm_holt), the type/size of interval (monthly) and the number of predictions to make into the future, pass the last data point.

BEGIN
   -- delete previous setttings
   delete from demo_ts_settings;

   -- set ESM as the algorithm
   insert into demo_ts_settings 
      values (dbms_data_mining.algo_name,
              dbms_data_mining.algo_exponential_smoothing);

   -- set ESM model to be Holt-Winters
   insert into demo_ts_settings 
      values (dbms_data_mining.exsm_model,
              dbms_data_mining.exsm_holt);

   -- set interval to be month
   insert into demo_ts_settings 
      values (dbms_data_mining.exsm_interval,
              dbms_data_mining.exsm_interval_month);

   -- set prediction to 4 steps ahead
   insert into demo_ts_settings 
      values (dbms_data_mining.exsm_prediction_step,
              '4');

   commit; 
END;

Now we can call the function, generate the model and produce the predicted values.

BEGIN
   -- delete the previous model with the same name
   BEGIN 
      dbms_data_mining.drop_model('DEMO_TS_MODEL');
   EXCEPTION 
      WHEN others THEN null; 
   END;

   dbms_data_mining.create_model(model_name => 'DEMO_TS_MODEL',
                                 mining_function => 'TIME_SERIES',
                                 data_table_name => 'DEMO_TS_DATA',
                                 case_id_column_name => 'SALES_DATE',
                                 target_column_name => 'SALES_AMT',
                                 settings_table_name => 'DEMO_TS_SETTINGS');
END;

When the model is create a number of data dictionary views are populated with model details and some addition views are created specific to the model. One such view commences with DM$VP. Views commencing with this contain the predicted values for our time-series model. You need to append the name of the model create, in our example DEMO_TS_MODEL.

-- get predictions
select case_id, value, prediction, lower, upper 
from   DM$VPDEMO_TS_MODEL
order by case_id;

Screenshot 2019-04-16 16.01.14

When we plot this data we get.

Screenshot 2019-04-16 16.02.57

The blue line contains the original data values and the red line contains the predicted values. The predictions are very similar to those produced using Holt-Winters in Python.

Screenshot 2019-04-16 16.04.45

Example 2 – Holt-Winters including Seasonality

The previous example didn’t really include seasonality int the model and predictions. In this example we introduce seasonality to allow the model to pick up any trends in the data based on a defined period.

For this example we will change the model name to HW_ADDSEA, and the season size to 5 units. A data set with a longer time period would illustrate the different seasons better but this gives you an idea.

BEGIN
   -- delete previous setttings
   delete from demo_ts_settings;

   -- select ESM as the algorithm
   insert into demo_ts_settings 
   values (dbms_data_mining.algo_name,
           dbms_data_mining.algo_exponential_smoothing);

   -- set ESM model to be Holt-Winters Seasonal Adjusted
   insert into demo_ts_settings 
   values (dbms_data_mining.exsm_model,
           dbms_data_mining.exsm_HW_ADDSEA);

   -- set interval to be month
   insert into demo_ts_settings 
   values (dbms_data_mining.exsm_interval,
   dbms_data_mining.exsm_interval_month);

  -- set prediction to 4 steps ahead
  insert into demo_ts_settings 
  values (dbms_data_mining.exsm_prediction_step,
          '4');

   -- set seasonal cycle to be 5 quarters
   insert into demo_ts_settings 
   values (dbms_data_mining.exsm_seasonality,
           '5');

commit; 
END;

We need to re-run the creation of the model and produce the predicted values. This code is unchanged from the previous example.

BEGIN
   -- delete the previous model with the same name
   BEGIN 
      dbms_data_mining.drop_model('DEMO_TS_MODEL');
   EXCEPTION 
      WHEN others THEN null; 
   END;

   dbms_data_mining.create_model(model_name => 'DEMO_TS_MODEL',
                                 mining_function => 'TIME_SERIES',
                                 data_table_name => 'DEMO_TS_DATA',
                                 case_id_column_name => 'SALES_DATE',
                                 target_column_name => 'SALES_AMT',
                                 settings_table_name => 'DEMO_TS_SETTINGS');
END;

When we re-query the DM$VPDEMO_TS_MODEL we get the new values. When plotted we get.

Screenshot 2019-04-16 16.17.30

The blue line contains the original data values and the red line contains the predicted values.

Comparing this chart to the chart from the first example we can see there are some important differences between them. These differences are particularly evident in the second half of the chart, on the right hand side. We get to see there is a clearer dip in the predicted data. This mirrors the real data values better. We also see better predictions as the time line moves to the end.

When performing time-series analysis you really need to spend some time exploring the data, to understand what is happening, visualizing the data, seeing if you can identifying any patterns, before moving onto using the different models. Similarly you will need to explore the various time-series models available and the parameters, to see what works for your data and follow the patterns in your data. There is not magic solution in this case.

Moving Average in SQL (and beyond)

Posted on Updated on

A very common analytics technique for financial and other data is to calculate the moving average. This can allow you to see a different type of pattern in your data that may not is evident from examining the original data.

But how can we calculate the moving average in SQL?

Well, there isn’t a function to do it, but we can use the windowing feature of analytical SQL to do so. The following example was created in an Oracle Database but the same SQL (more or less) will work with most other SQL databases.

SELECT month, 
       SUM(amount) AS month_amount,
       AVG(SUM(amount)) OVER
          (ORDER BY month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS moving_average
FROM  sales
GROUP BY month
ORDER BY month;

This gives us the following with the moving average calculated based on the current value and the three preceding values, if they exist.

    MONTH MONTH_AMOUNT MOVING_AVERAGE
---------- ------------ --------------
         1     58704.52       58704.52
         2      28289.3       43496.91
         3     20167.83       35720.55
         4      50082.9     39311.1375
         5     17212.66     28938.1725
         6     31128.92     29648.0775
         7     78299.47     44180.9875
         8     42869.64     42377.6725
         9     35299.22     46899.3125
        10     43028.38     49874.1775
        11     26053.46      36812.675
        12     20067.28      31112.085

In some analytic languages and databases, they have included a moving average function. For example using HiveMall on Hive we have.

SELECT moving_avg(x, 3) FROM (SELECT explode(array(1.0,2.0,3.0,4.0,5.0,6.0,7.0)) as x) series;

If you are using Python, there is an inbuilt function in Pandas.

rolmean4 = timeseries.rolling(window = 4).mean()

How long does it take to build a Machine Learning model using Oracle Cloud

Posted on Updated on

Everyday someone talks about the the processing power needed for Machine Learning, and the vast computing needed for these tasks. It has become evident that most of these people have never created a machine learning model. Never. But like to make up stuff and try to make themselves look like an expert, or as I and others like to call them a “fake expert”.

When you question these “fake experts” about this topic, they huff and puff about lots of things and never answer the question or try to claim it is so difficult, you simply don’t understand.

Having worked in the area of machine learning for a very very long time, I’ve never really had performance issues with creating models. Yes most of the time I’ve been able to use my laptop. Yes my laptop to build models large models. In a couple of these my laptop couldn’t cope and I moved onto a server.

But over the past few years we keep hearing about using cloud services for machine learning. If you are doing machine learning you need to computing capabilities that are available with cloud services.

So, the results below show the results of building machine learning models, using different algorithms, with different sizes of data sets.

For this test, I used a basic cloud service. Well maybe it isn’t basic, but for others they will consider it very basic with very little compute involved.

I used an Oracle Cloud DBaaS for this experiment. I selected an Oracle 18c Extreme edition cloud service. This comes with the in-database machine learning option. This comes with 1 OCPUs, 7.5G Memory and 170GB storage. This is the basic configuration.

Next I created data sets with different sizes. These were based on one particular data set, as this ensures that as the data set size increases, the same kind of data and processing required remained consistent, instead of using completely different data sets.

The data set consisted of the following number of records, 72K, 660K, 210K, 2M, 10M and 50M.

I then created machine learning models using Decisions Tree, Naive Bayes, Support Vector Machine, Generaliszd Linear Models (GLM) and Neural Networks. Yes it was a typical classification problem.

The following table below shows the length of time in seconds to build the models. All data preparations etc was done prior to this.

Note: It should be noted that Automatic Data Preparation was turned on for these algorithms. This performed additional algorithm specific data preparation for each model. That means the times given in the following tables is for some data preparation time and for building the models.

ml_on_dbaas_1

Converting the above table into minutes.

ml_on_dbaas_2

It is clear that the Neural Network model takes a lot longer to build than all the other algorithms. In this test the Neural Network model had only one hidden layer.
When we chart the build timings, leaving out Neural Networks, we get.
ml_on_dbaas_3 
We can see Naive Bayes, Decision Tree, GLM and SVM algorithms have very similar model build timings, but as the data volumes increase the Decision Tree algorithm become less efficient.
Overall it doesn’t take a long time to build models. In a way it is a very trivial task!
I mentioned at the start of this post I had created a data set of 50M records. Unfortunately I wasn’t able to get models build for this data set using this cloud instance. It used used so much TEMP tablespace that the file volumes on my cloud instance ran out of space!
I suppose if I wanted to go bigger with my data, I needed a bigger boat!
I haven’t included any timings for model scoring using these models. Why? the scored data is immediately returned event for large the largest data sets.

 

RandomForests in R, Python and SQL

Posted on Updated on

I recently wrote a two part article explaining how Random Forests work and how to use them in R, Python and SQL.

These were posted on ToadWorld webpages. Check them out.

Part 1 of article

https://blog.toadworld.com/2018/08/31/random-forest-machine-learning-in-r-python-and-sql-part-1

 

Part 2 of article

https://blog.toadworld.com/2018/09/01/random-forest-machine-learning-in-r-python-and-sql-part-2

R vs Python vs SQL for Machine Learning (Infographic)

Posted on Updated on

Next week I’ll be giving several presentation on machine learning at Oracle Open World and Oracle Code One. In one of these presentation an evaluation of using R vs Python vs SQL will be given and discussed.

Check out the infographic containing the comparisons.

Info Graphic

 

OUG Ireland 2017 Presentation

Posted on

Here are the slides from my presentation at OUG Ireland 2017. All about running R using SQL.

Presentations from OUGN17

Posted on

Here are the presentations I gave at OUG Norway last week. These are also available on SlideShare