Oracle

Partitioned Models – Oracle Machine Learning (OML)

Posted on Updated on

Building machine learning models can be a relatively trivial task. But getting to that point and understanding what to do next can be challenging. Yes the task of creating a model is simple and usually takes a few line of code. This is what is shown in most examples. But when you try to apply to real world problems we are faced with other challenges. Some of which include volume of data is larger, building efficient ML pipelines is challenging, time to create models gets longer, applying models to new data in real-time takes longer (not possible in real-time), etc. Yes these are typically challenges and most of these can be easily overcome.

When building ML solutions for real-world problem you will be faced with building (and deploying) many 10s or 100s of ML models. Why are so many models needed? Almost every example we see for ML takes the entire data set and build a model on that data. When you think about it, not everyone in the data set can be considered in the same grouping (similar characteristics). If we were to build a model on the data set and apply it to new data, we will get a generic prediction. A prediction comparing the new data item (new customer, purchase, etc) with everyone else in the data population. Maybe this is why so many ML project fail as they are building generic solution that performs badly when run on new (and evolving) data.

To overcome this we start to look at the different groups of data in the data set. Can the data set be divided into a number of different parts based on some characteristics. If we could do this and build a separate model on each group (or cluster), then we would have ML models that would be more accurate with their predictions. This is where we will end up creating 10s or 100s of models. As you can imagine the work involved in doing this with be LOTs. Then think about all the coding needed to manage all of this. What about the complexity of all the code needed for making the predictions on new data.

Yes all of this gets complex very, very quickly!
Ideally we want a separate model for each group

But how can you do that efficiently? is it possible?

When working with Oracle Machine Learning, you can use a feature called partitioned models. Partitioned Models are designed to handle this type of problem. They are designed to:

  • make the building of models simple
  • scales as the data and number of partitions increase
  • includes all the steps part of the ML pipeline (all the data prep, transformations, etc)
  • make predicting new data using the ML model simple
  • make the deployment of the ML model easy
  • make the MLOps process simple
  • make the use of ML model easy to use by all developers no matter the programming language
  • make the ML model build and ML model scoring quick and with better, more accurate predictions.

Screenshot 2020-06-15 11.11.42

Let us work through an example. In this example lets start by creating a Random Forest ML model using the entire data set. The following code shows setting up the Parameters settings table. The second code segment creates the Random Forest ML model. The training data set being used in this example contains 72,000 records.

BEGIN
  DELETE FROM BANKING_RF_SETTINGS;

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_random_forest);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

 COMMIT;
END;
/

-- Create the ML model
DECLARE
   v_start_time  TIMESTAMP;
BEGIN
   DBMS_DATA_MINING.DROP_MODEL('BANKING_RF_72K_1');

   v_start_time := current_timestamp;

   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'BANKING_RF_72K_1',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'BANKING_72K',
      case_id_column_name => 'ID',
      target_column_name  => 'TARGET',
      settings_table_name => 'BANKING_RF_SETTINGS');

   dbms_output.put_line('Time take to create model = ' || to_char(extract(second from (current_timestamp-v_start_time))) || ' seconds.');
END;
/

This is the basic setup and the following table illustrates how long the CREATE_MODEL function takes to run for different sizes of training datasets and with different number of trees per model. The default number of trees is 20.

Screenshot 2020-06-15 12.19.51

To run this model against new data we could use something like the following SQL query.

SELECT cust_id, target,
       prediction(BANKING_RF_72K_1 USING *)  predicted_value,
       prediction_probability(BANKING_RF_72K_1 USING *) probability
FROM   bank_test_v;

This is simple and straight forward to use.

For the 72,000 records it takes just approx 5.23 seconds to create the model, which includes creating 20 Decision Trees. As mentioned earlier, this will be a generic model covering the entire data set.

To create a partitioned model, we can add new parameter which lists the attributes to use to partition the data set. For example, if the partition attribute is MARITAL, we see it has four different values. This means when this attribute is used as the partition attribute, Oracle Machine Learning will create four separate sub Random Forest models all until the one umbrella model. This means the above SQL query to run the model, does not change and the correct sub model will be selected to run on the data based on the value of MARITAL attribute.

To create this partitioned model you need to add the following to the settings table.

BEGIN
  DELETE FROM BANKING_RF_SETTINGS;

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_random_forest);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

  INSERT INTO banking_RF_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.odms_partition_columns, 'MARITAL’);

COMMIT;
END;
/

The code to create the model remains the same!

The code to call and use the model remains the same!

This keeps everything very simple and very easy to use.

When I ran the CREATE_MODEL code for the partitioned model, it took approx 8.3 seconds to run. Yes it took slightly longer than the previous example, but this time it is creating four models instead of one. This is still very quick!

What if I wanted to add more attributes to the partition key? Yes you can do that. The more attributes you add, the more sub-models will be be created.

For example, if I was to add JOB attribute to the partition key list. I will now get 48 sub-models (with 20 Decision Trees each) being created. The JOB attribute has 12 distinct values, multiplied by the 4 values for MARITAL, gives us 48 models.

INSERT INTO banking_RF_settings (setting_name, setting_value)
VALUES (dbms_data_mining.odms_partition_columns, 'MARITAL,JOB');

How long does this take the CREATE_MODEL code to run? approx 37 seconds!

Again that is quick!

Again remember the code to create the model and to run the model to predict on new data does not change. This means our applications using this ML model does not change. This shows us we can very easily increase the predictive accuracy of our models with only adding one additional model, and by improving this accuracy by adding more attributes to the partition key.

But you do need to be careful with what attributes to include in the partition key. If the attributes have a very high number of distinct values, will result in 100s, or 1000s of sub models being created.

An important benefit of using partitioned models is when a new distinct value occurs in one of the partition key attributes. You code to create the parameters and models does not change. OML will automatically will pick this up and do all the work under the hood.

 

New Oracle Machine Learning Features in 19c and 20c

Posted on

Here are links to blog posts and articles I’ve written about the new features of Oracle Machine Learning in 19c (and previous) and 20c.

I’ve given a presentation on these topics at ACES@Home and Yatra online conferences.

Each of the following links will explain each of the algorithms, and gives demo code for you to try.

  • RandomForest

https://oralytics.com/2020/06/24/randomforest-machine-learning-oracle-machine-learning-oml/

  • Neural Networks

https://developer.oracle.com/databases/neural-network-machine-learning.html

  • Time Series Forecasting

https://oralytics.com/2019/04/15/time-series-forecasting-in-oracle-part-1/

https://oralytics.com/2019/04/23/time-series-forecasting-in-oracle-part-2/

  • XGBoost

https://oralytics.com/2020/04/27/xgboost-in-oracle-20c/

  • Multivariate State Estimation Technique (MSET)

https://oralytics.com/2020/04/13/mset-multivariate-state-estimation-technique-in-oracle-20c/

  • Partitioned Models

https://oralytics.com/2020/07/13/partitioned-models-oracle-machine-learning-oml/

  • Parallel Model Creation

https://oralytics.com/2020/07/27/creating-oml-models-in-parallel/

 

GoLang: Links to blog posts – working with Oracle Database

Posted on Updated on

This post is to serve as a link to my other blog posts on using GoLang and connecting to/working with data from an Oracle Database.

 

Connecting Go Lang to Oracle Database

The database driver/library got renamed. The following post goes through how to updated to new name.

GoLang: Oracle driver/library renamed to : godror

GoLang: Querying records from Oracle Database using goracle

GoLang: Inserting records into Oracle Database using goracle

Importance of setting Fetched Rows size for Database Query using Golang

GoLang – Consuming Oracle REST API from an Oracle Cloud Database)

Machine Learning with Go Lang

 

This post will be updated with new GoLang posts.

 

 

XGBoost in Oracle 20c

Posted on Updated on

Another of the new machine learning algorithms in Oracle 20c Database is called XGBoost. Most people will have come across this algorithm due to its recent popularity with winners of Kaggle competitions and other similar events.

XGBoost is an open source software library providing a gradient boosting framework in most of the commonly used data science, machine learning and software development languages. It has it’s origins back in 2014, but the first official academic publication on the algorithm was published in 2016 by Tianqi Chen and Carlos Guestrin, from the University of Washington.

The algorithm builds upon the previous work on Decision Trees, Bagging, Random Forest, Boosting and Gradient Boosting. The benefits of using these various approaches are well know, researched, developed and proven over many years. XGBoost can be used for the typical use cases of Classification including classification, regression and ranking problems. Check out the original research paper for more details of the inner workings of the algorithm.

Regular machine learning models, like Decision Trees, simply train a single model using a training data set, and only this model is used for predictions. Although a Decision Tree is very simple to create (and very very quick to do so) its predictive power may not be as good as most other algorithms, despite providing model explainability. To overcome this limitation ensemble approaches can be used to create multiple Decision Trees and combines these for predictive purposes. Bagging is an approach where the predictions from multiple DT models are combined using majority voting. Building upon the bagging approach Random Forest uses different subsets of features and subsets of the training data, combining these in different ways to create a collection of DT models and presented as one model to the user. Boosting takes a more iterative approach to refining the models by building sequential models with each subsequent model is focused on minimizing the errors of the previous model. Gradient Boosting uses gradient descent algorithm to minimize errors in subsequent models. Finally with XGBoost builds upon these previous steps enabling parallel processing, tree pruning, missing data treatment, regularization and better cache, memory and hardware optimization. It’s commonly referred to as gradient boosting on steroids.

The following three images illustrates the differences between Decision Trees, Random Forest and XGBoost.

The XGBoost algorithm in Oracle 20c has over 40 different parameter settings, and with most scenarios the default settings with be fine for most scenarios. Only after creating a baseline model with the details will you look to explore making changes to these. Some of the typical settings include:

  • Booster =  gbtree
  • #rounds for boosting = 10
  • max_depth = 6
  • num_parallel_tree = 1
  • eval_metric = Classification error rate  or  RMSE for regression

 

As with most of the Oracle in-database machine learning algorithms, the setup and defining the parameters is really simple. Here is an example of minimum of parameter settings that needs to be defined.

BEGIN
   -- delete previous setttings
   DELETE FROM banking_xgb_settings;

   INSERT INTO BANKING_XGB_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_xgboost);

   -- For 0/1 target, choose binary:logistic as the objective.
   INSERT INTO BANKING_XGB_SETTINGS (setting_name, setting_value)
   VALUES (dbms_data_mining.xgboost_objective, 'binary:logistic);

   commit;
END;

 

To create an XGBoost model run the following.


BEGIN
   DBMS_DATA_MINING.CREATE_MODEL (
      model_name          => 'BANKING_XGB_MODEL',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'BANKING_72K',
      case_id_column_name => 'ID',
      target_column_name  => 'TARGET',
      settings_table_name => 'BANKING_XGB_SETTINGS');
END;

That’s all nice and simple, as it should be, and the new model can be called in the same manner as any of the other in-database machine learning models using functions like PREDICTION, PREDICTION_PROBABILITY, etc.

One of the interesting things I found when experimenting with XGBoost was the time it took to create the completed model. Using the default settings the following table gives the time taken, in seconds to create the model.

As you can see it is VERY quick even for large data sets and gives greater predictive accuracy.

 

MSET (Multivariate State Estimation Technique) in Oracle 20c

Posted on

Oracle 20c Database comes with some new in-database Machine Learning algorithms.

The short name for one of these is called MSET or Multivariate State Estimation Technique. That’s the simple short name. The more complete name is Multivariate State Estimation Technique – Sequential Probability Ratio Test.  That is a long name, and the reason is it consists of two algorithms. The first part looks at creating a model of the training data, and the second part looks at how new data is statistical different to the training data.

 

What are the use cases for this algorithm?  This algorithm can be used for anomaly detection.

Anomaly Detection, using algorithms, is able identifying unexpected items or events in data that differ to the norm. It can be easy to perform some simple calculations and graphics to examine and present data to see if there are any patterns in the data set. When the data sets grow it is difficult for humans to identify anomalies and we need the help of algorithms.

The images shown here are easy to analyze to spot the anomalies and it can be relatively easy to build some automated processing to identify these. Most of these solutions can be considered AI (Artificial Intelligence) solutions as they mimic human behaviors to identify the anomalies, and these example don’t need deep learning, neural networks or anything like that.

Other types of anomalies can be easily spotted in charts or graphics, such as the chart below.

There are many different algorithms available for anomaly detection, and the Oracle Database already has an algorithm called the One-Class Support Vector Machine. This is a variant of the main Support Vector Machine (SVD) algorithm, which maps or transforms the data, using a Kernel function, into space such that the data belonging to the class values are transformed by different amounts. This creates a Hyperplane between the mapped/transformed values and hopefully gives a large margin between the mapped/transformed points. This is what makes SVD very accurate, although it does have some scaling limitations. For a One-Class SVD, a similar process is followed. The aim is for anomalous data to be mapped differently to common or non-anomalous data, as shown in the following diagram.

 

Getting back to the MSET algorithm. Remember it is a 2-part algorithm abbreviated to MSET. The first part is a non-linear, nonparametric anomaly detection algorithm that calibrates the expected behavior of a system based on historical data from the normal sequence of monitored signals. Using data in time series format (DATE, Value) the training data set contains data consisting of “normal” behavior of the data. The algorithm creates a model to represent this “normal”/stationary data/behavior. The second part of the algorithm compares new or live data and calculates the differences between the estimated and actual signal values (residuals). It uses Sequential Probability Ratio Test (SPRT) calculations to determine whether any of the signals have become degraded. As you can imagine the creation of the training data set is vital and may consist of many iterations before determining the optimal training data set to use.

MSET has its origins in computer hardware failures monitoring. Sun Microsystems have been were using it back in the late 1990’s-early 2000’s to monitor and detect for component failures in their servers. Since then MSET has been widely used in power generation plants, airplanes, space travel, Disney uses it for equipment failures, and in more recent times has been extensively used in IOT environments with the anomaly detection focused on signal anomalies.

How does MSET work in Oracle 20c?

An important point to note before we start is, you can use MSET on your typical business data and other data stored in the database. It isn’t just for sensor, IOT, etc data mentioned above and can be used in many different business scenarios.

The first step you need to do is to create the time series data. This can be easily done using a view, but a Very important component is the Time attribute needs to be a DATE format. Additional attributes can be numeric data and these will be used as input to the algorithm for model creation.

-- Create training data set for MSET
CREATE OR REPLACE VIEW mset_train_data
AS SELECT time_id, 
          sum(quantity_sold) quantity,
          sum(amount_sold) amount 
FROM (SELECT * FROM sh.sales WHERE time_id <= '30-DEC-99’)
GROUP BY time_id 
ORDER BY time_id;

The example code above uses the SH schema data, and aggregates the data based on the TIME_ID attribute. This attribute is a DATE data type. The second import part of preparing and formatting the data is Ordering of the data. The ORDER BY is necessary to ensure the data is fed into or processed by the algorithm in the correct time series order.

The next step involves defining the parameters/hyper-parameters for the algorithm. All algorithms come with a set of default values, and in most cases these are suffice for your needs. In that case, you only need to define the Algorithm Name and to turn on Automatic Data Preparation. The following example illustrates this and also includes examples of setting some of the typical parameters for the algorithm.

BEGIN
  DELETE FROM mset_settings;

  -- Select MSET-SPRT as the algorithm
  INSERT  INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.algo_name, dbms_data_mining.algo_mset_sprt);

  -- Turn on automatic data preparation
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);

  -- Set alert count
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALERT_COUNT, 3);

  -- Set alert window
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALERT_WINDOW, 5);

  -- Set alpha
  INSERT INTO mset_sh_settings (setting_name, setting_value)
  VALUES(dbms_data_mining.MSET_ALPHA_PROB, 0.1);

  COMMIT;
END;

To create the MSET model using the MST_TRAIN_DATA view created above, we can run:

BEGIN
--   DBMS_DATA_MINING.DROP_MODEL(MSET_MODEL');

   DBMS_DATA_MINING.CREATE_MODEL (
      model_name          => 'MSET_MODEL',
      mining_function     => dbms_data_mining.classification,
      data_table_name     => 'MSET_TRAIN_DATA',
      case_id_column_name => 'TIME_ID',
      target_column_name  => '',
      settings_table_name => 'MSET_SETTINGS');
END;

The SELECT statement below is an example of how to call and run the MSET model to label the data to find anomalies. The PREDICTION function will return a values of 0 (zero) or 1 (one) to indicate the predicted values. If the predicted values is 0 (zero) the MSET model has predicted the input record to be anomalous, where as a predicted values of 1 (one) indicates the value is typical. This can be used to filter out the records/data you will want to investigate in more detail.

-- display all dates with Anomalies
SELECT time_id, pred
FROM (SELECT time_id, prediction(mset_sh_model using *) over (ORDER BY time_id) pred 
      FROM mset_test_data)
WHERE pred = 0;

Benchmarking calling Oracle Machine Learning using REST

Posted on Updated on

Over the past year I’ve been presenting, blogging and sharing my experiences of using REST to expose Oracle Machine Learning models to developers in other languages, for example Python.

One of the questions I’ve been asked is, Does it scale?

Although I’ve used it in several projects to great success, there are no figures I can report publicly on how many REST API calls can be serviced 😦

But this can be easily done, and the results below are based on using and Oracle Autonomous Data Warehouse (ADW) on the Oracle Always Free.

The machine learning model is built on a Wine reviews data set, using Oracle Machine Learning Notebook as my tool to write some SQL and PL/SQL to build out a model to predict Good or Bad wines, based on the Prices and other characteristics of the wine. A REST API was built using this model to allow for a developer to pass in wine descriptors and returns two values to indicate if it would be a Good or Bad wine and the probability of this prediction.

No data is stored in the database. I only use the machine learning model to make the prediction

I built out the REST API using APEX, and here is a screenshot of the GET API setup.

Here is an example of some Python code to call the machine learning model to make a prediction.

import json
import requests

country = 'Portugal'
province = 'Douro'
variety = 'Portuguese Red'
price = '30'

resp = requests.get('https://jggnlb6iptk8gum-adw2.adb.us-ashburn-1.oraclecloudapps.com/ords/oml_user/wine/wine_pred/'+country+'/'+province+'/'+'variety'+'/'+price)
json_data = resp.json()
print (json.dumps(json_data, indent=2))

—–

{
  "pred_wine": "LT_90_POINTS",
  "prob_wine": 0.6844716987704507
}

But does this scale, as in how many concurrent users and REST API calls can it handle at the same time.

To test this I multi-threaded processes in Python to call a Python function to call the API, while ensuring a range of values are used for the input parameters. Some additional information for my tests.

  • Each function call included two REST API calls
  • Test effect of creating X processes, at same time
  • Test effect of creating X processes in batches of Y agents
  • Then, the above, with function having one REST API call and also having two REST API calls, to compare timings
  • Test in range of parallel process from 10 to 1,000 (generating up to 2,000 REST API calls at a time)

Some of the results. The table shows the time(*) in seconds to complete the number of processes grouped into batches (agents). My laptop was the limiting factor in these tests. It wasn’t able to test when the number of parallel processes when above 500. That is why I broke them into batches consisting of X agents

* this is the total time to run all the Python code, including the time taken to create each process.

Some observations:

  • Time taken to complete each function/process was between 0.45 seconds and 1.65 seconds, for two API calls.
  • When only one API call, time to complete each function/process was between 0.32 seconds and 1.21 seconds
  • Average time for each function/process was 0.64 seconds for one API functions/processes, and 0.86 for two API calls in function/process
  • Table above illustrates the overhead associated with setting up, calling, and managing these processes

As you can see, even with the limitations of my laptop, using an Oracle Database, in-database machine learning and REST can be used to create a Micro-Service type machine learning scoring engine. Based on these numbers, this machine learning micro-service would be able to handle and process a large number of machine learning scoring in Real-Time, and these numbers would be well within the maximum number of such calls in most applications. I’m sure I could process more parallel processes if I deployed on a different machine to my laptop and maybe used a number of different machines at the same time

How many applications within you enterprise needs to process move than 6,000 real-time machine learning scoring per minute?  This shows us the Oracle Always Free offering is capable and suitable for most applications.

Now, if you are processing more than those numbers per minutes then perhaps you need to move onto the paid options.

What next? I’ll spin up two VMs on Oracle Always Free, install Python, copy code into these VMs and have then run in parallel 🙂

 

Irish Whiskey Distilleries Data Set

Posted on Updated on

I’ve been building some Irish Whiskey data sets, and the first of these data sets contains details of all the Whiskey Distilleries in Ireland. This page contains the following:

  • Table describing the attributes/features of the data set
  • Data set, in a scroll able region
  • Download data set in different formats
  • Map of the Distilleries
  • Subscribe to Twitter List containing these Distilleries, and some Twitter Hash Tags
  • How to send me updates, corrections and details of Distilleries I’ve missed

If you use this data set (and my other data sets) make sure to add a reference back to data set webpage. Let me know if you use the data set is an interesting way, share the details with me and I’ll share it on my blog and social media for you.

This data set will have it’s own Irish Distilleries webpage and all updates to the data set and other information will be made there. Check out that webpage for the latest version of things.

Data Set Description

Data set contains 45 Distilleries.

ATTRIBUTE NAME DESCRIPTION
Distillery Name of the Distillery
County County / Area where distillery is located
Address Full address of the distillery
EIRCODE EirCode for distillery in Ireland. Distilleries in Northern Ireland will not have an EIRCODE
NI_Postcode Post code of distilleries located in Northern Ireland
Tours Does the distillery offer tours to visitors (Yes/No)
Web_Site Web site address
Twitter The twitter name of the distillery
Lat Latitude for the distillery
Long Longitude for the distillery
Notes_Parent_Company Contains other descriptive information about the distillery, founded by, parent company, etc.

Data Set (scroll able region)

Data set contains 45 Distilleries.

DISTILLERY COUNTY ADDRESS EIRCODE NI_POSTCODE TOURS WEB_SITE TWITTER LAT LONG NOTES_PARENT_COMPANY
Ballykeefe Distillery Kilkenny Kyle, Ballykeefe, Cuffsgrange, County Kilkenny, R95 NR50, Ireland R95 NR50 Yes https://ballykeefedistillery.ie  @BallykeefeD 52.602034 -7.375774 Ging Family
Belfast Distillery Antrim Crumlin Road Goal, Crumlin Road, Belfast, BT14 6ST, United Kingdom BT14 6ST No http://www.belfastdistillery.com  @BDCIreland 54.609718 -5.941994 J&J McConnell
Blacks Distillery Cork Farm Lane, Kinsale, Co. Cork P17 XW70 No https://www.blacksbrewery.com  @BlacksBrewery 51.710969 -8.515579
Blackwater Waterford Church Road, Ballinlevane East, Ballyduff, Co. Waterford, P51 C5C6 P51 C5C6 No https://blackwaterdistillery.ie/  @BlackDistillery 52.147581 -8.052973
Boann Louth Lagavooren, Platin Rd., Drogheda, Co. Louth, A92 X593 A92 X593 Yes http://boanndistillery.ie/  @Boanndistillery 53.69459 -6.366558 Cooney Family
Bow Street Dublin Bow St, Smithfield Village, Dublin 7 D07 N9VH Yes https://www.jamesonwhiskey.com/en-IE/visit-us/jameson-distillery-bow-st  @jamesonireland 53.348415 -6.277266 Pernod Ricard
Bushmills Distillery Antrim 2 Distillery Rd, Bushmills BT57 8XH, United Kingdom BT57 8XH Yes https://bushmills.com  @BushmillsGlobal 55.202936 -6.517221
Cape Clear Cork Cape Clear Island, Knockannamaurnagh, Skibbereen, Co. Cork P81 RX70 No https://www.capecleardistillery.com/  @capedistillery 51.4509 -9.483047
Clonakilty Cork The Waterfront, Clonakilty, Co. Cork P85 EW82 Yes https://www.clonakiltydistillery.ie/  @clondistillery 51.62165 -8.8855 Scully Family
Connacht Whiskey Distillery Mayo Belleek, Ballina, Co Mayo, F26 P932 F26 P932 Yes https://connachtwhiskey.com  @connachtwhiskey 54.122131 -9.143779
Cooley Distillery Louth Dundalk Rd, Maddox Garden, Carlingford, Dundalk, Co. Louth A91 FX98 Yes 53.996544 -6.221563 Beam Suntory
Copeland Distillery Down 43 Manor Street, Donaghadee, Co Down, Northern Ireland, BT21 0HG BT21 0HG Yes https://copelanddistillery.com @CopelandDistill 54.642699 -5.532739
Dingle Distillery Kerry Farranredmond, DIngle, Co. Kerry V92 E7YD Yes https://dingledistillery.ie/  @DingleWhiskey 52.141928 -10.289287
Dublin Liberties Dublin 33 Mill Street, Dublin 8, D08 V221 D08 V221 Yes https://thedld.com  @WeAreTheDLD 53.337343 -6.276367
Echlinville Distillery Down 62 Gransha Rd, Kircubbin, Newtownards BT22 1AJ, United Kingdom BT22 1AJ Yes https://echlinville.com/  @Echlinville 54.46909 -5.509397
Glendalough Wicklow Unit 9 Newtown Business And Enterprise Centre, Newtown Mount Kennedy, Co. Wicklow, A63 A439 A63 A439 No https://www.glendaloughdistillery.com/  @GlendaloughDist 53.085011 -6.1074016 Mark Anthony Brands International
Great Northern Distillery Louth Carrickmacross Road, Dundalk, Co. Louth, Ireland, A91 P8W9 A91 P8W9 No https://gndireland.com/  @GNDistillery 54.001574 -6.40964 Teeling Family, formally of Cooley Distillery
Hinch Distillery Down 19 Carryduff Road, Boardmills, Ballynahinch, Down, United Kingdom BT27 6TZ No https://hinchdistillery.com/  @hinchdistillery 54.461021 -5.903713
Kilbeggan Distillery Westmeath Lower Main St, Aghamore, Kilbeggan, Co. Westmeath, Ireland N91 W67N Yes https://www.kilbegganwhiskey.com  @Kilbeggan 53.369369 -7.502809 Beam Suntory
Kinahan’s Distillery Dublin 44 Fitzwilliam Place, Dublin D02 P027 No https://kinahanswhiskey.com @KinahansLL Sources Whiskey from around ireland
Lough Gill Sligo Hazelwood Avenue, Cams, Co. Sligo F91 Y820 F91 Y820 Yes https://www.athru.com/  @athruwhiskey 54.255318 -8.433156
Lough Mask Mayo Drioglann Loch Measc Teo, Killateeaun, Tourmakeady, Co. Mayo F12 PK75 Yes https://www.loughmaskdistillery.com/  @lough_mask 53.611819 -9.444077 David Raethorne
Lough Ree Longford Main Street, Lanesborough, Co. Longford N39 P229 No https://www.lrd.ie  @LoughReeDistill 53.673328 -7.99043
Matt D’Arcy Down 27 St Marys St, Newry BT34 2AA, United Kingdom BT34 2AA No http://www.mattdarcys.com  @mattdarcys 54.172817 -6.339367
Midleton Distillery Cork Old Midleton Distillery, Distillery Walk, Midleton, Co. Cork.  P25 Y394 P25 Y394 Yes https://www.jamesonwhiskey.com/en-IE/visit-us/jameson-distillery-midleton  @jamesonireland 51.916344 -8.165174 Pernod Ricard
Nephin Mayo Nephin Whiskey Company, Nephin Square, Lahardane, Co. Mayo F26 W2H9 No http://nephinwhiskey.com/  @NephinWhiskey 54.029011 -9.32211
Pearse Lyons Distillery Dublin 121-122 James’s Street Dublin 8, D08 ET27 D08 ET27 Yes https://www.pearselyonsdistillery.com  @PLDistillery 53.343708 -6.289351
Powerscourt Wicklow Powerscourt Estate, Enniskerry, Co. Wicklow, A98 A9T7 A98 A9T7 Yes https://powerscourtdistillery.com/  @PowerscourtDist 53.184167 -6.190794
Rademon Estate Distillery Down Rademon Estate Distillery, Downpatrick, County Down, United Kingdom BT30 9HR Yes https://rademonestatedistillery.com  @RademonEstate 54.396039 -5.790968
Roe & Co Dublin 92 James’s Street, Dublin 8 D08 YYW9 Yes https://www.roeandcowhiskey.com 53.343731 -6.285673
Royal Oak Distillery Carlow Clorusk Lower, Royaloak, Co. Carlow R21 KR23 Yes https://royaloakdistillery.com/  @royaloakwhiskey 52.703341 -6.978711 Saronno
Scotts Irish Distillery Fermanagh Main Street, Garrison, Co Fermanagh, BT93 4ER, United Kingdom BT93 4ER No http://scottsirish.com 54.417726 -8.083534
Skellig Six 18 Distillery Kerry Valentia Rd, Garranearagh, Cahersiveen, Co. Kerry, V23 YD89 V23 YD89 Yes https://skelligsix18distillery.ie  @SkelligSix18 51.935701 -10.239549
Slane Castle Distillery Meath Slane Castle, Slane, Co. Meath C15 F224 Yes https://www.slaneirishwhiskey.com/  @slanewhiskey 53.711065 -6.562735 Brown-Forman & Conyngham Family
Sliabh Liag Donegal Line Road, Carrick, Co Donegal, F94 X9DX F94 X9DX Yes https://www.sliabhliagdistillers.com/  @sliabhliagdistl 54.6545 -8.633847
Teeling Whiskey Distillery Dublin 13-17 Newmarket, The Liberties, Dublin 8, D08 KD91 D08 KD91 Yes https://teelingwhiskey.com/  @TeelingWhiskey 53.337862 -6.277123 Teeling Family
The Quiet Man Derry 10 Rossdowney Rd, Londonderry BT47 6NS, United Kingdom BT47 6NS No http://www.thequietmanirishwhiskey.com/  @quietmanwhiskey 54.995344 -7.301312 Niche Drinks
The Shed Distillery Leitrim Carrick on shannon Road, Drumshanbo, Co. Leitrim N41 R6D7 No http://thesheddistillery.com/  @SHEDDISTILLERY 54.047145 -8.04358
Thomond Gate Distillery Limerick No https://thomondgatewhiskey.com/ @ThomondW Nicholas Ryan
Tipperary Tipperary Newtownadam, Cahir, Co. Tipperary No http://tipperarydistillery.ie/  @TippDistillery 52.358622 -7.881875
Tullamore Distillery Offaly Bury Quay, Tullamore, Co. Offaly R35 XW13 Yes https://www.tullamoredew.com  @TullamoreDEW 53.377774 -7.492944
Walsh Whiskey Distillery Carlow Equity House, Deerpark Business Park, Dublin Rd, Carlow R93 K7W4 No http://walshwhiskey.com/  @walshwhiskey 52.853417 -6.883916 Walsh Family
Waterford Distillery Waterford 9 Mary Street, Grattan Quay, Waterford City, Co. Waterford X91 KF51 No https://waterfordwhisky.com/  @waterforddram 52.264308 -7.120997 Renegade Spirits Ireland Ltd
Wayward Irish Distillery Kerry Lakeview House & Estate, Fossa Road, Maulagh, Killarney, Co. Kerry, V93 F7Y5 V93 F7Y5 No https://www.waywardirish.com  @wayward_irish 52.071045 -9.590709 O’Connell Fomily
West Cork Distillers Cork Marsh Rd, Marsh, Skibbereen, Co. Cork P81 YY31 No http://www.westcorkdistillers.com/  @WestCorkDistill 51.557804 -9.268941

Download Data Set

Irish_Whiskey_Distilleries – Excel Spreadsheet

Irish_Whiskey_Distilleries.csv – Zipped CSV file

I’ll be adding some additional formats soon.

Map of Distilleries

Here is a map with the Distilleries plotted using Google Maps.

Screenshot 2020-02-13 15.22.40

Twitter Lists & Twitter Hash Tags

I’ve created a Twitter list containing the twitter accounts for all of these distilleries. You can subscribe to the list to get all the latest posts from these distilleries

Irish Whishkey Distillery Twitter List

Have a look out for these twitter hash tags on a Friday, Saturday or Sunday night, as people from around the world share what whiskeys they are tasting that evening. Irish and Scotish Whiskies are the most common.

#FridayNightDram
#FridayNightSip
#SaturdayNightSip
#SaturdayNightDram
#SundayNightSip
#SundayNightDram

How to send me updates, corrections and details of Distilleries I’ve missed

Let me know, via the my Contact page,  if you see any errors in the data set, especially if I’m missing any distilleries.

Storing and processing Unicode characters in Oracle

Posted on Updated on

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems (Wikipedia). The standard is maintained by the Unicode Consortium, and contains over 137,994 characters (137,766 graphic characters, 163 format characters and 65 control characters).

The NVARCHAR2 is Unicode data type that can store Unicode characters in an Oracle Database. The character set of the NVARCHAR2 is national character set specified at the database creation time. Use the following to determine the national character set for your database.

SELECT *
FROM nls_database_parameters
WHERE PARAMETER = 'NLS_NCHAR_CHARACTERSET';

For my database I’m using an Oracle Autonomous Database. This query returns the character set AL16UTF16. This character set encodes Unicode data in the UTF-16 encoding and uses 2 bytes to store a character.

When creating an attribute with this data type, the size value (max 4000) determines the number of characters allowed. The actual size of the attribute will be double.

Let’s setup some data to test this data type.

CREATE TABLE demo_nvarchar2 (
   attribute_name NVARCHAR2(100));

INSERT INTO demo_nvarchar2 
VALUES ('This is a test for nvarchar2');

The string is 28 characters long. We can use the DUMP function to see the details of what is actually stored.

SELECT attribute_name, DUMP(attribute_name,1016)
FROM demo_nvarchar2;

The DUMP function returns a VARCHAR2 value that contains the datatype code, the length in bytes, and the internal representation of a value.

 

You can see the difference in the storage size of the NVARCHAR2 and the VARCHAR2 attributes.

Valid values for the return_format are 8, 10, 16, 17, 1008, 1010, 1016 and 1017. These values are assigned the following meanings:


8 – octal notation
10 – decimal notation
16 – hexadecimal notation
17 – single characters
1008 – octal notation with the character set name
1010 – decimal notation with the character set name
1016 – hexadecimal notation with the character set name
1017 – single characters with the character set name

The returned value from the DUMP function gives the internal data type representation. The following table lists the various codes and their description.

Code Data Type
1 VARCHAR2(size [BYTE | CHAR])
1 NVARCHAR2(size)
2 NUMBER[(precision [, scale]])
8 LONG
12 DATE
21 BINARY_FLOAT
22 BINARY_DOUBLE
23 RAW(size)
24 LONG RAW
69 ROWID
96 CHAR [(size [BYTE | CHAR])]
96 NCHAR[(size)]
112 CLOB
112 NCLOB
113 BLOB
114 BFILE
180 TIMESTAMP [(fractional_seconds)]
181 TIMESTAMP [(fractional_seconds)] WITH TIME ZONE
182 INTERVAL YEAR [(year_precision)] TO MONTH
183 INTERVAL DAY [(day_precision)] TO SECOND[(fractional_seconds)]
208 UROWID [(size)]
231 TIMESTAMP [(fractional_seconds)] WITH LOCAL TIMEZONE

 

OCI Data Science – Create a Project & Notebook, and Explore the Interface

Posted on Updated on

In my previous blog post I went through the steps of setting up OCI to allow you to access OCI Data Science. Those steps showed the setup and configuration for your Data Science Team.

Screenshot 2020-02-11 20.46.42

In this post I will walk through the steps necessary to create an OCI Data Science Project and Notebook, and will then Explore the basic Notebook environment.

1 – Create a Project

From the main menu on the Oracle Cloud home page select Data Science -> Projects from the menu.

Screenshot 2020-02-12 12.07.19

Select the appropriate Compartment in the drop-down list on the left hand side of the screen. In my previous blog post I created a separate Compartment for my Data Science work and team. Then click on the Create Projects button.

Screenshot 2020-02-12 12.09.11Enter a name for your project. I called this project, ‘DS-Demo-Project’. Click Create button.

Screenshot 2020-02-12 12.13.44

Screenshot 2020-02-12 12.14.44

That’s the Project created.

2 – Create a Notebook

After creating a project (see above) you can not create one or many Notebook Sessions.

To create a Notebook Session click on the Create Notebook Session button (see the above image).  This will create a VM to contain your notebook and associated work. Just like all VM in Oracle Cloud, they come in various different shapes. These can be adjusted at a later time to scale up and then back down based on the work you will be performing.

The following example creates a Notebook Session using the basic VM shape. I call the Notebook ‘DS-Demo-Notebook’. I also set the Block Storage size to 50G, which is the minimum value. The VNC details have been defaulted to those assigned to the Compartment. Click Create button at the bottom of the page.

Screenshot 2020-02-12 12.22.24

The Notebook Session VM will be created. This might take a few minutes. When created you will see a screen like the following.

Screenshot 2020-02-12 12.31.21

3 – Open the Notebook

After completing the above steps you can now open the Notebook Session in your browser.  Either click on the Open button (see above image), or copy the link and share with your data science team.

Important: There are a few important considerations when using the Notebooks. While the session is running you will be paying for it, even if the session got terminated at the browser or you lost connect. To manage costs, you may need to stop the Notebook session. More details on this in a later post.

After clicking on the Open button, a new browser tab will open and will ask you to log-in.

Screenshot 2020-02-12 12.35.26

After logging in you will see your Notebook.

Screenshot 2020-02-12 12.37.42

4 – Explore the Notebook Environment

The Notebook comes pre-loaded with lots of goodies.

The menu on the left-hand side provides a directory with lots of sample Notebooks, access to the block storage and a sample getting started Notebook.

Screenshot 2020-02-12 12.41.09

When you are ready to create your own Notebook you can click on the icon for that.

Screenshot 2020-02-12 12.42.50

Or if you already have a Notebook, created elsewhere, you can load that into your OCI Data Science environment.

Screenshot 2020-02-12 12.44.50

The uploaded Notebook will appear in the list on the left-hand side of the screen.

OCI Data Science – Initial Setup and Configuration

Posted on Updated on

After a very, very, very long wait (18+ months) Oracle OCI Data Science platform is now available.

But before you jump straight into using OCI Data Science, there is a little bit of setup required for your Cloud Tenancy. There is the easy simple approach and then there is the slightly more involved approach. These are

  • Simple approach. Assuming you are just going to use the root tenancy and compartment, you just need to setup a new policy to enable the use of the OCI Data Science services. This assuming you have your VNC configuration complete with NAT etc. This can be done by creating a policy with the following policy statement. After creating this you can proceed with creating your first notebook in OCI Data Science.
allow service datascience to use virtual-network-family in tenancy

Screenshot 2020-02-11 19.46.38

  • Slightly more complicated approach. When you get into having a team based approach you will need to create some additional Oracle Cloud components to manage them and what resources are allocated to them. This involved creating Compartments, allocating users, VNCs, Policies etc. The following instructions brings you through these steps

IMPORTANT: After creating a Compartment or some of the other things listed below, and they are not displayed in the expected drop-down lists etc, then either refresh your screen or log-out and log back in again!

1. Create a Group for your Data Science Team & Add Users

The first step involves creating a Group to ‘group’ the various users who will be using the OCI Data Science services.

Go to Governance and Administration ->Identity and click on Groups.

Enter some basic descriptive information. I called my Group, ‘my-data-scientists’.

Now click on your Group in the list of Groups and add the users to the group.

You may need to create the accounts for the various users.

Screenshot 2020-02-11 12.03.58

2. Create a Compartment for your Data Science work

Now create a new Compartment to own the network resources and the Data Science resources.

Go to Governance and Administration ->Identity and click on Compartments.

Enter some basic descriptive information. I’ve called my compartment, ‘My-DS-Compartment’.

3. Create Network for your Data Science work

Creating and setting up the VNC can be a little bit of fun. You can do it the manual way whereby you setup and configure everything. Or you can use the wizard to do this. I;m going to show the wizard approach below.

But the first thing you need to do is to select the Compartment the VNC will belong to. Select this from the drop-down list on the left hand side of the Virtual Cloud Network page. If your compartment is not listed, then log-out and log-in!

To use the wizard approach click the Networking QuickStart button.

Screenshot 2020-02-11 20.15.28

Select the option ‘VCN with Internet Connectivity and click Start Workflow, as you will want to connect to it and to allow the service to connect to other cloud services.

Screenshot 2020-02-11 20.17.22

I called my VNC ‘My-DS-vnc’ and took the default settings. Then click the Next button.

Screenshot 2020-02-11 20.19.31

The next screen shows a summary of what will be done. Click the Create button, and all of these networking components will be created.

Screenshot 2020-02-11 20.22.39

All done with creating the VNC.

4. Create required Policies enable OCI Data Science for your Compartment

There are three policies needed to allocated the necessary resources to the various components we have just created. To create these go to Governance and Administration ->Identity and click on Policies.

Select your Compartment from the drop-down list. This should be ‘My-DS-Compartment’, then click on Create Policy.

The first policy allocates a group to a compartment for the Data Science services. I called this policy, ‘DS-Manage-Access’.

allow group My-data-scientists to manage data-science-family in compartment My-DS-Compartment

Screenshot 2020-02-11 20.30.10

The next policy is to give the Data Science users access to the network resources. I called this policy, ‘DS-Manage-Network’.

allow group My-data-scientists to use virtual-network-family in compartment My-DS-Compartment

Screenshot 2020-02-11 20.37.47

And the third policy is to give Data Science service access to the network resources. I called this policy, ‘DS-Network-Access’.

allow service datascience to use virtual-network-family in compartment My-DS-Compartment

Screenshot 2020-02-11 20.41.01

Job Done 🙂

You are now setup to run the OCI Data Science service.  Check out my Blog Post on creating your first OCI Data Science Notebook and exploring what is available in this Notebook.

Python-Connecting to multiple Oracle Autonomous DBs in one program

Posted on Updated on

More and more people are using the FREE Oracle Autonomous Database for building new new applications, or are migrating to it.

I’ve previously written about connecting to an Oracle Database using Python. Check out that post for details of how to setup Oracle Client and the Oracle Python library cx_Oracle.

In thatblog post I gave examples of connecting to an Oracle Database using the HostName (or IP address), the Service Name or the SID.

But with the Autonomous Oracle Database things are a little bit different. With the Autonomous Oracle Database (ADW or ATP) you will need to use an Oracle Wallet file. This file contains some of the connection details, but you don’t have access to ServiceName/SID, HostName, etc.  Instead you have the name of the Autonomous Database. The Wallet is used to create a secure connection to the Autonomous Database.

You can download the Wallet file from the Database console on Oracle Cloud.

Screenshot 2020-01-10 12.24.10

Most people end up working with multiple database. Sometimes these can be combined into one TNSNAMES file. This can make things simple and easy. To use the download TNSNAME file you will need to set the TNS_ADMIN environment variable. This will allow Python and cx_Oracle library to automatically pick up this file and you can connect to the ATP/ADW Database.

But most people don’t work with just a single database or use a single TNSNAMES file. In most cases you need to switch between different database connections and hence need to use multiple TNSNAMES files.

The question is how can you switch between ATP/ADW Database using different TNSNAMES files while inside one Python program?

Use the os.environ setting in Python. This allows you to reassign the TNS_ADMIN environment variable to point to a new directory containing the TNSNAMES file. This is a temporary assignment and over rides the TNS_ADMIN environment variable.

For example,

import cx_Oracle
import os

os.environ['TNS_ADMIN'] = "/Users/brendan.tierney/Dropbox/wallet_ATP"

p_username = ''p_password = ''p_service = 'atp_high'
con = cx_Oracle.connect(p_username, p_password, p_service)

print(con)
print(con.version)
con.close()

I can now easily switch to another ATP/ADW Database, in the same Python program, by changing the value of os.environ and opening a new connection.

import cx_Oracle
import os

os.environ['TNS_ADMIN'] = "/Users/brendan.tierney/Dropbox/wallet_ATP"
p_username = ''
p_password = ''
p_service = 'atp_high'
con1 = cx_Oracle.connect(p_username, p_password, p_service)
...
con1.close()

...
os.environ['TNS_ADMIN'] = "/Users/brendan.tierney/Dropbox/wallet_ADW2"
p_username = ''
p_password = ''
p_service = 'ADW2_high'
con2 = cx_Oracle.connect(p_username, p_password, p_service)
...
con2.close()

As mentioned previously the setting and resetting of TNS_ADMIN using os.environ, is only temporary, and when your Python program exists or completes the original value for this environment variable will remain.

Applying a Machine Learning Model in OAC

Posted on Updated on

There are a number of different tools and languages available for machine learning projects. One such tool is Oracle Analytics Cloud (OAC).  Check out my article for Oracle Magazine that takes you through the steps of using OAC to create a Machine Learning workflow/dataflow.

Screenshot 2019-12-19 14.31.24

Oracle Analytics Cloud provides a single unified solution for analyzing data and delivering analytics solutions to businesses. Additionally, it provides functionality for processing data, allowing for data transformations, data cleaning, and data integration. Oracle Analytics Cloud also enables you to build a machine learning workflow, from loading, cleaning, and transforming data and creating a machine learning model to evaluating the model and applying it to new data—without the need to write a line of code. My Oracle Magazine article takes you through the various tasks for using Oracle Analytics Cloud to build a machine learning workflow.

That article covers the various steps with creating a machine learning model. This post will bring you through the steps of using that model to score/label new data.

In the Data Flows screen (accessed via Data->Data Flows) click on Create. We are going to create a new Data Flow to process the scoring/labeling of new data.

Screenshot 2019-12-19 15.08.39

Select Data Flow from the pop-up menu. The ‘Add Data Set’ window will open listing your available data sets. In my example, I’m going to use the same data set that I used in the Oracle Magazine article to build the model.  Click on the data set and then click on the Add button.

Screenshot 2019-12-19 15.14.44

The initial Data Flow will be created with the node for the Data Set. The screen will display all the attributes for the data set and from this you can select what attributes to include or remove. For example, if you want a subset of the attributes to be used as input to the machine learning model, you can select these attributes at this stage. These can be adjusted at a later stages, but the data flow will need to be re-run to pick up these changes.

Screenshot 2019-12-19 15.17.48

Next step is to create the Apply Model node. To add this to the data flow click on the small plus symbol to the right of the Data Node. This will pop open a window from which you will need to select the Apply Model.

Screenshot 2019-12-19 15.22.40

A pop-up window will appear listing the various machine learning models that exist in your OAC environment. Select the model you want to use and click the Ok button.

Screenshot 2019-12-19 15.24.42

Screenshot 2019-12-19 15.25.22

The next node to add to the data flow is to save the results/outputs from the Apply Model node. Click on the small plus icon to the right of the Apply Model node and select Save Results from the popup window.

Screenshot 2019-12-19 15.27.50.png

We now have a completed data flow. But before you finish edit the Save Data node to give a name for the Save Data Set, and you can edit what attributes/features you want in the result set.

Screenshot 2019-12-19 15.30.25.png

You can now save and run the Data Flow, and view the outputs from applying the machine learning model. The saved data set results can be viewed in the Data menu.

Screenshot 2019-12-19 15.35.11