Machine Learning

Data Normalization in Oracle Data Mining

Posted on Updated on

Normalization is the process of scaling continuous values down to a specific range, often between zero and one. Normalization transforms each numerical value by subtracting a number, called the shift, and dividing the result by another number called the scale. The normalization techniques include:

  • Min-Max Normalization : There is where the normalization is based on the using the minimum value for the shift and the (maximum-minimum) for the scale.
  • Scale Normalization : This is where the normalization is based on zero being used for the shift and the value calculated using max[abs(max), abs(min)] being used for the scale
  • Z-Score Normalization : This is where the normalization is based on using the mean value for the shift and the standard deviation for the scale.

When using Automatic Data Processing the normalization functions are used. But sometimes you may want to process the data is a more explicit manner. To do so you can use the various normalization function. To use these there is a three stage process. The first stage involves the creation of a table that will contain the normalization transformation data. The second stage applies the normalization procedures to your data source, defines the normalization required and inserts the required transformation data  into the table create during the first stage. The third stage involves the defining of a view that applies the normalization transformations to your data source and displays the output via a database view. The following example illustrates how you can normalize the AGE and YRS_RESIDENCE attributes. The input data source will be the view that was created as the output of the previous transformation (MINING_DATA_V_2). This is passed on the original MINING_DATA_BUILD_V data set. The final output from this transformation step and all the other data transformation steps is MINING_DATA_READY_V.

BEGIN
   -- Clean-up : Drop the previously created tables
   BEGIN
      execute immediate 'drop table TRANSFORM_NORMALIZE';
   EXCEPTION
      WHEN others THEN
         null;
   END;

   -- Stage 1 : Create the table for the transformations
   -- Perform normalization for: AGE and YRS_RESIDENCE
   dbms_data_mining_transform.CREATE_NORM_LIN (
      norm_table_name => 'MINING_DATA_NORMALIZE');       

   -- Step 2 : Insert the normalization data into the table
   dbms_data_mining_transform.INSERT_NORM_LIN_MINMAX (
      norm_table_name => 'MINING_DATA_NORMALIZE',
      data_table_name => 'MINING_DATA_V_2',
      exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                         'affinity_card',
                         'bookkeeping_application',
                         'bulk_pack_diskettes',
                         'cust_id',
                         'flat_panel_monitor',
                         'home_theater_package',
                         'os_doc_set_kanji',
                         'printer_supplies',
                         'y_box_games'));

   -- Stage 3 : Create the view with the transformed data
   DBMS_DATA_MINING_TRANSFORM.XFORM_NORM_LIN (
      norm_table_name => 'MINING_DATA_NORMALIZE',
      data_table_name => 'MINING_DATA_V_2',
      xform_view_name => 'MINING_DATA_READY_V');

END;
/

The above example performs normalization based on the Minimum-Maximum values of the variables/columns. The other normalization functions are:

INSERT_NORM_LIN_SCALE Inserts linear scale normalization definitions in a transformation definition table.
INSERT_NORM_LIN_ZSCORE Inserts linear zscore normalization definitions in a transformation definition table.

 

Advertisements

HiveMall: Docker image setup

Posted on Updated on

In a previous blog post I introduced HiveMall as a SQL based machine learning language available for Hadoop and integrated with Hive.

If you have your own Hadoop/Big Data environment, I provided the installation instructions for Hivemall, in that blog post

An alternative is to use Docker. There is a HiveMall Docker image available. A little warning before using this image. It isn’t updated with the latest release but seems to get updated twice a year. Although you may not be running the latest version of HiveMall, you will have a working environment that will have almost all the functionality, bar a few minor new features and bug fixes.

To get started, you need to make sure you have Docker running on your machine and you have logged into your account. The docker image is available from Docker Hub. Take note of the version number for the latest version of the docker image. In this example it is 20180924

Screenshot 2019-03-04 10.53.57

Open a terminal window and run the following command. This will download and extract all the image files.

docker pull hivemall/latest:20180924

Screenshot 2019-03-04 10.51.34

Until everything is completed.

Screenshot 2019-03-04 11.01.40

This docker image has HDFS, Yarn and MapReduce installed and running. This will require the exposing of the ports for these services 8088, 50070 and 19888.

To start the HiveMall docker image run

docker run -p 8088:8088 -p 50070:50070 -p 19888:19888 -it hivemall/latest:20180924

Consider creating a shell script for this, to make it easier each time you want to run the image.

Screenshot 2019-03-04 11.15.04

Now seed Hive with some data. The typical example uses the IRIS data set.  Run the following command to do this. This script downloads the IRIS data set, creates a number directories and then creates an external table, in Hive, to point to the IRIS data set.

cd $HOME && ./bin/prepare_iris.sh

Screenshot 2019-03-04 11.20.49

Now open Hive and list the databases.

hive -S
hive> show databases;
OK
default
iris
Time taken: 0.131 seconds, Fetched: 2 row(s)

Connect to the IRIS database and list the tables within it.

hive> use iris;
hive> show tables;
iris_raw

Now query the data (150 records)

hive> select * from iris_raw;
1 Iris-setosa [5.1,3.5,1.4,0.2]
2 Iris-setosa [4.9,3.0,1.4,0.2]
3 Iris-setosa [4.7,3.2,1.3,0.2]
4 Iris-setosa [4.6,3.1,1.5,0.2]
5 Iris-setosa [5.0,3.6,1.4,0.2]
6 Iris-setosa [5.4,3.9,1.7,0.4]
7 Iris-setosa [4.6,3.4,1.4,0.3]
8 Iris-setosa [5.0,3.4,1.5,0.2]
9 Iris-setosa [4.4,2.9,1.4,0.2]
10 Iris-setosa [4.9,3.1,1.5,0.1]
11 Iris-setosa [5.4,3.7,1.5,0.2]
12 Iris-setosa [4.8,3.4,1.6,0.2]
13 Iris-setosa [4.8,3.0,1.4,0.1
...

Find the min and max values for each feature.

hive> select 
    > min(features[0]), max(features[0]),
    > min(features[1]), max(features[1]),
    > min(features[2]), max(features[2]),
    > min(features[3]), max(features[3])
    > from
    > iris_raw;

4.3  7.9  2.0  4.4  1.0  6.9  0.1  2.5

You are now up and running with HiveMall on Docker.

HiveML : Using SQL for ML on Big Data

Posted on Updated on

It is widely recognised that SQL is one of the core languages that every data scientist needs to know. Not just know but know really well. If you are going to be working with data (big or small) you are going to use SQL to access the data. You may use some other tools and languages as part of your data science role, but for processing data SQL is king.

During the era of big data and hadoop it was all about moving the code to where the data was located. Over time we have seem a number of different languages and approaches being put forward to allow us to process the data in these big environments. One of the most common one is Spark. As with all languages there can be a large learning curve, and as newer languages become popular, the need to change and learn new languages is becoming a lot more frequent.

We have seen many of the main stream database vendors including machine learning in their databases, thereby allowing users to use machine learning using SQL. In the big data world there has been many attempts to do this, to building some SQL interfaces for machine learning in a big data environment.

One such (newer) SQL machine learning engine is called HiveMall. This will allow anyone with a basic level knowledge of SQL to quickly learn machine learning. Apache Hivemall is built to be a scalable machine learning library that runs on Apache Hive, Apache Spark, and Apache Pig.

Screenshot 2019-02-16 09.46.39

Hivemall is currently at incubator stage under Apache and version 0.6 was released in December 2018.

I’ve a number of big data/hadoop environments in my home lab and build on a couple of cloud vendors (Oracle and AWS). I’ve completed the installation of Hivemall easily on my Oracle BigDataLite VM and my own custom build Hadoop environment on Oracle cloud. A few simple commands you will have Hivemall up and running. Initially installed for just Hive and then updated to use Spark.

Hivemall expands the analytical functions available in Hive, as well as providing data preparation and the typical range of machine learning functions that are necessary for 97+% of all machine learning use cases.

Download the hivemall-core-xxx-with-dependencies.jar file

# Setup Your Environment $HOME/.hiverc
add jar /home/myui/tmp/hivemall-core-xxx-with-dependencies.jar; 
source /home/myui/tmp/define-all.hive;

This automatically loads all Hivemall functions every time you start a Hive session

# Create a directory in HDFS for the JAR 
hadoop fs -mkdir -p /apps/hivemall 
hdfs dfs -chmod -R 777 /apps/hivemall 
cp hivemall-core-0.4.2-rc.2-with-dependencies.jar hivemall-with-dependencies.jar 
hdfs dfs -put hivemall-with-dependencies.jar /apps/hivemall/ 
hdfs dfs -put hivemall-with-dependencies.jar /apps/hive/warehouse

You might want to create a new DB in Hive for your Hivemall work.

CREATE DATABASE IF NOT EXISTS hivemall;
USE hivemall;

Then list all the Hivemall functions

show functions "hivemall.*";

+-----------------------------------------+--+
| tab_name                                |
+-----------------------------------------+--+
| hivemall.add_bias                       |
| hivemall.add_feature_index              |
| hivemall.amplify                        |
| hivemall.angular_distance               |
| hivemall.angular_similarity             |
...

Hivemall for ML using SQL is now up and running. Next step is to do try out the various analytical and ML functions.

 

Ethics in the AI, Machine Learning, Data Science, etc Era

Posted on Updated on

Ethics is one of those topics that everyone has a slightly different definition or view of what it means. The Oxford English dictionary defines ethics as, ‘Moral principles that govern a person’s behaviour or the conducting of an activity‘.

As you can imagine this topic can be difficult to discuss and has many, many different aspects.

In the era of AI, Machine Learning, Data Science, etc the topic of Ethics is finally becoming an important topic. Again there are many perspective on this. I’m not going to get into these in this blog post, because if I did I could end up writing a PhD dissertation on it. 

But if you do work in the area of AI, Machine Learning, Data Science, etc you do need to think about the ethical aspects of what you do. For most people, you will be working on topics where ethics doesn’t really apply. For example, examining log data, looking for trends, etc

But when you start working of projects examining individuals and their behaviours then you do need to examine the ethical aspects of such work. Everyday we experience adverts, web sites, marketing, etc that has used AI, Machine Learning and Data Science to delivery certain product offerings to us.

Just because we can do something, doesn’t mean we should do it.

One particular area that I will not work on is Location Based Advertising. Imagine walking down a typical high street with lots and lots of retail stores. Your phone vibrates and on the screen there is a message. The message is a special offer or promotion for one of the shops a short distance ahead of you. You are being analysed. Your previous buying patterns and behaviours are being analysed, Your location and direction of travel is being analysed. Some one, or many AI applications are watching you. This is not anything new and there are lots of examples of this from around the world.

But what if this kind of Location Based Advertising was taken to another level. What if the shops had cameras that monitored the people walking up and down the street. What if those cameras were analysing you, analysing what clothes you are wearing, analysing the brands you are wearing, analysing what accessories you have, analysing your body language, etc. They are trying to analyse if you are the kind of person they want to sell to. They then have staff who will come up to you, as you are walking down the street, and will have customised personalised special offers on products in their store, just for you.

See the segment between 2:00 and 4:00 in this video.  This gives you an idea of what is possible.

Are you Ok with this?

As an AI, Machine Learning, Data Science professional, are you Ok with this?

The technology exists to make this kind of Location Based Marketing possible. This will be an increasing ethical consideration over the coming years for those who work in the area of AI, Machine Learning, Data Science, etc

Just because we can, doesn’t mean we should!

Machine Learning Models in Python – How long does it take

Posted on Updated on

We keep hearing from people about all the computing resources needed for machine learning. Sometimes it can put people off from trying it as they will think I don’t have those kind of resources.

This is another blog post in my series on ‘How long does it take to create a machine learning model?

Check out my previous blog post that used data sets containing 72K, 210K, 660K, 2M and 10M records.

There was some surprising results in those these.

In this test, I’ll be using Python and SciKitLearn package to create models using the same algorithms. There are a few things to keep in mind. Firstly, although they maybe based on the same algorithms, the actual implementation of them will be different in each environment (SQL vs Python).

With using Python for machine learning, one of the challenges we have is getting access to the data. Assuming the data lives in a Database then time is needed to extract that data to the local Python environment. Secondly, when using Python you will be using a computer with significantly less computing resources than a Database server. In this test I used my laptop (MacBook Pro). Thirdly, when extracting the data from the database, what method should be used.

I’ve addressed these below and the Oracle Database I used was the DBaaS I used in my first experiment. This is a Database hosted on Oracle Cloud.

Extracting Data to CSV File
This kind of depends on how you do this. There are hundreds of possibilities available to you, but if you are working with an Oracle Database you will probably be using SQL Developer. I used the ‘export’ option to create a CSV file for each of the data sets. The following table shows how long it took for each data set.

As you can see this is an incredibly slow way of exporting this data. Like I said, there are quicker ways of doing this.

After downloading the data sets, the next step is to see how load it takes to load these CSV files into a pandas data frame in Python. The following table show the timings in seconds.

You can see that Python is very efficient at loading these data sets into a pandas data frame in my Python environment.

Extracting Data using cx_Oracle Python package
As I’ll be using Python to create the models and the data exists in an Oracle Database (on Oracle Cloud), I can use the cx_Oracle package to download the data sets into my Python environment. After using the cx_Oracle package to download the data I then converted it into a pandas data frame.

You can see that using cx_Oracle to download the data is a very efficient way of accessing the data.  But if the data already exists in CSV files, then the previous method would be quicker to use.

I had the array fetch size set to 10,000.  I also experimented with smaller and larger numbers for the array fetch size, but 10,000 seemed to give a quickest results.

How long to create Machine Learning Models in Python
Now we get onto checking out the timings of how long it takes to create a number of machine learning models using different algorithms and using the default settings. The algorithms include Naive Bayes, Decision Tree, GLM, SVM and Neural Networks.

I had to stop including SVM in the tests as it was taking way too long to run. For example I killed the SVM model build on the 210K data set after it was running for 5 hours.

The Neural Network models created had 3 hidden layers.

In addition to creating the models, there was some minor data preparation steps performed including factorizing, normalization and one-hot-coding. This data preparation would be comparable to the automatic data preparation steps performed by Oracle, although Oracle Automatic Data Preparation does a bit of extra work.

At the point I would encourage you to look back at my previous blog posts on timings using Oracle DBaaS and ADW.  You will see that Python, in these test cases, was quicker at creating the machine learning models. But with Python the data needed to be extracted from the database and that can take time!

A separate consideration is being able to deploy the models. The time it takes to build models is perhaps not the main consideration. You need to consider ease of deployment and use of the models.

How long does it take to build a Machine Learning model using Oracle Cloud

Posted on Updated on

Everyday someone talks about the the processing power needed for Machine Learning, and the vast computing needed for these tasks. It has become evident that most of these people have never created a machine learning model. Never. But like to make up stuff and try to make themselves look like an expert, or as I and others like to call them a “fake expert”.

When you question these “fake experts” about this topic, they huff and puff about lots of things and never answer the question or try to claim it is so difficult, you simply don’t understand.

Having worked in the area of machine learning for a very very long time, I’ve never really had performance issues with creating models. Yes most of the time I’ve been able to use my laptop. Yes my laptop to build models large models. In a couple of these my laptop couldn’t cope and I moved onto a server.

But over the past few years we keep hearing about using cloud services for machine learning. If you are doing machine learning you need to computing capabilities that are available with cloud services.

So, the results below show the results of building machine learning models, using different algorithms, with different sizes of data sets.

For this test, I used a basic cloud service. Well maybe it isn’t basic, but for others they will consider it very basic with very little compute involved.

I used an Oracle Cloud DBaaS for this experiment. I selected an Oracle 18c Extreme edition cloud service. This comes with the in-database machine learning option. This comes with 1 OCPUs, 7.5G Memory and 170GB storage. This is the basic configuration.

Next I created data sets with different sizes. These were based on one particular data set, as this ensures that as the data set size increases, the same kind of data and processing required remained consistent, instead of using completely different data sets.

The data set consisted of the following number of records, 72K, 660K, 210K, 2M, 10M and 50M.

I then created machine learning models using Decisions Tree, Naive Bayes, Support Vector Machine, Generaliszd Linear Models (GLM) and Neural Networks. Yes it was a typical classification problem.

The following table below shows the length of time in seconds to build the models. All data preparations etc was done prior to this.

Note: It should be noted that Automatic Data Preparation was turned on for these algorithms. This performed additional algorithm specific data preparation for each model. That means the times given in the following tables is for some data preparation time and for building the models.

ml_on_dbaas_1

Converting the above table into minutes.

ml_on_dbaas_2

It is clear that the Neural Network model takes a lot longer to build than all the other algorithms. In this test the Neural Network model had only one hidden layer.
When we chart the build timings, leaving out Neural Networks, we get.
ml_on_dbaas_3 
We can see Naive Bayes, Decision Tree, GLM and SVM algorithms have very similar model build timings, but as the data volumes increase the Decision Tree algorithm become less efficient.
Overall it doesn’t take a long time to build models. In a way it is a very trivial task!
I mentioned at the start of this post I had created a data set of 50M records. Unfortunately I wasn’t able to get models build for this data set using this cloud instance. It used used so much TEMP tablespace that the file volumes on my cloud instance ran out of space!
I suppose if I wanted to go bigger with my data, I needed a bigger boat!
I haven’t included any timings for model scoring using these models. Why? the scored data is immediately returned event for large the largest data sets.

 

Lessor known Apache Machine Learning Languages

Posted on Updated on

Machine learning is a very popular topic in recent times, and we keep hearing about languages such as R, Python and Spark. In addition to these we have commercially available machine learning languages and tools from SAS, IBM, Microsoft, Oracle, Google, Amazon, etc., etc. Everyone want a slice of the machine learning market!

The Apache Foundation supports the development of new open source projects in a number of areas. One such area is machine learning. If you have read anything about machine learning you will have come across Spark, and maybe you might believe that everyone is using it. Sadly this isn’t true for lots of reasons, but it is very popular. Spark is one of the project support by the Apache Foundation.

But are there any other machine learning projects being supported under the Apache Foundation that are an alternative to Spark? The follow lists the alternatives and lessor know projects: (most of these are incubator/retired/graduated Apache projects)

Flink Flink is an open source system for expressive, declarative, fast, and efficient data analysis. Stratosphere combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. Flink was originally known as Stratosphere when it entered the Incubator.

Documentation

(graduated)

HORN HORN is a neuron-centric programming APIs and execution framework for large-scale deep learning, built on top of Apache Hama.

Wiki Page

 

(Retired)

HiveMail Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs

Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta. Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta.

Documentation

(incubator)

MADlib Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Key features include: Operate on the data locally in-database. Do not move data between multiple runtime environments unnecessarily; Utilize best of breed database engines, but separate the machine learning logic from database specific implementation details; Leverage MPP shared nothing technology, such as the Greenplum Database and Apache HAWQ (incubating), to provide parallelism and scalability.

Documentation

(graduated)

MXNet A Flexible and Efficient Library for Deep Learning . MXNet provides optimized numerical computation for GPUs and distributed ecosystems, from the comfort of high-level environments like Python and R MXNet automates common workflows, so standard neural networks can be expressed concisely in just a few lines of code.

Webpage

(incubator)

OpenNLP OpenNLP is a machine learning based toolkit for the processing of natural language text. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.

Documentation

(graduated)

PredictionIO PredictionIO is an open source Machine Learning Server built on top of state-of-the-art open source stack, that enables developers to manage and deploy production-ready predictive services for various kinds of machine learning tasks.

Documentation

(graduated)

SAMOA SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza.

Documentation

(incubator)

SINGA SINGA is a distributed deep learning platform. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models.

Documentation

(incubator)

Storm Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language.

Documentation

(graduated)

SystemML SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations such as Apache Hadoop MapReduce and Apache Spark.

Documentation

(graduated)

Big data ml

I will have a closer look that the following SQL based machine learning languages in a lager blog post:

MADlib

Storm