Machine Learning

Transforming Missing Data using Oracle Data Mining

In a previous post I showed how you can normalize data with the in-database machine learning features, using the DBMS_DATA_MINING_TRANSFORM package. The same package can be used to perform many more data transformations with standardized routines. When it comes to missing data, where some case records have no value for an attribute, you have a number of options open to you. The first is to evaluate the degree of missing values for the attribute across the data set as a whole. If it is very high, you may want to remove that attribute from the data set. But when only a small number or percentage of values are missing, you will want to substitute an appropriate or approximate value. Such a value is typically calculated using the mean (for numerical attributes) or the mode (for categorical attributes).
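
As a rough illustration of that first check, here is a minimal Python sketch that measures the fraction of missing values per attribute and splits the attributes into "drop" and "impute" groups. The sample rows, the column names and the 0.5 cut-off are all assumptions made up for this example, not values from the Oracle sample schema.

```python
# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 34, "education": "HS-grad", "comments": None},
    {"age": None, "education": "Bach.", "comments": None},
    {"age": 51, "education": None, "comments": None},
    {"age": 45, "education": "HS-grad", "comments": "ok"},
]


def missing_ratio(rows):
    """Return the fraction of missing (None) values for each attribute."""
    total = len(rows)
    return {
        col: sum(1 for r in rows if r[col] is None) / total
        for col in rows[0]
    }


DROP_THRESHOLD = 0.5  # assumed cut-off, tune this for your own data

ratios = missing_ratio(rows)
# Attributes with too many gaps are candidates for removal.
to_drop = [col for col, ratio in ratios.items() if ratio > DROP_THRESHOLD]
# Attributes with only some gaps are candidates for mean/mode replacement.
to_impute = [col for col, ratio in ratios.items() if 0 < ratio <= DROP_THRESHOLD]

print(ratios)
print("drop:", to_drop, "impute:", to_impute)
```

Where you set the threshold is a judgment call that depends on the data set and on how important the attribute is to the model.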

To build this up using the DBMS_DATA_MINING_TRANSFORM package, we need to follow a simple three stage process. The first stage cleans up any transformation tables left over from a previous run. The second stage creates the tables that will contain the details of the transformations, and then runs the transformation routines that calculate the replacement values and insert the necessary records into those tables. This needs to be done for both numerical and categorical attributes. For the final stage you create a new view that contains the data from the original table with the missing data rules generated in the second stage applied to it. The following example illustrates the first two stages for numerical and categorical attributes in the MINING_DATA_BUILD_V data set.

-- Transform missing data for numeric attributes
-- Stage 1 : Clean up, if previous run
--    transformed missing data for numeric and categorical
--    attributes.
BEGIN
   --
   -- Clean-up : Drop the previously created tables
   --
   BEGIN
      execute immediate 'drop table TRANSFORM_MISSING_NUMERIC';
   EXCEPTION
      WHEN others THEN
         null;
   END;

   BEGIN
      execute immediate 'drop table TRANSFORM_MISSING_CATEGORICAL';
   EXCEPTION
      WHEN others THEN
         null;
   END;

Now for stage 2, which creates the transformation tables and calculates the replacement values for the numerical and categorical attributes.

-- Stage 2 : Perform the transformations
--    Exclude any attributes you don't want transformed
--      e.g. the case id and the target attribute

   --
   -- Transform the numeric attributes
   --
   dbms_data_mining_transform.CREATE_MISS_NUM (
      miss_table_name => 'TRANSFORM_MISSING_NUMERIC');

   dbms_data_mining_transform.INSERT_MISS_NUM_MEAN (
    miss_table_name => 'TRANSFORM_MISSING_NUMERIC',
    data_table_name => 'MINING_DATA_BUILD_V',
    exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                       'affinity_card',
                       'cust_id'));

   --
   -- Transform the categorical attributes
   --
   dbms_data_mining_transform.CREATE_MISS_CAT (
      miss_table_name => 'TRANSFORM_MISSING_CATEGORICAL');

   dbms_data_mining_transform.INSERT_MISS_CAT_MODE (
      miss_table_name => 'TRANSFORM_MISSING_CATEGORICAL',
      data_table_name => 'MINING_DATA_BUILD_V',
      exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                         'affinity_card',
                         'cust_id'));
END;

When the above code completes, the two transformation tables, TRANSFORM_MISSING_NUMERIC and TRANSFORM_MISSING_CATEGORICAL, will exist in your schema.

Querying these two tables shows the table attributes along with the value to be used to replace the missing value. For example, the following illustrates the missing data transformations for the categorical data.

SELECT col, 
       val 
FROM transform_missing_categorical;

For the sample data set used in these examples we get:

COL                       VAL
------------------------- -------------------------
CUST_GENDER               M
CUST_MARITAL_STATUS       Married
COUNTRY_NAME              United States of America
CUST_INCOME_LEVEL         J: 190,000 - 249,999
EDUCATION                 HS-grad
OCCUPATION                Exec.
HOUSEHOLD_SIZE            3

For stage three you will need to create a new view (MINING_DATA_V). This combines the data from the original table with the missing data rules generated in the second stage. It is built in two steps. An initial view (MINING_DATA_MISS_V) merges the data source and the transformations for the missing numeric attributes. This view then has the transformations for the missing categorical attributes applied to it, creating a new view called MINING_DATA_V that contains all the missing data transformations.

BEGIN
   -- xform input data to replace missing values
   -- The data source is MINING_DATA_BUILD_V
   -- The output is MINING_DATA_MISS_V

   DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_NUM(
      miss_table_name => 'TRANSFORM_MISSING_NUMERIC',
      data_table_name => 'MINING_DATA_BUILD_V',
      xform_view_name => 'MINING_DATA_MISS_V');

   -- xform input data to replace missing values
   -- The data source is MINING_DATA_MISS_V
   -- The output is MINING_DATA_V
   DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_CAT(
      miss_table_name => 'TRANSFORM_MISSING_CATEGORICAL',
      data_table_name => 'MINING_DATA_MISS_V',
      xform_view_name => 'MINING_DATA_V');
END;

You can now query the MINING_DATA_V view and see that the data displayed will not contain any null values for any of the attributes.
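
To make the logic of the last two stages easier to see outside the database, here is a minimal Python sketch of the same idea: build a lookup of replacement values (mean for numeric attributes, mode for categorical ones), then apply it to produce a filled copy of the data. This is not how Oracle implements it. The function names and the sample rows are hypothetical and only mirror the shape of the example above.

```python
from statistics import mean, mode

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"cust_id": 1, "age": 30, "cust_gender": "M"},
    {"cust_id": 2, "age": None, "cust_gender": "F"},
    {"cust_id": 3, "age": 50, "cust_gender": None},
    {"cust_id": 4, "age": 40, "cust_gender": "M"},
]


def build_replacements(rows, exclude):
    """Stage 2: one replacement value per attribute (mean or mode)."""
    replacements = {}
    for col in rows[0]:
        if col in exclude:
            continue
        present = [r[col] for r in rows if r[col] is not None]
        if not present:
            continue  # nothing to learn from an entirely empty attribute
        if all(isinstance(v, (int, float)) for v in present):
            replacements[col] = mean(present)   # numeric: use the mean
        else:
            replacements[col] = mode(present)   # categorical: use the mode
    return replacements


def apply_replacements(rows, replacements):
    """Stage 3: return filled copies of the rows, leaving the source untouched."""
    return [
        {col: replacements.get(col) if val is None else val
         for col, val in r.items()}
        for r in rows
    ]


replacements = build_replacements(rows, exclude={"cust_id"})
filled = apply_replacements(rows, replacements)
print(replacements)
print(filled)
```

Keeping the replacement values in their own structure, separate from the data, is the same design choice as the transformation tables: the rules can be inspected, adjusted, and reapplied to new data later.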

 

Examples of using Machine Learning on Video and Photo in Public

Over the past 18 months or so most of the examples of using machine learning have been about looking at images and identifying objects in them. There are the typical examples of examining pictures looking for a cat or a dog, or some famous person, etc. Most of these examples are very noddy, although they do illustrate important concepts.

But what if this same technology was used to monitor people going about their daily lives? What if pictures and/or video were captured of you as you walked down the street, or on your way to work or to a meeting? These pictures and videos are being taken of you without you knowing.

And this raises a wide range of ethical concerns. There are the ethics of deploying such solutions in the public domain, but there are also ethical concerns for the data scientists, machine learning engineers, and other people working on these projects. “Just because we can, doesn’t mean we should”. People working on one of these projects need to decide if they should be working on it and, if not, what they can do.

Ethics are the principles of behavior based on ideas of right and wrong. Ethical principles often focus on ideas such as fairness, respect, responsibility, integrity, quality, transparency and trust. There is a lot in that statement on ethics, but we all need to consider what is right and what is wrong. And beyond plainly wrong, what about the grey-ish, borderline scenarios?

Here are some examples that might fall into the grey-ish space between right and wrong. They might fall more towards the wrong because most people are not aware their image is being captured and used, not just for a particular purpose at capture time, but longer term to allow better machine learning models to be built.

Can you imagine walking down the street with a digital display in front of you? That display is monitoring you, and others, and then presents personalized adverts aimed specifically at you. A classic example of this is in the film Minority Report. This is no longer science fiction.

This is happening at the Westfield shopping center in London and in other cities across the UK and Europe. These digital advertisement screens are monitoring people, identifying their personal characteristics and then customizing the adverts to match the profile of the people walking past. This solution has been developed and rolled out by Ocean Outdoor. They are using machine learning to profile individual people based on gender, age, facial hair, eye wear, mood, engagement, attention time, group size, etc. They then use this information for:

  1. Optimisation – delivering the appropriate creative to the right audience at the right time.
  2. Visualise – Gaze recognition to trigger creative or an interactive experience
  3. AR Enabled – Using the HD cameras to create an augmented reality mirror or window effect, creating deep consumer engagement via the latest technology
  4. Analytics – Understanding your brand’s audience, post campaign analysis and creative testing

Face Plus Plus can monitor people walking down the street and do similar profiling, and can bring it to another level whereby they can identify what clothing you are wearing and what the brand is. Imagine if you combine this with location based services. As an example, imagine you are walking down the high street or through a major retail district. People approach you trying to entice you into a particular store, and they offer certain discounts. But you are with a friend, and the store is not interested in them.

The store is using video monitoring, capturing details of every person who is walking down the street and about to pass the store. The video is analyzed using machine/deep learning to work out your profile and what brands you are wearing. The store has a team of people who are deployed to stop and engage with certain individuals, just because they match the brands or interests of the store. Depending on what brands you are wearing, they can offer customized discounts and offers to you.

How comfortable would you be with this? How comfortable would you be about going shopping now?

For me, I would not like this at all, but I can understand why stores and retail outlets are interested, as they are all working in a very competitive market trying to maximize every dollar or euro they can get.

Alongside the ethical concerns, we also have some legal aspects to consider. Some of these are in a grey-ish area, as some aspects of these kinds of scenarios are partly addressed by EU GDPR and the EU Artificial Intelligence guidelines. But what about other countries around the world? When it comes to training and deploying these facial models, they are dependent on having a good training data set. This means they need lots and lots of pictures of people, and these pictures need to be labelled with descriptive information about the person. These public deployments of facial recognition systems will need more and more training samples/pictures, to allow the models to improve and evolve over time. But how will these applications get these new pictures? They claim they don’t keep any of the images of people. They only take the picture, use the model on it, and then perform some action. They claim they do not keep the images! But how then can they improve and evolve their solution?

I’ll have another blog post giving more examples of how machine/deep learning, video and image captures are being used to monitor people going about their daily lives.

 

HiveMall: Transform Categorical features to Numerical

HiveMall is a machine learning library that sits on top of Hive and provides a SQL interface to a wide range of data preparation and machine learning algorithms.

A common task in many machine learning exercises is to convert the data from the format it is captured in (raw data) into the format required by the machine learning algorithms. Most ML tools will either have functionality built into the algorithms to do this automatically or will provide functions to allow you to manage this process yourself.

In HiveMall we have the ‘quantified_features’ function, which is used for transforming the values of non-numeric columns to indexed numbers. It has some unusual but useful features.

In this example I’ll use the titanic data set to illustrate the usage of this feature.

Here we have a mixture of categorical and numerical features.

select 
  quantified_features(
    ${output_row}, PassengerId, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked) as features
from (
  select * from titanic
  order by Passengerid asc
) t
limit 5;

and we get the following output:

[1.0,0.0,0.0,3.0,0.0,22.0,1.0,0.0,7.25,0.0,1.0]
[2.0,1.0,1.0,1.0,1.0,38.0,1.0,0.0,71.2833,1.0,2.0]
[3.0,1.0,1.0,3.0,1.0,26.0,0.0,0.0,7.9250,0.0,1.0]
[4.0,1.0,1.0,1.0,1.0,35.0,1.0,0.0,53.1,3.0,1.0]
[5.0,1.0,0.0,3.0,0.0,35.0,0.0,0.0,8.05,0.0,1.0]

The ordering within the attributes is important, and some thinking is needed if there is a defined order and you want this reflected in the outputs of the transformed features.

If you have a numeric field that you want treated as a categorical, and transformed, you can cast it into a string.

e.g.

cast(SibSp as string)
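
To see why the ordering matters, here is a minimal Python sketch of index encoding. It is not HiveMall's implementation: it simply assumes that each distinct value gets the next integer in order of first appearance, and the helper name index_categories and the sample values are made up for this illustration.

```python
def index_categories(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused index
        encoded.append(mapping[value])
    return encoded, mapping


# Hypothetical categorical column.
embarked = ["S", "C", "S", "S", "Q"]
encoded, mapping = index_categories(embarked)

# Same values in a different row order -> different codes.
encoded_rev, mapping_rev = index_categories(list(reversed(embarked)))

# A numeric column treated as categorical: convert to strings first,
# so the codes are labels rather than the original numbers.
sibsp = [1, 1, 0, 1, 0]
sibsp_encoded, sibsp_mapping = index_categories([str(v) for v in sibsp])

print(mapping, mapping_rev, sibsp_mapping)
```

Under this assumption the codes depend entirely on the order in which rows are read, so if the categories have a natural order you need to sort the input accordingly before encoding.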

Migrating Python ML Models to other languages

In a previous blog post I mentioned experiencing some performance issues with using Python ML in production. We needed something quicker, and the possible languages we considered were C, C++, Java and Go Lang.

But the data science team used R and Python, with just a few more people using Python than R on the team.

One option was to rewrite everything into the language used in production. As you can imagine, no-one wanted to do that, and there was no way of ensuring a bug free solution that gave similar results to the R and Python models. The other option was to look for some code to convert the models from one language to another.

The R users were well versed in using PMML. Predictive Model Markup Language (PMML) is an XML based interchange specification that has been around a long time, and it is well known and used by certain groups of data scientists who have been around a while. It is also widely supported by many analytics vendors, and provides an interchange format to allow predictive models to be described and exchanged. The newer people on the team hadn’t heard of it.

But with PMML there are some limitations. Not with the specification, but with how it is implemented by the various vendors that support it. PMML supports the exchange of the model pipeline, including the data transformations as well as the model specification. Most vendors only support some elements of this, and maybe just a couple of models. And therein lies the problem. How can an ML pipeline be migrated from, say, Python to some other language and/or tool?

If you do want to explore PMML with Python, check out the sklearn2pmml package, which is available on PyPI. This package allows you to export the ML pipeline and the model specification. As with most other implementations of PMML there are some parts of the PMML specification not implemented, but it is better than most of the other implementations out there.

An alternative is to look at code translation options. With these we want something that will take our ML pipeline and convert it to another programming language like C++, Java, Go, etc. There aren’t too many solutions available to do this. One such solution we’ve explored over the past couple of weeks is called m2cgen.

m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into native code (Python, C, Java, Go). You can supply m2cgen with a range of models (linear, SVM, tree, random forest, boosting, etc.) and the tool will output code in the chosen language that represents the trained model. The generated code is native code without dependencies: no other packages or libraries are required in the target language. For example, here is a Decision Tree translated into a number of different languages.

 

C

#include <string.h>
void score(double * input, double * output) {
    double var0[3];
    if ((input[2]) <= (2.6)) {
        memcpy(var0, (double[]){1.0, 0.0, 0.0}, 3 * sizeof(double));
    } else {
        if ((input[2]) <= (4.8500004)) {
            if ((input[3]) <= (1.6500001)) {
                memcpy(var0, (double[]){0.0, 1.0, 0.0}, 3 * sizeof(double));
            } else {
                memcpy(var0, (double[]){0.0, 0.3333333333333333, 0.6666666666666666}, 3 * sizeof(double));
            }
        } else {
            if ((input[3]) <= (1.75)) {
                memcpy(var0, (double[]){0.0, 0.42857142857142855, 0.5714285714285714}, 3 * sizeof(double));
            } else {
                memcpy(var0, (double[]){0.0, 0.0, 1.0}, 3 * sizeof(double));
            }
        }
    }
    memcpy(output, var0, 3 * sizeof(double));
}

Java

public class Model {

    public static double[] score(double[] input) {
        double[] var0;
        if ((input[2]) <= (2.6)) {
            var0 = new double[] {1.0, 0.0, 0.0};
        } else {
            if ((input[2]) <= (4.8500004)) {
                if ((input[3]) <= (1.6500001)) {
                    var0 = new double[] {0.0, 1.0, 0.0};
                } else {
                    var0 = new double[] {0.0, 0.3333333333333333, 0.6666666666666666};
                }
            } else {
                if ((input[3]) <= (1.75)) {
                    var0 = new double[] {0.0, 0.42857142857142855, 0.5714285714285714};
                } else {
                    var0 = new double[] {0.0, 0.0, 1.0};
                }
            }
        }
        return var0;
    }
}

Go Lang

func score(input []float64) []float64 {
    var var0 []float64
    if (input[2]) <= (2.6) {
        var0 = []float64{1.0, 0.0, 0.0}
    } else {
        if (input[2]) <= (4.8500004) {
            if (input[3]) <= (1.6500001) {
                var0 = []float64{0.0, 1.0, 0.0}
            } else {
                var0 = []float64{0.0, 0.3333333333333333, 0.6666666666666666}
            }
        } else {
            if (input[3]) <= (1.75) {
                var0 = []float64{0.0, 0.42857142857142855, 0.5714285714285714}
            } else {
                var0 = []float64{0.0, 0.0, 1.0}
            }
        }
    }
    return var0
}
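
Python

For comparison, the same tree can be written as plain Python. This version is a hand translation of the C, Java and Go code above, not output produced by m2cgen, so the layout may differ from what the tool would generate. The sample inputs are hypothetical.

```python
def score(features):
    """Return the class probabilities for one sample, using the same tree as above."""
    if features[2] <= 2.6:
        return [1.0, 0.0, 0.0]
    if features[2] <= 4.8500004:
        if features[3] <= 1.6500001:
            return [0.0, 1.0, 0.0]
        return [0.0, 0.3333333333333333, 0.6666666666666666]
    if features[3] <= 1.75:
        return [0.0, 0.42857142857142855, 0.5714285714285714]
    return [0.0, 0.0, 1.0]


# Hypothetical input with four features; only positions 2 and 3 are used by the tree.
sample = [5.1, 3.5, 1.4, 0.2]
probabilities = score(sample)
# The predicted class is the position of the largest probability.
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

A useful sanity check when migrating a model this way is to run the same set of inputs through the original model and the translated function and compare the outputs.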

 

Machine Learning with Go Lang

Recently I’ve been having a number of conversations with people in several countries about using Go Lang for machine learning. Most of these people have been struggling with using Python for machine learning and are looking for an alternative that will give them better performance. We have been experimenting with C++ and Go Lang to see what the performance differences are. Most of the differences are in the execution of the ML code, and everyone is very happy with the execution timings compared to Python.

But there is a flip side to this. Although we have faster execution timings, the coding effort is higher, with more lines of code and fewer libraries/packages to support the various ML tasks. But most of these can be easily coded ourselves.

We also looked at some frameworks for converting ML models developed in one language so they can be deployed in production using a different language. More on that in another post.

Overall the extra development work was considered worthwhile for the performance improvement and deployment gains.

Go Lang doesn’t really come with its own set of libraries/packages for ML, but it does have a number of packages that can be used to code up the necessary functions we need for our everyday ML needs.

But are there any Go Lang libraries/packages developed for ML, just like we have for the R language, etc.? The simple answer is yes, there are. But the number of these is small in comparison to R and Python, both of which are interpreted languages. Those available for Go are slowly growing.

Here is a list of the Go Lang libraries/packages that we examined and evaluated for these projects. Some are available from the Go Lang website/wiki and others are available on GitHub.

  • Anna – Artificial Neural Network Aspiration, aims to be self-learning and self-improving software.
  • bayesian – A naive bayes classifier.
  • Dialex – Dialex is a smart pipe that unscrambles text and makes it machine-readable.
  • Cloudforest – Ensembles of decision trees
  • ctw – Context Tree Weighting and Rissanen-Langdon Arithmetic Coding
  • eaopt – An evolutionary optimization library.
  • evo – a framework for implementing evolutionary algorithms in Go.
  • gobrain – Neural Networks
  • Go Learn – Machine Learning for Go
  • go-algs/maxflow – Maxflow (graph-cuts) energy minimization library.
  • go-graph – Graph library for Go/Golang language
  • go-galib – Genetic algorithms.
  • go-pr – Pattern recognition package in Go lang
  • golinear – Linear SVM and logistic regression.
  • go-mind – A neural network library built in Go
  • go_ml – Linear Regression, Logistic Regression, Neural Networks, Collaborative Filtering, Gaussian Multivariate Distribution.
  • go-ml-transpiler – An open source Go transpiler for machine learning models.
  • go-mxnet-predictor – Go binding for MXNet c_predict_api to do inference with pre-trained model.
  • gorgonia – Neural network primitives library (like Theano or Tensorflow but for Go)
  • go-porterstemmer – An efficient native Go clean room implementation of the Porter Stemming algorithm.
  • go-pr – Gaussian classifier.
  • ntm – Neural Turing Machines implementation
  • paicehusk – Go implementation of the Paice/Husk Stemmer
  • RF – Random forests implementation in Go
  • tfgo – Tensorflow + Go, the gopher way.

 

Machine Learning Tools and Workbenches

The following is a list of the most commonly used tools and workbenches for machine learning. These are specific to machine learning only. This list does not include any libraries or frameworks; these are tools and workbenches only. Most machine learning tool offerings will include the following features:

  • Easy drag and drop capabilities
  • Data collection
  • Data preparation and cleaning
  • Model building
  • Data Visualization
  • Model Deployment
  • Integration with other tools and languages

As more and more organizations implement machine learning, there are two core aims they want to achieve.

  1. Employee Productivity: Who wants to spend days or weeks writing mundane code to load data, clean data, etc.? No one wants to do this, and employers especially don’t want their staff wasting time on it. Instead they are happy to invest in tools and workbenches where a lot, most or all of these mundane tasks are automated for you. You can now concentrate on the important tasks of adding value to your organisation. This saves money, improves employee productivity and employee value.
  2. Integration with Technical Architecture: Many of these tools and workbenches allow for easy integration with the technical architecture, thereby allowing easy and quick integration of machine learning with the day to day activities of the organization. This saves money, improves employee productivity and employee value.

SAS

SAS software has been around forever and is the great grand-daddy of analytics and machine learning. They have built a large number of machine learning tools, and solutions built upon these for various industries. Their core machine learning tools include SAS Enterprise Miner and SAS Visual Data Mining and Machine Learning.

Microsoft

Microsoft have been improving their Machine Learning offering over the years and most of this is based on the Azure cloud platform with Microsoft Azure Machine Learning Studio and Azure Databricks.

SAP

SAP Leonardo is a cloud based platform for machine learning and supports tight integration with other SAP software.

Oracle

Oracle have a number of machine learning tools and support for the main machine learning languages. They have built a large number of applications (both cloud and on-premises) with in-built machine learning. Their main tools for machine learning include Oracle Data Miner, Oracle Machine Learning and Oracle Analytics (OAC or DVD versions).

Cloudera

If you work with Hadoop and big data then you are probably using Cloudera in some way. Cloudera have hired Hilary Mason as their GM of ML. By taking an “AI factory” approach to turning data into decisions, you can make the process of building, scaling, and deploying enterprise ML and AI solutions automated, repeatable, and predictable, boring even. Cloudera Data Science Workbench is their solution.

IBM

IBM have a number of machine learning tools, one of them being a long standing member of the machine learning community, SPSS Modeler. Other machine learning tools include Watson Studio, IBM Machine Learning for z/OS, and IBM Watson Explorer.

Google

Google have a large number of machine learning solutions, covering everything from traditional machine learning to NLP, image processing, video processing, etc. It’s a long list. Many of these come with various APIs to access these features. Most of these revolve around their Google AI Cloud offering. But sticking with the tools and workbenches, we have AI Platform Notebooks, Kubeflow, and BigQuery ML.

TensorBoard

TensorBoard is a suite of tools for graphical representation of different aspects and stages of machine learning in TensorFlow.

Amazon

A bit like Google, Amazon has a large number of solutions for machine learning and AI, and most of these are available via an API or some cloud service. Amazon SageMaker is their main service.

Looker

Looker connects directly with Google BQML. This reduces complexity for data scientists by eliminating the need to move the outputs of predictive models back into the database for use. It also improves the time-to-value for business users, allowing them to operationalize the outputs of predictive metrics to make better decisions every day.

Weka

Weka has been around for a long time and is still popular in some research groups. Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rules mining, and visualization.

RapidMiner

RapidMiner Studio has been around for a long time and is one of the few truly visual workflow tools (something everyone else should be doing).

Databricks

From the people who created Spark, we have another notebook solution for your machine learning projects called Databricks Workbench.

KNIME

KNIME Analytics Platform is the open source software for creating data science applications and services.

Dataiku

Dataiku Data Science Studio (DSS) is a collaborative data science software workflow platform enabling data exploration, prototyping and delivery of analytical and machine learning solutions.

 

I’ve not included the tools like R Studio and Notebooks in this list as they don’t really address the aims listed above. But you will notice a lot of the above solutions are really Jupyter Notebooks. Most of these vendors have a long way to go to make the tasks of machine learning boring.

This list does not cover all available tools and workbenches, but it does list the most common ones you will come across.

Data Sets for Analytics

When working with analytics, in whatever flavor, one of the key things you need is some data. Data comes in many different shapes and sizes, but where can you get some useful data, be it transactional, time-series, meta-data, analytical, master, categorical, numeric, regression, clustering, etc.?

Many of the popular analytics languages have some data sets built into them. For example the R language comes pre-loaded with data sets and these can be accessed using

data()

but many of the R packages also come with data sets.

Similarly, if you are using Python, many of the Python libraries have data sets built into them, for example scikit-learn.

from sklearn import datasets

But where else can you get data sets? There are lots and lots of websites available with data sets and the list could be very long. The following is a list of what I consider the websites with the best data sets.

Kaggle

Amazon Open Data

UCI Machine Learning Repository

Google Search Engine

Google Open Images Data

Google Finance

Microsoft Open Data

Awesome Public Datasets Collection

EU Open Data

US Government Data

US Census Bureau

Ireland Open Data

Northern Ireland Public Open Data

UK Open Data

Image Processing Data

Carnegie Mellon University Data Sets

World Bank Open Data

IMF Open Data

Movie Reviews Data Set

Amazon Reviews

Amazon public data sets

IMDb Datasets