Machine Learning

Always watching, always listening. Be careful with your data

Posted on Updated on

The saying ‘Big Brother is Watching’ has been around a long time and typically gets associated with government organisations. But over the past few years we have a few new Big Brothers appearing. These are in the form of Google and Facebook and a few others.

These companies gather lots and lots. Some companies gather enormous amounts of data. This data will include details of your interactions with the companies through various websites, applications, etc. But some are gathering data in ways that you might not be aware. For example, take this following video. Data is being gathered about what you do and where you go even if you have disconnected your phone.

Did you know this kind of data was being gathered about you?

Just think of what they could be doing with that data, that data you didn’t know they were gathering about you. Companies like these generate huge amounts of income from selling advertisements and the more data they have about individuals the more the can understand what they might be interested. The generate customer profiles and sell expensive advertising based on having these very detailed customer profiles.

But it doesn’t stop there. Recently Google bought Fitbit. Just think about what they can do now. Combining their existing profiles of you as a person with you activities throughout every day, week and month. Just think about how various health and insurance companies would love to have this data. Yes they would and companies like Google would be able to charge these companies even more money for this level of detail on individuals/customers.

But it doesn’t stop there. There have been lots of reports of various apps sharing health and other related data with various companies, without their customers being aware this is happening.

What about Google Assistant? In a recent article by MIT Technology Review title Inside Amazon’s plan for Alexa to run your entire life, they discuss how Alexa can be used to control virtually everything. In this article Alexa’s cheif scientist say “plan is for the voice assistant to move from passive to proactive interactions. Rather than wait for and respond to requests, Alexa will anticipate what the user might want. The idea is to turn Alexa into an omnipresent companion that actively shapes and orchestrates your life. This will require Alexa to get to know you better than ever before.”  When combined with other products this will allow “these new products let Alexa listen to and log data about a dramatically larger portion of your life“.

Just imagine if Google did the same with their Google Assistant!  Big Brother isn’t just Watching, they are also Listening!

There has been some recent report of Google looking to get into Banking by offering checking accounts. The project, code-named Cache, is due to launch in 2020. Google has partnered with Citigroup and a credit union at Stanford University, which will administer the accounts. Users will be able to access their accounts through Google’s digital payment platform, Google Pay.

And there are the reports of Google having access to the health records of over 50 million people. In addition to this, Google has signed a deal with Ascension, the second-largest hospital system in the US, to collect and analyze millions of Americans’ personal health data. Ascension operates in 150 hospitals in 21 states.

What if they also had access to your banking details and spending habits? Google is looking at different options to extend financial products from the google pay into more main stream banking. There has been some recent report of them looking at offering current accounts.

I won’t go discussing their attempts at Ethics and their various (failed) attempts at establishing and Ethics Advisory Board. This has been well documented elsewhere.

Things are getting a bit scary and the saying ‘Big Brother is Watching You’, is very, very true.

In the ever increasing connected world, all of us have a responsibility to know what data companies are gathering on us. We need to decide how comfortable we are with this and if you aren’t then you need to take steps to ensure you protect yourself. Maybe part of this protection requires us to become less connected, stop using some apps, turn off more notification, turn off updates, turn off tracking, etc

While taking each product or offering individually, it may seem ok to us for Google and other companies to offer such services and to analyze our data to provide a better service. But for most people the issues arise when each of these products start to be combined. By doing this they get to have greater access and understanding our our data and our behaviors. What role does (digital) ethics play in all of this? This is something for the company and the employees to decide where things should stop. But when/how do you decide this? when do you/they know things have gone too far? how can you undo some of this work to go back to an acceptable level? what is an acceptable level and how do you define this?

As yo can see there are lots of things to consider and a vital component is the role of (digital) ethics. All organizations who process and analyze data need to have an ethics board and ethics needs to be a core part of every project. To support this everyone needs more training and awareness of ethics and what is acceptable or not.

Demographics vs Psychographics for Machine Learning

Posted on Updated on

When preparing data for data science, data mining or machine learning projects you will create a data set that describes the various characteristics of the subject or case record. Each attribute will contain some descriptive information about the subject and is related to the target variable in some way.

In addition to these attributes, the data set will be enriched with various other internal/external data to complete the data set.

Some of the attributes in the data set can be grouped under the heading of Demographics. Demographic data contains attributes that explain or describe the person or event each case record is focused on. For example, if the subject of the case record is based on Customer data, this is the “Who” the demographic data (and features/attributes) will be about. Examples of demographic data include:

  • Age range
  • Marital status
  • Number of children
  • Household income
  • Occupation
  • Educational level

These features/attributes are typically readily available within your data sources and if they aren’t then these name be available from a purchased data set.

Additional feature engineering methods are used to generate new features/attributes that express meaning is different ways. This can be done by combining features in different ways, binning, dimensionality reduction, discretization, various data transformations, etc. The list can go on.

The aim of all of this is to enrich the data set to include more descriptive data about the subject. This enriched data set will then be used by the machine learning algorithms to find the hidden patterns in the data. The richer and descriptive the data set is the greater the likelihood of the algorithms in detecting the various relationships between the features and their values. These relationships will then be included in the created/generated model.

Another approach to consider when creating and enriching your data set is move beyond the descriptive features typically associated with Demographic data, to include Pyschographic data.

Psychographic data is a variation on demographic data where the feature are about describing the habits of the subject or customer.  Demographics focus on the “who” while psycographics focus on the “why”. For example, a common problem with data sets is that they describe subjects/people who have things in common. In such scenarios we want to understand them at a deeper level. Psycographics allows us to do this. Examples of Psycographics include:

  • Lifestyle activities
  • Evening activities
  • Purchasing interests – quality over economy,  how environmentally concerned are you
  • How happy are you with work, family, etc
  • Social activities and changes in these
  • What attitudes you have for certain topic areas
  • What are your principles and beliefs

The above gives a far deeper insight into the subject/person and helps to differentiate each subject/person from each other, when there is a high similarity between all subjects in the data set. For example, demographic information might tell you something about a person’s age, but psychographic information will tell you that the person is just starting a family and is in the market for baby products.

I’ll close with this. Consider the various types of data gathering that companies like Google, Facebook, etc perform. They gather lots of different types of data about individuals. This allows them to build up a complete and extensive profile of all activities for individuals. They can use this to deliver more accurate marketing and advertising. For example, Google gathers data about what places to visit throughout a data, they gather all your search results, and lots of other activities. They can do a lot with this data. but now they own Fitbit. Think about what they can do with that data and particularly when combined with all the other data they have about you. What if they had access to your medical records too!  Go Google this ! You will find articles about them now having access to your health records. Again combine all of the data from these different data sources. How valuable is that data?

 

Oracle ADW how to load new OML notebooks

Posted on Updated on

Oracle Autonomous Database (ADW) has been out a while now and have had several, behind the scenes, improvements and new/additional features added.

If you have used the Oracle Machine Learning (OML) component of ADW you will have seen the various sample OML Notebooks that come pre-loaded. These are easy to open, use and to try out the various OML features.

Screenshot 2019-07-29 13.07.01

The above image shows the top part of the login screen for OML. To see the available sample notebooks click on the Examples icon. When you do, you will get the following sample OML Notebooks.

Screenshot 2019-07-29 13.08.44

But what if you have a notebook you have used elsewhere. These can be exported in json format and loaded as a new notebook in OML.

To load a new notebook into OML, select the icon (three horizontal line) on the top left hand corner of the screen. Then select Notebooks from the menu.

Screenshot 2019-07-29 13.11.41           Screenshot 2019-07-29 13.21.07

Screenshot 2019-07-29 13.21.49

Then select the Import button located at the top of the Notebooks screen. This will open a File window, where you can select the json file from your file system.

Screenshot 2019-07-29 13.24.58

A couple of seconds later the notebook will be available and listed along side any other notebooks you may have created.

Screenshot 2019-07-29 13.26.13

All done!

You have now imported a new notebook into OML and can now use it to process your data and perform machine learning using the in-database features.

 

 

 

 

 

 

Machine Learning on Mobile Devices

Posted on Updated on

You: What? You can’t be serious?  Machine Learning on Mobile Devices?

Me: The simple answer is ‘Yes you can!”

You: But, what about all the complex data processing, CPU or GPU, and everything else that is needed for machine learning?

Me: Yes you are correct, those things might not be needed. What’s the answer to everything in IT?

You: It Depends ?

Me: Exactly. Yes It Depends on what you are doing. In most cases you don’t need large amounts of machine processing power to do machine learning. Except if you are doing image processing. Then you do need a bit of power to support that work.

You: But how can a mobile device be used for machine learning?

Screenshot 2019-07-19 14.24.22

Me: It Depends! 🙂  It depends on what you are doing. Most of the data processing power needed is for creating the models. That is what most people talk about. Very few people talk about the deployment of machine learning. Deployment, as in, using the machine learning models in your applications.

You: But why mobile devices? That sounds a bit silly?

Me: It does a bit. But when you think about it, how much do you use your mobile phone and tablet?  Where else have you seen mobile devices being used?

You: I use these all the time, to do nearly everything. Just like everyone else I know.

Me: Exactly!  and where else have you seen mobile devices being used?

You: Everywhere! hotels, bars, shops, hospitals, everywhere!

Me: Exactly. And it kind of makes sense to have machine learning scoring done at the point of capture of the data and not some hours or days or weeks later in some data warehouse or something else.

You: But what about the processing power of these devices. They aren’t powerful enough to run the machine learning models? Or are they?

Me: What is a machine learning model? In a simple way it is a mathematical formula of the data that calculates a particular outcome. Something that is a bit more complicated than using a sum function.  Could a mobile device do that easily?

You: Yes. That should be really easy and fast for mobile devices? But machine learning is complex. People keep telling me how complex it is and how difficult it is!

Me: True it can be, but for most problems it can be as simple as writing a few lines of code to create a model. 3-4 lines of code in some languages. But the applying of the the machine learning model can be a simple task (maybe 1 line of code), although some simple data formatting might be needed, but that is a simple task too.

You: So, how can a machine learning model be run on a mobile device?

Me: Programmers write code to run applications on mobile devices. This code can be extended to include the machine learning model. This can be used to score or label the data or do some other processing. A few lines of code.  A good alternative is to create a web service to all the remove scoring of the data.

You: The programming languages used for mobile development are a bit different to most other applications. Surely those mobile device languages don’t support machine learning.

Me: You’d be surprised by what’s available.

You: OK, What languages can I try? Where can I get started?

Me: Check out Firebase ML Kit, Apple CoreML and TensorFlow Lite. Those should be more than enough for you to get started with. There are a few others. But start with those.

You. Brilliant, thank you Brendan. I’ll let you know how I get on with those.

 

 

Embedding Transformation Data Pipeline into ML Model using Oracle Data Mining

Posted on Updated on

I’ve written several blog posts about how to use the DBMS_DATA_MINING.TRANSFORM function to create various data transformations and how to apply these to your data. All of these steps can be simple enough to following and re-run in a lab environment. But the real value with data science and machine learning comes when you deploy the models into production and have the ML models scoring data as it is being produced, and your applications acting upon these predictions immediately, and not some hours or days later when the data finally arrives in the lab environment.

It would be useful to be able to bundle all the transformations into the same process the create the model. The transformations and model become one, together.  If this is possible, then that greatly simplifies how the ML model can be deployed into production. It then becomes a simple function or REST call. We need to keep this simple (KISS).

Using the examples from my previous blog posts performing various data transformations, the following example shows how you can bundle these up into one defined set of transformations and then embed these transformations as part of the ML model. To do this we need to define a list of transformations. We can do this using:

xform_list            IN TRANSFORM_LIST DEFAULT NULL

Where TRANSFORM_LIST has the following structure:

TRANFORM_REC IS RECORD (
     attribute_name       VARCHAR2(4000),
     attribute_subname    VARCHAR2(4000),
     expression           EXPRESSION_REC,
     reverse_expression   EXPRESSION_REC,
     attribute_spec       VARCHAR2(4000));

You can use the DBMS_DATA_MINING.SET_TRANSFORM function to defined the transformations. The following example illustrates the transformation of converting the BOOKKEEPING_APPLICATION attribute from a number data type to a character data type.

DECLARE
   transform_stack   dbms_data_mining_transform.TRANSFORM_LIST;
BEGIN
   dbms_data_mining_transform.SET_TRANSFORM(transform_stack,
                                  'BOOKKEEPING_APPLICATION',
                                  NULL,
                                  'to_char(BOOKKEEPING_APPLICATION)',
                                  'to_number(BOOKKEEPING_APPLICATION)',
                                  NULL);
END;

Alternatively you can use the SET_EXPRESSION function and then create the transformation using it.

You can Stack the transforms together. Using the above example you could express a number of transformations and have these stored in the TRANSFORM_STACK variable. You can then pass this variable into your CREATE_MODEL procedure and have these transformations embedded in your ML model.

 

DECLARE
   transform_stack   dbms_data_mining_transform.TRANSFORM_LIST;
BEGIN
   -- Define the transformation list
   dbms_data_mining_transform.SET_TRANSFORM(transform_stack,
                                  'BOOKKEEPING_APPLICATION',
                                  NULL,
                                  'to_char(BOOKKEEPING_APPLICATION)',
                                  'to_number(BOOKKEEPING_APPLICATION)',
                                  NULL);

   -- Create the data mining model
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name           => 'DEMO_TRANSFORM_MODEL',
      mining_function      => dbms_data_mining.classification,
      data_table_name      => 'MINING_DATA_BUILD_V',
      case_id_column_name  => 'cust_id',
      target_column_name   => 'affinity_card',
      settings_table_name  => 'demo_class_dt_settings',
      xform_list           => transform_stack);
END;

My previous blog posts showed how to create various types of transformations. These transformations were then used to create a view of the data set that included these transformations. To embed these transformations in the ML Model we need to use the  STACK function. The following examples illustrate the stacking of the transformations created in the previous blog posts. These transformations are added (or stacked) to a transformation list and then added to the CREATE_MODEL function, embedding these transformations in the model.

 

DECLARE
   transform_stack   dbms_data_mining_transform.TRANSFORM_LIST;
BEGIN
   -- Stack the missing numeric transformations
   dbms_data_mining_transform.STACK_MISS_NUM (
          miss_table_name   => 'TRANSFORM_MISSING_NUMERIC',
          xform_list        => transform_stack);

   -- Stack the missing categorical transformations
   dbms_data_mining_transform.STACK_MISS_CAT (
          miss_table_name   => 'TRANSFORM_MISSING_CATEGORICAL',
          xform_list        => transform_stack);

   -- Stack the outlier treatment for AGE
   dbms_data_mining_transform.STACK_CLIP (
          clip_table_name   => 'TRANSFORM_OUTLIER',
          xform_list        => transform_stack);

   -- Stack the normalization transformation
   dbms_data_mining_transform.STACK_NORM_LIN (
          norm_table_name   => 'MINING_DATA_NORMALIZE',
          xform_list        => transform_stack);

   -- Create the data mining model
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name           => 'DEMO_STACKED_MODEL',
      mining_function      => dbms_data_mining.classification,
      data_table_name      => 'MINING_DATA_BUILD_V',
      case_id_column_name => 'cust_id',
      target_column_name   => 'affinity_card',
      settings_table_name => 'demo_class_dt_settings',
      xform_list           => transform_stack);
END;

To view the embedded transformations in your data mining model you can use the GET_MODEL_TRANSFORMATIONS function.

SELECT TO_CHAR(expression)
FROM TABLE (dbms_data_mining.GET_MODEL_TRANSFORMATIONS('DEMO_STACKED_MODEL'));

 

TO_CHAR(EXPRESSION)
--------------------------------------------------------------------------------
(CASE  WHEN (NVL("AGE",38.892)<18) THEN 18 WHEN (NVL("AGE",38.892)>70) THEN 70 E
LSE NVL("AGE",38.892) END -18)/52

NVL("BOOKKEEPING_APPLICATION",.880667)
NVL("BULK_PACK_DISKETTES",.628)
NVL("FLAT_PANEL_MONITOR",.582)
NVL("HOME_THEATER_PACKAGE",.575333)
NVL("OS_DOC_SET_KANJI",.002)
NVL("PRINTER_SUPPLIES",1)
(CASE  WHEN (NVL("YRS_RESIDENCE",4.08867)<1) THEN 1 WHEN (NVL("YRS_RESIDENCE",4.
08867)>8) THEN 8 ELSE NVL("YRS_RESIDENCE",4.08867) END -1)/7

NVL("Y_BOX_GAMES",.286667)
NVL("COUNTRY_NAME",'United States of America')
NVL("CUST_GENDER",'M')
NVL("CUST_INCOME_LEVEL",'J: 190,000 - 249,999')
NVL("CUST_MARITAL_STATUS",'Married')
NVL("EDUCATION",'HS-grad')
NVL("HOUSEHOLD_SIZE",'3')
NVL("OCCUPATION",'Exec.')

Transforming Outliers in Oracle Data Mining

Posted on Updated on

In previous posts I’ve shown how to use the DBMS_DATA_MINING.TRANSFORM function to transform data is various ways including, normalization and missing data. In this post I’ll build upon these to show how to outliers can be handled.

The following example will show you how you can transform data to identify outliers and transform them. In the example, Winsorsizing transformation is performed where the outlier values are replaced by the nearest value that is not an outlier.

The transformation process takes place in three stages. For the first stage a table is created to contain the outlier transformation data. The second stage calculates the outlier transformation data and store these in the table created in stage 1. One of the parameters to the outlier procedure requires you to list the attributes you do not the transformation procedure applied to (this is instead of listing the attributes you do want it applied to).  The third stage is to create a view (MINING_DATA_V_2) that contains the data set with the outlier transformation rules applied. The input data set to this stage can be the output from a previous transformation process (e.g. DATA_MINING_V).

BEGIN
   -- Clean-up : Drop the previously created tables
   BEGIN
      execute immediate 'drop table TRANSFORM_OUTLIER';
   EXCEPTION
      WHEN others THEN
         null;
   END;

   -- Stage 1 : Create the table for the transformations
   -- Perform outlier treatment for: AGE and YRS_RESIDENCE
   --
   DBMS_DATA_MINING_TRANSFORM.CREATE_CLIP (
      clip_table_name => 'TRANSFORM_OUTLIER');

   -- Stage 2 : Transform the categorical attributes
   --   Exclude the number attributes you do not want transformed
   DBMS_DATA_MINING_TRANSFORM.INSERT_CLIP_WINSOR_TAIL (
      clip_table_name => 'TRANSFORM_OUTLIER',
      data_table_name => 'MINING_DATA_V',
      tail_frac       => 0.025,
      exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                          'affinity_card',
                          'bookkeeping_application',
                          'bulk_pack_diskettes',
                          'cust_id',
                          'flat_panel_monitor',
                          'home_theater_package',
                          'os_doc_set_kanji',
                          'printer_supplies',
                          'y_box_games'));

   -- Stage 3 : Create the view with the transformed data
   DBMS_DATA_MINING_TRANSFORM.XFORM_CLIP(
      clip_table_name => 'TRANSFORM_OUTLIER',
      data_table_name => 'MINING_DATA_V',
      xform_view_name => 'MINING_DATA_V_2');
END;

The view MINING_DATA_V_2 will now contain the data from the original data set transformed to process missing data for numeric and categorical data (from previous blog post), and also has outlier treatment for the AGE attribute.

 

 

Examples of Machine Learning with Facial Recognition

Posted on Updated on

In a previous blog post I gave some examples of how facial images recognition and videos are being used in our daily lives. In this post I want to extend this with some additional examples. There are ethical issues around this and in some of these examples their usage has stopped. What is also interesting is the reaction on various social media channels about this. People don’t like it and and happen that some of these have stopped.

But how widespread is this technology? Based on these known examples, and this list is by no means anywhere near complete, but gives an indication of the degree of it’s deployment and how widespread it is.

Dubai is using facial recognition to measure customer satisfaction at four of the Roads and Transport Authority Customer Happiness Centers. They analyze the faces of their customers and rank their level of happiness. They can use this to generate alerts when the happiness levels falls below certain levels.

Screenshot 2019-05-20 10.48.39

Various department stores are using facial recognition throughout the stores and at checkout. These are being used to delivery personalized adverts to users on either in-store screen or on personalized screens on the shopping trolley. And can be used to verify a person’s age if they are buying alcohol or other products. Tesco’s have previously used face-scanning cameras at tills in petrol stations to target advertisements at customers depending on their age and approximate age.

Screenshot 2019-05-20 11.52.55

Some retail stores are using ML to monitor you, monitor what items you pick up and what you pay for at the checkout, identifying any differences and what steps to take next.

In a slight variation of facial recognition, some stores are using similar technology to monitor stock levels, monitor how people interact with different products (e.g pick up one product and then relate it with a similar product), and optimized location of products. Walmart has been a learner in the are of AI and Machine Learning in the retail section for some time now.

The New York Metropolitan Transport Authority has been using facial capture and recognition at several site across the city. Their proof of concept location was at the Robert F Kennedy Bridge. The company supplying the technology claimed 80% accuracy at predicting the person, through a widescreen while the car was traveling at low speed. These images can then be matched against government databases, such as driver license authorities, police databases and terrorist databases. The problem with this project was that it did not achieve one single positive match (within acceptable parameters) during the initial period of the project.

Screenshot 2019-05-20 11.10.56

There are some reports that similar technology is being use on the New York Subway system in Time Square to help with identifying fare dodgers.

How about using facial recognition at boarding gates for your new flight instead of showing your passport or other official photo id. JetBlue and other airlines are now using this technology. Some airports have been using this for many many years.

Screenshot 2019-05-20 11.16.47

San Francisco City government took steps in May 2019 to ban the use of facial recognition across all city functions. Other cities like Oakland and Sommerville in Massachusetts have implemented similar bans with other cities likely to follow. But it doesn’t ban the use by private companies.

Screenshot 2019-05-20 10.56.51

What about using this technology to automatically monitor and manage staff. Manage staff, as in to decide who should be fired and who should be reallocated elsewhere. It is reported that Amazon is using facial and other recognition systems to monitor staff productivity in their warehouses.

Screenshot 2019-05-20 11.00.20

A point I highlighted in my previous post was how are these systems/applications able to get enough images as training samples for their models. This is considering that most of the able systems/applications say they don’t keep any of the images they capture.

How many of us take pictures and post them on Facebook, Instagram, Snapchat, Twitter, etc. By doing this, you are making those images available to these companies to training their machine learning model. To do this they scrap the images for these sites and then have to manually label them with descriptive information. It is a combination of the image and descriptive information that is used by the machine learning algorithms to learn and build a model that suits their needs. See the MIT Technology Review article for more details and example on this topic.

Screenshot 2019-05-20 10.22.28

There are also reports of some mobile phone apps that turn on your mobile phone camera. The apps will detect if the phone is possibly mounted on the dashboard of a car, and then takes pictures of the inside of the car and also pictures of where you are driving. Similar reports exists about many apps and voice activated devices.

So be careful what you post on social media or anywhere else online, and be careful of what apps you have on your mobile phone!

There is a general backlash to the use of this technology, and with more people becoming aware of what is happening, we need to more aware of what when and where this technology is being used.

Transforming Missing Data using Oracle Data Mining

Posted on Updated on

In a previous post I showed how you can normalize data using the in-database machine learning feature using the DBMS_DATA_MINING.TRANSFORM function.  This same function can be used to perform many more data transformations with standardized routines. When it comes to missing data, where you have some case records where the value for an attribute is missing you have a number of options open to you. The first is to evaluate the degree of missing values for the attribute for the data set as a whole. If it is very high, you may want to remove that attribute from the data set. But in scenarios when you have a small number or percentage of missing values you will want to find an appropriate or an approximate value. Such calculations can involve the use of calculating the mean or mode.

To build this up using DBMS_DATA_MINING.TRANSFORM function, we need to follow a simple three stage process. The first stage creates a table that will contain the details of the transformations. The second stage defines and runs the transformation function to calculate the replacement values and finally, the third stage, to create the necessary records in the table created in the previous stage. These final two stages need to be followed for both numerical and categorical attributes. For the final stage you can create a new view that contains the data from the original table and has the missing data rules generated in the second stage applied to it. The following example illustrates these two stages for numerical and categorical attributes in the MINING_DATA_BUILD_V data set.

-- Transform missing data for numeric attributes
-- Stage 1 : Clean up, if previous run
--    transformed missing data for numeric and categorical
--    attributes.
BEGIN
   --
   -- Clean-up : Drop the previously created tables
   --
   BEGIN
      execute immediate 'drop table TRANSFORM_MISSING_NUMERIC';
   EXCEPTION
      WHEN others THEN
         null;
   END;

   BEGIN
      execute immediate 'drop table TRANSFORM_MISSING_CATEGORICAL';
   EXCEPTION
      WHEN others THEN
         null;
   END;

Now for stage 2 to define the functions to calculate the missing values for Numerical and Categorical variables.

-- Stage 2 : Perform the transformations
--    Exclude any attributes you don't want transformed
--      e.g. the case id and the target attribute

   --
   -- Transform the numeric attributes
   --
   dbms_data_mining_transform.CREATE_MISS_NUM (
      miss_table_name => 'TRANSFORM_MISSING_NUMERIC');

   dbms_data_mining_transform.INSERT_MISS_NUM_MEAN (
    miss_table_name => 'TRANSFORM_MISSING_NUMERIC',
    data_table_name => 'MINING_DATA_BUILD_V',
    exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                       'affinity_card',
                       'cust_id'));

   --
   -- Transform the categorical attributes
   --
   dbms_data_mining_transform.CREATE_MISS_CAT (
      miss_table_name => 'TRANSFORM_MISSING_CATEGORICAL');

   dbms_data_mining_transform.INSERT_MISS_CAT_MODE (
      miss_table_name => 'TRANSFORM_MISSING_CATEGORICAL',
      data_table_name => 'MINING_DATA_BUILD_V',
      exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST (
                         'affinity_card',
                         'cust_id'));
END;

When the above code completes the two transformation tables, TRANSFORM_MISSING_NUMERIC and TRANSFORM_MISSING_CATEGORICAL, will exist in your schema.

Querying these two tables shows the table attributes along with the value to be used to relate the missing value. For example the following illustrates the missing data transformations for the categorical data.

SELECT col, 
       val 
FROM transform_missing_categorical;

For the sample data set used in these examples we get.

COL                       VAL
------------------------- -------------------------
CUST_GENDER               M
CUST_MARITAL_STATUS       Married
COUNTRY_NAME              United States of America
CUST_INCOME_LEVEL         J: 190,000 - 249,999
EDUCATION                 HS-grad
OCCUPATION                Exec.
HOUSEHOLD_SIZE            3

For stage three you will need to create a new view (MINING_DATA_V). This combines the data from original table and the missing data rules generated in the second stage applied to it. This is built in stages with an initial view (MINING_DATA_MISS_V) created that merges the data source and the transformations for the missing numeric attributes. This view (MINING_DATA_MISS_V) will then have the transformations for the missing categorical attributes applied to create the a new view called MINING_DATA_V that contains all the missing data transformations.

BEGIN
   -- xform input data to replace missing values
   -- The data source is MINING_DATA_BUILD_V
   -- The output is MINING_DATA_MISS_V

   DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_NUM(
      miss_table_name => 'TRANSFORM_MISSING_NUMERIC',
      data_table_name => 'MINING_DATA_BUILD_V',
      xform_view_name => 'MINING_DATA_MISS_V');

   -- xform input data to replace missing values
   -- The data source is MINING_DATA_MISS_V
   -- The output is MINING_DATA_V
   DBMS_DATA_MINING_TRANSFORM.XFORM_MISS_CAT(
      miss_table_name => 'TRANSFORM_MISSING_CATEGORICAL',
      data_table_name => 'MINING_DATA_MISS_V',
      xform_view_name => 'MINING_DATA_V');
END;

You can now query the MINING_DATA_V view and see that the data displayed will not contain any null values for any of the attributes.

 

Examples of using Machine Learning on Video and Photo in Public

Posted on Updated on

Over the past 18 months or so most of the examples of using machine learning have been on looking at images and identifying objects in them. There are the typical examples of examining pictures looking for a Cat or a Dog, or some famous person, etc. Most of these examples are very noddy, although they do illustrate important examples.

But what if this same technology was used to monitor people going about their daily lives. What if pictures and/or video was captured of you as you walked down the street or on your way to work or to a meeting. These pictures and videos are being taken of you without you knowing.

And this raises a wide range of Ethical concerns. There are the ethics of deploying such solutions in the public domain, but there are also ethical concerns for the data scientists, machine learner, and other people working on these projects. “Just because we can, doesn’t mean we should”. People need to decide, if they are working on one of these projects, if they should be working on it and if not what they can do.

Ethics are the principals of behavior based on ideas of right and wrong. Ethical principles often focus on ideas such as fairness, respect, responsibility, integrity, quality, transparency and trust.  There is a lot in that statement on Ethics, but we all need to consider that is right and what is wrong. But instead of wrong, what is grey-ish, borderline scenarios.

Here are some examples that might fall into the grey-ish space between right and wrong. Why they might fall more towards the wrong is because most people are not aware their image is being captured and used, not just for a particular purpose at capture time, but longer term to allow for better machine learning models to be built.

Can you imagine walking down the street with a digital display in front of you. That display is monitoring you, and others, and then presents personalized adverts on the digital display aim specifically at you. A classify example of this is in the film Minority Report. This is no longer science fiction.

Screenshot 2019-05-10 14.12.55

This is happening at the Westfield shopping center in London and in other cities across UK and Europe. These digital advertisement screens are monitoring people, identifying their personal characteristics and then customizing the adverts to match in with the profile of the people walking past. This solutions has been developed and rolled out by Ocean Out Door. They are using machine learning to profile the individual people based on gender, age, facial hair, eye wear, mood, engagement, attention time, group size, etc. They then use this information to:

  1. Optimisation – delivering the appropriate creative to the right audience at the right time.
  2. Visualise – Gaze recognition to trigger creative or an interactive experience
  3. AR Enabled – Using the HD cameras to create an augmented reality mirror or window effect, creating deep consumer engagement via the latest technology
  4. Analytics – Understanding your brand’s audience, post campaign analysis and creative testing

Screenshot 2019-05-10 14.19.35.png

Face Plus Plus can monitor people walking down the street and do similar profiling, and can bring it to another level where by they can identify what clothing you are wearing and what the brand is. Image if you combine this with location based services. An example of this, imagine you are walking down the high street or a major retail district. People approach you trying to entice you into going into a particular store, and they offer certain discounts. But you are with a friend and the store is not interested in them.

Screenshot 2019-05-10 14.28.23

The store is using video monitoring, capturing details of every person walking down the street and are about to pass the store. The video is using machine/deep learning to analyze you profile and what brands you are wearing. The store as a team of people who are deployed to stop and engage with certain individuals, just because they make the brands or interests of the store and depending on what brands you are wearing can offer customized discounts and offers to you.

How comfortable would you be with this? How comfortable would you be about going shopping now?

For me, I would not like this at all, but I can understand why store and retail outlets are interested, as they are all working in a very competitive market trying to maximize every dollar or euro they can get.

Along side the ethical concerns, we also have some legal aspects to consider. Some of these are a bit in the grey-ish area, as some aspects of these kind of scenarios are slightly addresses by EU GDPR and the EU Artificial Intelligence guidelines. But what about other countries around the World. Then it comes to training and deploying these facial models, they are dependent on having a good training data set. This means they needs lots and lots of pictures of people and these pictures need to be labelled with descriptive information about the person. For these public deployments of facial recognition systems, then will need more and more training samples/pictures. This will allow the models to improve and evolve over time. But how will these applications get these new pictures? They claim they don’t keep any of the images of people. They only take the picture, use the model on it, and then perform some action. They claim they do not keep the images! But how can they improve and evolve their solution?

I’ll have another blog post giving more examples of how machine/deep learning, video and image captures are being used to monitor people going about their daily lives.

 

HiveMall: Transform Categorical features to Numerical

Posted on Updated on

HiveMall is a machine learning library that sits on top of Hive and provides SQL interface to wide range of data preparation and machine learning algorithms.

A common task faced for many machine learning exercises is to convert the data from the format it is captured in (raw data) into a format that is required by the machine learning algorithms. Most ML tools will either have functionality built into the algorithms to do this automatically or will provide functions to allow you to manage this process yourself.

In HiveMall we have the ‘quantified_features’ function and is used for transforming values of non-number columns to indexed numbers, but it does have some unusual but useful features.

In this example I’ll use the titanic data set to illustrate the usage of this feature.

Screenshot 2019-04-29 15.14.42

Here we have a mixture of features with categorical and numerical.

select 
  quantified_features(
    ${output_row}, PassengerId, Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Cabin, Embarked) as features
from (
  select * from titanic
  order by Passengerid asc
) t
limit 5;

and we get the following output

[1.0,0.0,0.0,3.0,0.0,22.0,1.0,0.0,7.25,0.0,1.0]
[2.0,1.0,1.0,1.0,1.0,38.0,1.0,0.0,71.2833,1.0,2.0]
[3.0,1.0,1.0,3.0,1.0,26.0,0.0,0.0,7.9250,0.0,1.0]
[4.0,1.0,1.0,1.0,1.0,35.0,1.0,0.0,53.1,3.0,1.0]
[5.0,1.0,0.0,3.0,0.0,35.0,0.0,0.0,8.05,0.0,1.0]

The ordering within the attributes is important, and some thinking is needed if there is a defined order and you want this reflected in the outputs of the transformed features

If you are a numeric field that you want treated as a categorical, and transformed, you can cast it into a string

e.g.

cast(SibSp as string)

Migrating Python ML Models to other languages

Posted on Updated on

I’ve mentioned in a previous blog post about experiencing some performance issues with using Python ML in production. We needed something quicker and the possible languages we considered were C, C++, Java and Go Lang.

But the data science team used R and Python, with just a few more people using Python than R on the team.

One option was to rewrite everything into the language used in production. As you can imagine no-one wanted to do that and there was no way of ensure a bug free solution and one that gave similar results to the R and Python models. The other option was to look for some code to convert the models from one language to another.

The R users was well versed in using PMML. Predictive Model Markup Language (PMML) has been around a long time and well known and used by certain groups of data scientists who have been around a while. It is also widely supported by many analytics vendors, and provides an inter-change format to allow predictive models to be described and exchanged. For newer people, they hadn’t heard of it. PMML is an XML based interchange specification.

But with PMML there are some limitation. Not with the specification but how it is implemented by the various vendors that support it. PMML supports the exchange of the model pipeline including the data transformations as well as the model specification. Most vendors only support some elements of this and maybe just a couple of models. And there-in lies the problem. How can a ML pipeline be migrated from, as Python, to some other language and/or tool. There are limitations.

If you do want to explore PMML with Python check out the sklearn2pmml package and is also available on PyPl. This package allows you to export the ML pipeline and the model specification. As with most other implementations of PMML there are some parts of the PMML specification not implement, but it is better than post of the other implementation out there.

An alternative is to look at code translations options. With these we want something that will take our ML pipeline and convert it to another programming language like C++, JAVA, Go, etc. There aren’t too many solutions available to do this. One such solution we’ve explored over the past couple of weeks is called m2cgen.

m2cgen (Model 2 Code Generator) is a lightweight library which provides an easy way to transpile trained statistical models into a native code (Python, C, Java, Go). You can supply M2cgen with a range of models (linear, SVM, tree, random forest, or boosting, etc) and the tool will output code in the chosen language that will represent the trained model. The code generated will generated into native code without dependencies. Other packages or libraries are not dependent or required in the translated language. For example here is an example Decision Tree translated into a number of different languages.

 

C

#include <string.h>
void score(double * input, double * output) {
    double var0[3];
    if ((input[2]) <= (2.6)) {
        memcpy(var0, (double[]){1.0, 0.0, 0.0}, 3 * sizeof(double));
    } else {
        if ((input[2]) <= (4.8500004)) {
            if ((input[3]) <= (1.6500001)) {
                memcpy(var0, (double[]){0.0, 1.0, 0.0}, 3 * sizeof(double));
            } else {
                memcpy(var0, (double[]){0.0, 0.3333333333333333, 0.6666666666666666}, 3 * sizeof(double));
            }
        } else {
            if ((input[3]) <= (1.75)) {
                memcpy(var0, (double[]){0.0, 0.42857142857142855, 0.5714285714285714}, 3 * sizeof(double));
            } else {
                memcpy(var0, (double[]){0.0, 0.0, 1.0}, 3 * sizeof(double));
            }
        }
    }
    memcpy(output, var0, 3 * sizeof(double));
}

Java

public class Model {

    public static double[] score(double[] input) {
        double[] var0;
        if ((input[2]) <= (2.6)) {
            var0 = new double[] {1.0, 0.0, 0.0};
        } else {
            if ((input[2]) <= (4.8500004)) {
                if ((input[3]) <= (1.6500001)) {
                    var0 = new double[] {0.0, 1.0, 0.0};
                } else {
                    var0 = new double[] {0.0, 0.3333333333333333, 0.6666666666666666};
                }
            } else {
                if ((input[3]) <= (1.75)) {
                    var0 = new double[] {0.0, 0.42857142857142855, 0.5714285714285714};
                } else {
                    var0 = new double[] {0.0, 0.0, 1.0};
                }
            }
        }
        return var0;
    }
}

Go Lang

func score(input []float64) []float64 {
    var var0 []float64
    if (input[2]) <= (2.6) {
        var0 = []float64{1.0, 0.0, 0.0}
    } else {
        if (input[2]) <= (4.8500004) {
            if (input[3]) <= (1.6500001) {
                var0 = []float64{0.0, 1.0, 0.0}
            } else {
                var0 = []float64{0.0, 0.3333333333333333, 0.6666666666666666}
            }
        } else {
            if (input[3]) <= (1.75) {
                var0 = []float64{0.0, 0.42857142857142855, 0.5714285714285714}
            } else {
                var0 = []float64{0.0, 0.0, 1.0}
            }
        }
    }
    return var0
}

 

Machine Learning with Go Lang

Posted on Updated on

Recently I’ve been having a number of conversations with people in several countries about using Go Lang for machine learning. Most of these people have been struggling with using Python for machine learning and are looking for an alternative that will give them better performance. We have been experimenting with C++ and Go Lang to see what the performance differences are. Most of these are with the execution of the ML code. This is great and everyone is very happy with execution timings, compared to Python.

But, there is a flip side to this. Although we have faster execution timings, there is a down side in that the coding effort is higher, with more lines of code and fewer libraries/packages to support the various ML tasks. But most of these can be easily coded ourselves .

We also looked at some frameworks for converting ML models developed in one language but deployed in production using a different language. More on that in another post.

Overall the extra development work was considered worthwhile for the performance improvement and deployment gains.

Go Lang doesn’t really come with it’s own set of libraries/packages for ML, but those have a number of these that can be used to code up the necessary functions we need for our everyday ML needs.

But are there any Go Lang libraries/packages developed for ML, just like we have for the R Language, etc?  The simple answer is YES we have. But the number of these is small in comparison to R and Python. Both of these languages are interpreted languages. But those available for Go are slowly growing.

Here is list of the Go Lang libraries/packages that we examined and evaluated for these projects. Some are available from the Go Lang website/wiki and others are available on Github.

  • Anna – Artificial Neural Network Aspiration, aims to be self-learning and self-improving software.
  • bayesian – A naive bayes classifier.
  • Dialex – Dialex is a smart pipe that unscrambles text and makes it machine-readable.
  • Cloudforest – Ensembles of decision trees
  • ctw – Context Tree Weighting and Rissanen-Langdon Arithmetic Coding
  • eaopt – An evolutionary optimization library.
  • evo – a framework for implementing evolutionary algorithms in Go.
  • gobrain – Neural Networks
  • Go Learn – Machine Learning for Go
  • go-algs/maxflow Maxflow (graph-cuts) energy minimization library.
  • go-graph – Graph library for Go/Golang language
  • go-galib – Genetic algorithms.
  • go-pr – Pattern recognition package in Go lang
  • golinear – Linear SVM and logistic regression.
  • go-mind – A neural network library built in Go
  • go_ml – Linear Regression, Logistic Regression, Neural Networks, Collaborative Filtering, Gaussian Multivariate Distribution.
  • go-ml-transpiler – An open source Go transpiler for machine learning models.
  • go-mxnet-predictor – Go binding for MXNet c_predict_api to do inference with pre-trained model.
  • gorgonia – Neural network primitives library (like Theano or Tensorflow but for Go)
  • go-porterstemmer – An efficient native Go clean room implementation of the Porter Stemming algorithm.
  • go-pr – Gaussian classifier.
  • ntmNeural Turing Machines implementation
  • paicehusk – Go implementation of the Paice/Husk Stemmer
  • RF – Random forests implementation in Go
  • tfgo – Tensorflow + Go, the gopher way.