Data Science

Listed in 2 categories of “Who’s Who in Data Science & Machine Learning?”

I’ve been notified that I’m listed in the “Who’s Who in Data Science & Machine Learning?” lists created by Onalytica, and in not just one category but two. These are:

  • Key Opinion Leaders discussing Data Science & Machine Learning
  • Big Data

This is what Onalytica says about their report and how the list for each category was put together:

“The influential experts are selected using Onalytica’s 4 Rs methodology (Reach, Resonance, Relevance and Reference). Quantitative data is pulled through LinkedIn, Twitter, Personal Blogs, YouTube, Podcast, and Forbes channels, and our qualitative data is pulled by our insights and analytics team, capturing offline influence.”

“All the influential experts featured are categorised by influencer persona, the sector they work in, their role within that sector, and more from our curated database of 1m+ influencers.”

“Our Who’s Who lists are created using the Onalytica platform which has a curated database of over 1 million influencers. Our platform allows you to discover, validate and categorise influencers quickly and easily via keyword searches. Our lists are made using carefully created Boolean queries which then rank influencers by resonance, relevance, reach and reference, meaning influencers are not only ranked by themselves, but also by how much other influencers are referring to them. The lists are then validated, and filters are used to split the influencers up into the categories that are seen in the list.”

Check out the full report on “Who’s Who in Data Science & Machine Learning?”.

AutoML, what is it good for? It Depends!

Automated Machine Learning (AutoML) seems to be everywhere, and every Analytics product and SaaS offering seems to have some element of AutoML built in. Part of the reason for this is that most of the market analysts, such as Gartner, have been rating Machine Learning (ML) products and services based on them having an AutoML feature.

One of the claimed benefits of AutoML is that it will automatically generate an ML model for you, without you having to worry about any of the technical details or the various statistical tests used to measure whether the model is useful. This kind of message has resulted in lots and lots of articles talking about the death of the Data Scientist, as they are no longer needed. We must remember ML is only one of the tools and skills of the data scientist.

This can all sound great. No need to hire these expensive data scientists, I can just use this AutoML software to create a ML model, for my data, and life will be good with all these wonderful predictions. Just think of the money I’ll be making and saving!

Where the fun comes into all of this is when someone issues legal proceedings based on what one of these AutoML models has predicted. The AutoML has made an incorrect prediction. The problem you now face, probably in court, is trying to justify the prediction by saying the machine/computer/algorithm made it, and you have no idea how or what it is doing to make the prediction. Good luck in a court explaining that to a judge and/or jury. Be prepared to hand over lots of money.

What is missing is the human in the loop, and in most cases this will be the data scientist or machine learning engineer (or someone else with a really cool job title). Part of their job is to evaluate lots of different models for your data (remember they will create lots and lots of models and not just one!), determine (from experimentation) what algorithms work best with your data and problem, optimize these models and assess the impact of changing hyperparameters, look at how these ML models are behaving, check whether there are any biases in the model or data, use a wide variety of statistical tests to assess the models, examine how the model works with different sub-parts of the data (customers), look at any potential legal and legislative issues, not just in one geographic region but across many disparate regions, all of which have different legal requirements, etc.
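One of the checks listed above, examining how a model behaves on different sub-parts of the data, can be sketched in a few lines of Python. This is only an illustration: the segment names and the scored records are made up, and the function name `accuracy_by_segment` is hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(records):
    """Return {segment: accuracy} for (segment, actual, predicted) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, actual, predicted in records:
        totals[segment] += 1
        if actual == predicted:
            correct[segment] += 1
    return {seg: correct[seg] / totals[seg] for seg in totals}

# Made-up scored records: (customer segment, actual label, predicted label)
scored = [
    ("retail", 1, 1), ("retail", 0, 0), ("retail", 1, 1), ("retail", 0, 0),
    ("business", 1, 0), ("business", 0, 0), ("business", 1, 0), ("business", 1, 1),
]

segment_accuracy = accuracy_by_segment(scored)
for segment, acc in sorted(segment_accuracy.items()):
    print(f"{segment}: {acc:.2f}")
```

In this made-up data the overall accuracy is 75%, which hides the fact that one segment is predicted perfectly while the other is right only half the time. That is the kind of thing a human in the loop would want to spot.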

As you can see, there are many additional tasks beyond the ML steps needed to create, verify and select an ML model to use. All of this is before you look at how it can be deployed in your production systems/architecture and at building out your MLOps.

One important characteristic of having the human in the loop is Explainability: explainability of the process followed, what models were produced, the effect of tuning and optimizing, possible biases and mitigating steps, etc. The list goes on and on. This is the role of the data scientist, and now it might look like a good idea to hire a good data scientist who understands all of this.

Taking a little step back, AutoML is kind of a cool feature/tool. A lot of the main steps of creating all those ML models, tuning them, evaluating them, etc. can be very boring work. You do the same steps for each model, then do it all over again for the next, and so on for the tens or hundreds of models you will be creating. Most data scientists will have scripts in their toolbox (built up from their experience) to automatically perform all of these steps and output the results. I mentioned the word experience in the last sentence; it can take a bit of time to build up to this. The AutoML products will do all of this automatically for you, hence the claim that you don’t have to hire a data scientist to do it (see what I said above about this).
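To give a feel for the kind of toolbox script described above, here is a deliberately tiny Python sketch of the "same steps for every model" loop. Everything in it is an assumption made for illustration: the data is synthetic, and the "models" are toy threshold rules with nothing to fit, standing in for real algorithms and hyperparameter settings.

```python
import random

def make_threshold_model(threshold):
    """A toy 'model': predict 1 when the single feature exceeds the threshold."""
    return lambda x: 1 if x > threshold else 0

def accuracy(model, rows):
    """Fraction of (feature, label) rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

# Synthetic data (illustrative only): the label is 1 when the feature is above 0.6.
rng = random.Random(42)
data = [(x, 1 if x > 0.6 else 0) for x in (rng.random() for _ in range(200))]
train, holdout = data[:150], data[150:]

# The repetitive part: the same steps for every candidate, with results kept for review.
results = []
for threshold in (0.2, 0.4, 0.6, 0.8):
    model = make_threshold_model(threshold)
    results.append({
        "threshold": threshold,
        "train_acc": accuracy(model, train),
        "holdout_acc": accuracy(model, holdout),
    })

# Best candidate first, but every candidate stays visible for inspection.
results.sort(key=lambda r: r["holdout_acc"], reverse=True)
for r in results:
    print(r)
```

Keeping the full results table, and not just the winner, is what lets the data scientist explain afterwards why one candidate was chosen over the others.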

I mentioned above some of the challenges and the need to keep a human in the loop. AutoML can be seen as another tool to assist the data scientist, not to replace them. It can be used to help the data scientist work towards identifying which ML models to use. But this can be a bit of a challenge, and it depends on what product or library you use. Some AutoML solutions act as a black box, kind of like the image at the top of this post. These are simple to use, but the drawback is there is no explainability, and no way for the data scientist to really assess what is happening at each step. Other AutoML products/solutions allow you to inspect and monitor what is happening at each step within AutoML. The diagram given above is one example of this. This allows for the human in the loop and allows for explainability. If the data scientist sees some unusual direction being taken by AutoML, they can see where and why this is happening and can take corrective action. AutoML isn’t a black box in this scenario.

As I mentioned above, AutoML can be another tool for the data scientist to use. Look on AutoML as a quick way to see what might be possible. The data scientist can use the information from each step of AutoML to guide them towards creating a more suitable and usable ML model, and do so in perhaps a slightly shorter space of time.

Going back to the title of the post ‘AutoML, what is it good for?’, the answer really is ‘It Depends!’, but if you do use it, be careful how you use the models and results beyond doing some simple investigation. And be careful of product offerings saying you don’t need anything else.

Setting up Julia to work with Oracle Database

For Data Science projects the top three languages every data scientist and machine learning practitioner knows are Python, R and SQL. The ranking or order of importance of these is a matter of some debate, and the real answer is, ‘It Depends’. But one thing is for sure: no matter what your environment, SQL skills will be needed, because that’s where the data lives, in the various databases of the organization. No matter what the database is, SQL is the way to access and analyze the data efficiently. For Python and R, the popularity of these languages really depends on the project team and their background. Deciding between the two can come down to flipping a coin. But everyone has their favorite!

A new (or not so new) language for data science and machine learning is Julia. Actually it has been around for a while now, as work began on it in 2009, whereas R (and S) and Python have their beginnings back in the 1980’s and early 1990’s. Does that make them legacy programming languages? Or did they just take a bit of time to mature and gain popularity?

There are lots of advantages to Julia, just like there are lots of advantages with the other languages. The following diagram illustrates one of the core advantages of Julia: it isn’t an interpreted language like R and Python but is compiled just-in-time, which means Julia can be significantly faster, yet it still allows interactive development using Notebooks, just like R and Python. Julia was designed and built for data science and machine learning, and is designed for scale, which makes it a good fit for MLOps. The list of advantages and differences could go on a bit, and those are not the point of this post.

The remainder of this post will step through what is needed to get Julia working with an Oracle Database, assuming you have installed Julia and set up an IDE. Check out the Julia website for excellent installation instructions and for help selecting an IDE. If you are coming from an R and/or Python background, using Jupyter Notebooks is a good option, and as you become more experienced there are a number of more advanced IDEs available for you to use.

If you have done a new install of Julia, make sure to add the install directory to the search PATH.

First, download and install Oracle Instant Client. This is needed by the Julia packages to communicate with Oracle Database. After installing, make sure to set up the following in your environment (environment variables and Path):

  • ORACLE_HOME : points to where you installed Oracle Instant Client
  • TNS_ADMIN : points to the directory containing the wallet/tnsnames files. This will be a sub-directory in Oracle Instant Client directory, for example, it points to  …/instantclient_19_8/network/admin
  • PATH : include the Oracle Instant Client install directory in the PATH.
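A quick way to confirm these three settings before going any further is a small check script. The sketch below uses only the Python standard library; the function name `check_oracle_env` is hypothetical, and comparing PATH entries to ORACLE_HOME by exact string match is a simplification (it assumes ORACLE_HOME is the Instant Client install directory, as described in the first bullet).

```python
import os

def check_oracle_env(env=None):
    """Return a list of problems found in the Oracle client environment settings."""
    env = os.environ if env is None else env
    problems = []
    for name in ("ORACLE_HOME", "TNS_ADMIN"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    oracle_home = env.get("ORACLE_HOME")
    path_entries = env.get("PATH", "").split(os.pathsep)
    # Simplification: look for an exact match of ORACLE_HOME among the PATH entries.
    if oracle_home and oracle_home not in path_entries:
        problems.append("PATH does not include the ORACLE_HOME directory")
    return problems

# Check the current environment and report anything that looks wrong.
for problem in check_oracle_env() or ["Oracle client environment looks OK"]:
    print(problem)
```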

The next step is to set up the Oracle Client network files. Ask your DBA for the tnsnames.ora file or for the Wallet Zip file for your database. The Wallet Zip file is the most common approach. Unzip this Wallet file and copy the unzipped files to the TNS_ADMIN directory. See the second bullet point above for this (…/instantclient_19_8/network/admin).
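If you prefer to script the unzip-and-copy step, something like the following Python sketch would do it using only the standard library. The function name `unpack_wallet` and the wallet file name in the usage comment are hypothetical; substitute the file your DBA gave you.

```python
import os
import zipfile

def unpack_wallet(wallet_zip, tns_admin):
    """Extract every file in the wallet zip into the TNS_ADMIN directory."""
    os.makedirs(tns_admin, exist_ok=True)
    with zipfile.ZipFile(wallet_zip) as archive:
        archive.extractall(tns_admin)
        return sorted(archive.namelist())

# Usage (hypothetical file name; TNS_ADMIN must already be set):
#   for name in unpack_wallet("Wallet_mydb.zip", os.environ["TNS_ADMIN"]):
#       print("extracted", name)
```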

That’s all you need to do on the Oracle setup.  I’m assuming you have a username and password for the Oracle Database you will be using.

Now we can set up Julia to use the Oracle Instant Client software. It is important that you have set up the environment variables I’ve listed above.

There is an Oracle.jl package, developed by Felipe Noronha, which runs on top of Oracle Instant Client. To install this, load the Pkg package and then add the Oracle package. The following shows these commands and part of the output from the installation.

julia> using Pkg

julia> Pkg.add("Oracle")
Updating registry at `~/.julia/registries/General`
######################################################################## 100.0%
Resolving package versions...
Installed Reexport ──────────────────── v1.0.0
Installed libsodium_jll ─────────────── v1.0.18+1
Installed Compat ────────────────────── v3.25.0
Installed OrderedCollections ────────── v1.3.3
Installed WebSockets ────────────────── v1.5.9
Installed JuliaInterpreter ──────────── v0.8.8
Installed DataStructures ────────────── v0.18.9
Installed DataAPI ───────────────────── v1.5.1
Installed Requires ──────────────────── v1.1.2
Installed DataValueInterfaces ───────── v1.0.0
Installed Parsers ───────────────────── v1.0.15
Installed FlameGraphs ───────────────── v0.2.5
Installed URIs ──────────────────────── v1.2.0
Installed Colors ────────────────────── v0.12.6
Installed Oracle ────────────────────── v0.2.0
...
...
...
[7240a794] + Oracle v0.2.0
[bac558e1] ↑ OrderedCollections v1.3.2 ⇒ v1.3.3
[69de0a69] ↑ Parsers v1.0.12 ⇒ v1.0.15
[189a3867] ↑ Reexport v0.2.0 ⇒ v1.0.0
[ae029012] ↑ Requires v1.1.1 ⇒ v1.1.2
[3783bdb8] + TableTraits v1.0.0
[bd369af6] + Tables v1.3.2
[0796e94c] ↑ Tokenize v0.5.8 ⇒ v0.5.13
[5c2747f8] + URIs v1.2.0
[104b5d7c] ↑ WebSockets v1.5.2 ⇒ v1.5.9
[8f1865be] ↑ ZeroMQ_jll v4.3.2+5 ⇒ v4.3.2+6
[a9144af2] + libsodium_jll v1.0.18+1
Building Oracle → `~/.julia/packages/Oracle/CEOWz/deps/build.log`

julia>

You are now ready to load this Oracle package and use it to connect to an Oracle Database. Setting up a connection is really simple and in the following example I’m connecting to an ATP Database on Oracle Free Tier. The following sets up some variables, creates a connection, prints a statement and connection information and then closes the connection.

import Oracle

username="oml_user"
password="xxxxxxxxxxx"
dbname="yyyyyyyyyyyy"

conn = Oracle.Connection(username, password, dbname)

println("Connected")
println(conn)

Oracle.close(conn)

Job done 🙂

There is little additional connection information available. To test the connection a bit more let’s list what tables I have in my test/demo schema/user.

import Oracle

username="oml_user"
password="xxxxxxxxxxx"
dbname="yyyyyyyyyyyy"

conn = Oracle.Connection(username, password, dbname)

println("Tables")
println("--------------------")

Oracle.query(conn, "SELECT table_name FROM user_tables") do cursor
    for row in cursor
        # row values can be accessed using column name or position
        println( row["TABLE_NAME"] ) # same as row[1]
    end
end

println("")
println("...the end...")

Oracle.close(conn)

If you come from a Python background, the syntax will look familiar, which makes the move over to Julia an easier task.
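For comparison, here is roughly what the same table listing might look like in Python. This is a sketch under assumptions: it assumes the python-oracledb driver is installed, and how the wallet and network files are picked up depends on the driver mode and is not covered here. The helper names `list_table_names` and `connect_oracle` are hypothetical.

```python
def list_table_names(conn, query="SELECT table_name FROM user_tables"):
    """Run the query on any DB-API 2.0 connection and return the first column."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        return [row[0] for row in cursor]
    finally:
        cursor.close()

def connect_oracle(username, password, dbname):
    """Open a connection using python-oracledb (assumed to be installed)."""
    import oracledb  # imported lazily so the helper above works without it
    return oracledb.connect(user=username, password=password, dsn=dbname)

# Usage, with the same placeholder credentials as the Julia example:
#   conn = connect_oracle("oml_user", "xxxxxxxxxxx", "yyyyyyyyyyyy")
#   for name in list_table_names(conn):
#       print(name)
#   conn.close()
```

The shape is the same as the Julia version: connect, run a query, loop over the rows, close the connection.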

One other difference is that the above code does seem to run a lot quicker in Julia. I haven’t measured it, and the difference is less than a second, but it is noticeable. For me, the above code generates the following output:

Tables
--------------------
WINE
BANK_ADDITIONAL_FULL
MINING_DATA_BUILD_V

...the end...

I’ll have additional posts looking at different aspects of, and commands for, working with and processing data in an Oracle Database.

2020 Books on Data Science and Machine Learning

2020 has been an interesting year. Not for the obvious topic, but for new books on Data Science and Machine Learning. The books listed below are some of my favorites from 2020. Making the selection was difficult. Some months had a large number of releases and some were a bit quieter. The books below are listed based on their release date and are not ranked in any way. I’ve included links to these books on Amazon (.com, .uk and .de).

January

Everyone wants to work in Data Science, but where and how do you start? Aimed at beginners, with guidance but without the technical detail. High level, so not for everyone.

amazon.com amazon.co.uk amazon.de

February

Taking ML to the next stage by creating AI applications. Shows how to do it, with examples across a number of areas.

amazon.com amazon.co.uk amazon.de

March

A guide for those wary of the impact of technology and for those who are enthusiastic about where AI is taking us.

amazon.com amazon.co.uk amazon.de

April

AI Ethics was one of the top topics for 2020. Covers the philosophical aspects along with the technical ones.

amazon.com amazon.co.uk amazon.de

May

Covers the life cycle of building ML applications, showing all that it entails and how ML plays a small part in the overall solution.

amazon.com amazon.co.uk amazon.de

June

Starting with the basics of NLP, it builds on these to cover how to include NLP in applications and how to use it in different industries and within project teams.

amazon.com amazon.co.uk amazon.de

July

Written by Thomas Davenport and others, this is a good addition to his other books. It consists of interviews, research and analysis on how to win with ML & AI.

amazon.com amazon.co.uk amazon.de

August

I was invited to contribute a couple of chapters to this book, along with well known names in the areas of DS, ML & AI.

amazon.com amazon.co.uk amazon.de

September

Building upon the success of their 1st edition, the 2nd edition comes with more examples and extra chapters.

amazon.com amazon.co.uk amazon.de

October

ML & AI are not perfect. Lots can go wrong, not just with the project but also with the implementation of the applications. Lots to think about and consider.

amazon.com amazon.co.uk amazon.de

November

No one really builds ML algorithms. We build ML solutions and applications. But what’s the best way to do this, from technical, organizational and ethical aspects?

amazon.com amazon.co.uk amazon.de

December

It was difficult to pick a book for this month. There were lots of new releases and I haven’t received all my orders at the time of this post.

Here is a book from July, and is related to an Automated Trading App I’ve been working on (and earning) for a couple of years.

amazon.com  amazon.co.uk  amazon.de

And to finish off the list I’m including this additional book. It wasn’t released this year; it was released in April 2018. It was a best seller on Amazon in 2018 and 2019! This was really exciting for us and we are still amazed at how well it is still selling in 2020. It is currently, as of December 2020, listed in 8th place on the MIT Press Best Sellers list. It won’t be making any best seller list in 2020, but is still proving popular with many readers. To all of you who have bought this book, I’d like to say Thank You, and I wish you all the best for 2021 and beyond.

Adding Text Processing to Classification Machine Learning in Oracle Machine Learning

One of the typical machine learning functions is Classification. This is in widespread use across most domains and geographic regions. I’ve written several blog posts over many, many years on how to do this using Oracle Machine Learning (OML) (formerly known as Oracle Advanced Analytics, and available in the Oracle Data Miner tool in SQL Developer). Just do a quick search of my blog to find some of these posts.

When it comes to Classification problems, the data set will typically contain your usual categorical and numerical variables/features. The Automatic Data Preparation (ADP) feature of OML automatically pre-processes and transforms these variables for input to the machine learning algorithm. This greatly reduces the boring work of the data scientist and increases their productivity.

But sometimes data sets come with text descriptions. These will contain product descriptions, free format text, and other descriptive data, for example product reviews. How can this information be included as part of the input data set to the machine learning algorithms? Oracle allows this kind of input data, and a little bit of setup is needed to tell Oracle how to process it. This uses the in-database feature of Oracle Text.

The following example walks through the steps needed to pre-process the text and include it as part of the input to the machine learning algorithm.

The data set: The data used to illustrate this, and to show the steps needed, is a data set from the Kaggle website. It contains 130K wine reviews, with attributes about each wine including country, region, number of points, price, etc., as well as a text description containing a review of the wine.

The following two files contain the DDL to create the table and a SQL script of INSERT statements to import the data set. Run them in your schema, in the order listed below.

  1. Create table WINEREVIEWS_130K_IMP
  2. Insert records into WINEREVIEWS_130K_IMP table

I’ll leave the Data Exploration to you to do and to discover some early insights.

The ML Question

I want to be able to predict whether a wine is a good quality wine, based on its price and other characteristics.

Data Preparation

To be able to answer this question, the first thing needed is to define a target variable that identifies good and bad wines. To do this, create a new attribute/feature called POINTS_BIN and populate it based on the number of points a wine has. If it has >=90 points it is a good wine; if it has <90 points it is a bad wine.
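As a quick illustration of where the boundary falls, here is the binning rule as a small Python function. It follows the UPDATE statements used in this post, which test for >= 90, so a wine with exactly 90 points ends up in the 'GT_90_Points' bin despite the label name. The function name `points_bin` is just for this sketch.

```python
def points_bin(points):
    """Map a numeric score to the label stored in POINTS_BIN."""
    return "GT_90_Points" if points >= 90 else "LT_90_Points"

# A few scores either side of the boundary.
for score in (85, 89, 90, 96):
    print(score, points_bin(score))
```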

ALTER TABLE WineReviews130K_bin ADD POINTS_BIN VARCHAR2(15);

UPDATE WineReviews130K_bin
SET POINTS_BIN = 'GT_90_Points'
WHERE winereviews130k_bin.POINTS >= 90;

UPDATE WineReviews130K_bin
SET POINTS_BIN = 'LT_90_Points'
WHERE winereviews130k_bin.POINTS < 90;

alter table WineReviews130K_bin DROP COLUMN POINTS;

The DESCRIPTION column data type needs to be changed to CLOB. This is to allow the Text Mining feature to work correctly.

-- add a new column of data type CLOB
ALTER TABLE WineReviews130K_bin ADD (DESCRIPTION_NEW CLOB);

-- update new column with data from the DESCRIPTION attribute
UPDATE WineReviews130K_bin SET DESCRIPTION_NEW = DESCRIPTION;

-- drop the DESCRIPTION attribute from table
ALTER TABLE WineReviews130K_bin DROP COLUMN DESCRIPTION;

-- rename the new attribute to replace DESCRIPTION
ALTER TABLE WineReviews130K_bin RENAME COLUMN DESCRIPTION_NEW TO DESCRIPTION;

Text Mining Configuration

There are a number of things we need to define for the Text Mining to work. These include a Lexer, a Stop Word list and preferences.

First define the Lexer to use. In this case we will use a basic one with basic settings.

BEGIN 
   ctx_ddl.create_preference('mylex', 'BASIC_LEXER'); 
   ctx_ddl.set_attribute('mylex', 'printjoins', '_-'); 
   ctx_ddl.set_attribute ( 'mylex', 'index_themes', 'NO'); 
   ctx_ddl.set_attribute ( 'mylex', 'index_text', 'YES'); 
END;

Next we can define a Stop Word List. Oracle Text comes with a predefined set of Stop Word lists for most of the common languages. You can add to one of those lists or create your own. Depending on the domain you are working in, it might be easier to create your own, and it is very straightforward to do. For example:

DECLARE
   v_stoplist_name varchar2(100);
BEGIN
   v_stoplist_name := 'mystop';
   ctx_ddl.create_stoplist(v_stoplist_name, 'BASIC_STOPLIST'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'nonetheless');
   ctx_ddl.add_stopword(v_stoplist_name, 'Mr'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'Mrs'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'Ms'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'a'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'all'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'almost'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'also'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'although'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'an'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'and'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'any'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'are'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'as'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'at'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'be'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'because'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'been'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'both'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'but'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'by'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'can'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'could'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'd'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'did'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'do'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'does'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'either'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'for'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'from'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'had'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'has'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'have'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'having'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'he'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'her'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'here'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'hers'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'him'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'his'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'how'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'however'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'i'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'if'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'in'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'into'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'is'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'it'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'its'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'just'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'll'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'me'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'might'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'my'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'no'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'non'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'nor'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'not'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'of'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'on'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'one'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'only'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'onto'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'or'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'our'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'ours'); 
   ctx_ddl.add_stopword(v_stoplist_name, 's'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'shall'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'she'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'should'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'since'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'so'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'some'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'still'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'such'); 
   ctx_ddl.add_stopword(v_stoplist_name, 't'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'than'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'that'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'the'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'their'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'them'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'then'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'there'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'therefore'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'these'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'they'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'this'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'those'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'though'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'through'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'thus'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'to'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'too'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'until'); 
   ctx_ddl.add_stopword(v_stoplist_name, 've'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'very'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'was'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'we'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'were'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'what'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'when'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'where'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'whether'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'which'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'while'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'who'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'whose'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'why'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'will'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'with'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'would'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'yet'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'you'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'your'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'yours'); 
   ctx_ddl.add_stopword(v_stoplist_name, 'drink');
   ctx_ddl.add_stopword(v_stoplist_name, 'flavors'); 
   ctx_ddl.add_stopword(v_stoplist_name, '2020');
   ctx_ddl.add_stopword(v_stoplist_name, 'now'); 
END;
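To get a feel for what the lexer and stop list do to a review before the algorithm sees it, here is a rough Python approximation. It is not how Oracle Text is implemented: the regular expression only loosely mimics the printjoins setting (keeping '_' and '-' inside tokens), the upper-casing of tokens is an assumption about the lexer's default behaviour, the stop word set is a small subset of the list above, and the sample review is made up.

```python
import re

# A small subset of the stop words added above (matched case-insensitively here).
STOP_WORDS = {"a", "and", "the", "of", "with", "this", "is", "now", "drink", "flavors"}

# Letters and digits form tokens; '_' and '-' are kept inside tokens,
# loosely mimicking the printjoins setting on the lexer.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[_-][A-Za-z0-9]+)*")

def tokenize(text, stop_words=STOP_WORDS):
    """Split text into upper-cased tokens and drop any stop words."""
    tokens = TOKEN_PATTERN.findall(text)
    return [t.upper() for t in tokens if t.lower() not in stop_words]

review = "This full-bodied wine is rich with black-cherry flavors. Drink now."
print(tokenize(review))
```

Domain words such as 'drink' and 'flavors' disappear along with the usual filler words, which is why it can be worth building your own list for the domain you are working in.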

Next define the preferences for processing the Text, for example what Stop Word list to use, if Fuzzy match is to be used and what language to use for this, number of tokens/words to process and if stemming is to be used.

BEGIN 
   ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
   ctx_ddl.set_attribute('mywordlist','FUZZY_MATCH','ENGLISH'); 
   ctx_ddl.set_attribute('mywordlist','FUZZY_SCORE','1'); 
   ctx_ddl.set_attribute('mywordlist','FUZZY_NUMRESULTS','5000'); 
   ctx_ddl.set_attribute('mywordlist','SUBSTRING_INDEX','TRUE'); 
   ctx_ddl.set_attribute('mywordlist','STEMMER','ENGLISH'); 
END;

And the final step is to piece it all together by defining a new Text policy.

BEGIN
   ctx_ddl.create_policy('my_policy', NULL, NULL, 'mylex', 'mystop', 'mywordlist');
END;

Define Settings for OML Model

We will create two models. An Attribute Importance model and a Classification model. The following defines the model parameters for each of these.

-- settings for the Attribute Importance model
CREATE TABLE att_import_model_settings (setting_name varchar2(30), setting_value varchar2(30));
INSERT INTO att_import_model_settings (setting_name, setting_value)
VALUES ('ALGO_NAME', 'ALGO_AI_MDL');
INSERT INTO att_import_model_settings (setting_name, setting_value)
VALUES ('PREP_AUTO', 'ON');
INSERT INTO att_import_model_settings (setting_name, setting_value)
VALUES ('ODMS_TEXT_POLICY_NAME', 'my_policy');
INSERT INTO att_import_model_settings (setting_name, setting_value)
VALUES ('ODMS_TEXT_MAX_FEATURES', '3000');

-- settings for the Classification (Random Forest) model
CREATE TABLE wine_model_settings (setting_name varchar2(30), setting_value varchar2(30));
INSERT INTO wine_model_settings (setting_name, setting_value)
VALUES ('ALGO_NAME', 'ALGO_RANDOM_FOREST');
INSERT INTO wine_model_settings (setting_name, setting_value)
VALUES ('PREP_AUTO', 'ON');
INSERT INTO wine_model_settings (setting_name, setting_value)
VALUES ('ODMS_TEXT_POLICY_NAME', 'my_policy');
INSERT INTO wine_model_settings (setting_name, setting_value)
VALUES ('ODMS_TEXT_MAX_FEATURES', '3000');

COMMIT;

Create the Training and Test data sets.

CREATE TABLE wine_train_data
AS SELECT id, country, description, designation, points_bin, price, province, region_1, region_2, taster_name, variety, title
FROM winereviews130k_bin 
SAMPLE (60) SEED (1);
CREATE TABLE wine_test_data
AS SELECT id, country, description, designation, points_bin, price, province, region_1, region_2, taster_name, variety, title
FROM winereviews130k_bin 
WHERE id NOT IN (SELECT id FROM wine_train_data);
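The idea behind these two statements, a seeded random sample of roughly 60% for training with everything else held back for testing, can be sketched in Python as follows. This only illustrates the logic; it will not pick the same rows as Oracle’s SAMPLE clause, and the ID values are a stand-in for the ID column.

```python
import random

def split_ids(ids, train_fraction=0.6, seed=1):
    """Bernoulli-sample ids into a training set; everything else is the test set."""
    rng = random.Random(seed)
    train = {i for i in ids if rng.random() < train_fraction}
    test = [i for i in ids if i not in train]
    return sorted(train), test

all_ids = list(range(1, 1001))  # stand-in for the ID column
train_ids, test_ids = split_ids(all_ids)
print(len(train_ids), len(test_ids))
```

Because each row is sampled independently, the training set is only approximately 60% of the rows, and fixing the seed makes the split repeatable.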

All the setup is done, so we can move on to creating the machine learning models.

Create the OML Model (Attribute Importance & Classification)

We are going to create two models. The first is an Attribute Importance model. This will look at the data set and determine which attributes contribute most towards determining the target variable. As we are incorporating Text Mining, we will also see which words/tokens from the DESCRIPTION attribute contribute towards the target variable.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'GOOD_WINE_AI',
      mining_function     => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
      data_table_name     => 'winereviews130k_bin',
      case_id_column_name => 'ID',
      target_column_name  => 'POINTS_BIN',
      -- must match the settings table created earlier
      settings_table_name => 'att_import_model_settings');
END;

We can query the system views for Oracle ML to find out what are the important variables.

SELECT * FROM dm$vagood_wine_ai 
ORDER BY attribute_rank;

Here is the listing of the top 15 most important attributes. We can see from the first 15 rows and looking under column ATTRIBUTE_SUBNAME, the words from the DESCRIPTION attribute that seem to be important and contribute towards determining the value in the target attribute.

At this point you might determine, based on domain knowledge, some of these words should be excluded as they are generic for the domain. In this case, go back to the Stop Word List and recreate it with any additional words. This can be repeated until you are happy with the list. In this example, WINE could be excluded by including it in the Stop Word List.

Run the following to create the Classification model. It is very similar to what we ran above with minor changes to the name of the model, the data mining function and the name of the settings table.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'GOOD_WINE_MODEL',
      mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
      data_table_name     => 'winereviews130k_bin',
      case_id_column_name => 'ID',
      target_column_name  => 'POINTS_BIN',
      settings_table_name => 'wine_model_settings');
END;

Apply OML Model

The model can be applied in similar ways to any other ML model created using OML. For example the following displays the wine details along with the predicted points bin values (good or bad) and the probability score (<=1) of the prediction.

SELECT id, price, country, designation, province, variety, points_bin, 
       PREDICTION(good_wine_model USING *) pred_points_bin,
       PREDICTION_PROBABILITY(good_wine_model USING *) prob_points_bin
FROM wine_test_data;

Exploring Database trends using Python pytrends (Google Trends)

Posted on

A little word of warning before you read the rest of this post. The examples shown below are just examples of what is possible. They aren’t very scientific or rigorous, so don’t come complaining if what is shown doesn’t match your knowledge and other insights. This is just a little fun to see what is possible. Yes, a more rigorous scientific study is needed, and some attempts at this can be seen at DB-Engines.com. Less scientific are the examples shown at the TOPDB Top Database index, which isn’t meant to be very scientific.

After all of that, here we go 🙂

pytrends is a library providing an API to Google Trends using Python. The following examples show some ways you can use this library, and the focus area I’ll be using is Databases. Many of you are already familiar with Google Trends; if it isn’t something you have looked at before, I’d encourage you to have a look at their website and give it a try. You don’t need to run Python to use it. For example, here are a couple of screenshots taken from the Google Trends website, comparing Relational Database to NoSQL Database. The information presented is based on the searches performed over the past 12 months. Some of the information is kind of interesting when you look at the related queries and also the distribution of countries.

To install pytrends use the pip command

pip3 install pytrends

As usual it will check the various dependent libraries and will update them where necessary. In my particular case, the only library it updated was pandas.

You do need to be careful about how many searches you perform, as you may be limited by Google rate limits. You can get around this by using a proxy, and there is an example of how to do this on the pytrends PyPI page.
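Another way to be gentle with the rate limits is to wait and retry when a request fails. The following is only a minimal sketch of that idea: call_with_backoff is a hypothetical helper name of my own, not part of pytrends, and it catches any exception rather than a specific pytrends error class.

```python
import time

def call_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises, wait base_delay * 2**attempt and try again.

    Hypothetical helper, not part of pytrends. Catching Exception is a
    simplification; narrow it to the error you actually see when limited.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, pass the error on
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...

# Example of wrapping a request (assumes a pytrends object with a payload built):
# df = call_with_backoff(lambda: pytrends.interest_over_time())
```

The sleep function is passed in as a parameter so the waiting can be swapped out when trying the helper without real delays.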

The following code illustrates how to import the library and set up an initial request. The pandas library is also loaded, as the data returned by the pytrends API comes back in a pandas dataframe. This makes it easy to format and explore the data.

import pandas as pd 
from pytrends.request import TrendReq

pytrends = TrendReq()

The pytrends API has about nine methods. For my example I’ll be using the following:

  • Interest Over Time: returns historical, indexed data for when the keyword was searched most as shown on Google Trends’ Interest Over Time section.
  • Interest by Region: returns data for where the keyword is most searched as shown on Google Trends’ Interest by Region section.
  • Related Queries: returns data for the related keywords to a provided keyword shown on Google Trends’ Related Queries section.
  • Suggestions: returns a list of additional suggested keywords that can be used to refine a trend search.

Let’s now explore these APIs using the Databases as the main topic of investigation and examining some of the different products. I’ve used the db-engines.com website to select the top 5 databases (as per date of this blog post). These were:

  • Oracle
  • MySQL
  • SQL Server
  • PostgreSQL
  • MongoDB

I will use this list to look for number of searches and other related information. First thing is to import the necessary libraries and create the connection to Google Trends.

import pandas as pd 
from pytrends.request import TrendReq

pytrends = TrendReq()

Next set up the payload and keep the timeframe for searches to the past 12 months only.

search_list = ["Oracle", "MySQL", "SQL Server", "PostgreSQL", "MongoDB"] #max of 5 values allowed
pytrends.build_payload(search_list, timeframe='today 12-m')

We can now look at the interest over time method to see the number of searches, based on a ranking where 100 is the most popular.

df_ot = pd.DataFrame(pytrends.interest_over_time()).drop(columns='isPartial')
df_ot
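If the weekly figures are too noisy, one option is to average them by month. Here is a minimal sketch using made-up numbers in a frame shaped like the one above (a date index with one column per search term). The df_demo name and all of the values are just for illustration.

```python
import pandas as pd

# Made-up weekly values standing in for the frame returned by interest_over_time()
dates = pd.date_range("2020-01-05", periods=8, freq="W")  # eight consecutive Sundays
df_demo = pd.DataFrame(
    {
        "Oracle": [80, 82, 78, 84, 90, 88, 86, 92],
        "MySQL": [60, 62, 58, 64, 70, 68, 66, 72],
    },
    index=dates,
)

# Group the weekly rows by calendar month and take the mean for each search term
monthly_avg = df_demo.groupby(df_demo.index.to_period("M")).mean()
print(monthly_avg)
```

The same groupby line should work on df_ot, as long as its index is a datetime index.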

To see a breakdown of these numbers on an hourly basis, you can use the get_historical_interest method.

pytrends.get_historical_interest(search_list)

Let’s move on to exploring the level of interest/searches by country. The following retrieves this information, orders it by Oracle (in descending order) and then selects the top 20 countries. Here we can see the relative number of searches per country. Note that these do not necessarily relate to the countries with the largest number of searches.

df_ibr = pytrends.interest_by_region(resolution='COUNTRY') # CITY, COUNTRY or REGION
df_ibr.sort_values('Oracle', ascending=False).head(20)
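Another way to read this kind of table is to ask which search term comes out on top in each country. The sketch below does this on a small made-up frame with the same shape as the result above (countries as the index, one column per search term). The numbers are invented for illustration and are not real Google Trends values.

```python
import pandas as pd

# Invented values, shaped like the interest_by_region() result
df_demo = pd.DataFrame(
    {
        "Oracle": [40, 20, 25],
        "MySQL": [30, 45, 35],
        "MongoDB": [30, 35, 40],
    },
    index=pd.Index(["Ireland", "India", "Brazil"], name="geoName"),
)

# For each country (row), find the column holding the largest value
top_term = df_demo.idxmax(axis=1)
print(top_term)

# How many countries each search term leads in
print(top_term.value_counts())
```

Running df_ibr.idxmax(axis=1) on the real data would give the leading database per country in the same way.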

Visualizing data is always a good thing to do, as we can see patterns and differences in the data in a clearer way. The following takes the above query and creates a stacked bar chart.

import matplotlib
from matplotlib import pyplot as plt

# set the figure size before plotting so that it applies to this chart
plt.rcParams["figure.figsize"] = [20, 8]

df2 = df_ibr.sort_values('Oracle', ascending=False).head(20)

df2.reset_index().plot(x='geoName', y=['Oracle', 'MySQL', 'SQL Server', 'PostgreSQL', 'MongoDB'], kind='bar', stacked=True, title="Searches by Country")

plt.xlabel("Country")
plt.ylabel("Ranking")

We can delve into the data more by focusing on one particular country and examining the Google searches by city or region. The following looks at the data for the USA and gives the rankings for the various states.

pytrends.build_payload(search_list, geo='US')
df_ibr = pytrends.interest_by_region(resolution='COUNTRY', inc_low_vol=True)

# keep the top 20 states, so the chart uses the US data and not the earlier country data
df2 = df_ibr.sort_values('Oracle', ascending=False).head(20)

plt.rcParams["figure.figsize"] = [20, 8]
df2.reset_index().plot(x='geoName', y=['Oracle', 'MySQL', 'SQL Server', 'PostgreSQL', 'MongoDB'], kind='bar', stacked=True, title="Searches for USA")

plt.xlabel("State")
plt.ylabel("Ranking")

 

We can find the top related queries and top queries that include the names of each database.

search_list = ["Oracle", "MySQL", "SQL Server", "PostgreSQL", "MongoDB"] #max of 5 values allowed
pytrends.build_payload(search_list, timeframe='today 12-m')

rq = pytrends.related_queries()
rq.values()

#display rising terms
rq.get('Oracle').get('rising')

We can see the top related rising queries for Oracle are about tik tok. No real surprise there!

and the top queries for Oracle included:

rq.get('Oracle').get('top')
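To compare the related queries across all five databases, it can help to stack them into one dataframe. This is a rough sketch on a made-up dictionary that mimics the structure used above (one entry per search term, each with 'top' and 'rising' parts). The queries and values are invented, and I’m assuming a part can be missing (None) when nothing was found.

```python
import pandas as pd

# Invented stand-in for the dictionary returned by related_queries()
rq_demo = {
    "Oracle": {
        "top": pd.DataFrame({"query": ["oracle sql", "oracle cloud"], "value": [100, 60]}),
        "rising": None,
    },
    "MySQL": {
        "top": pd.DataFrame({"query": ["mysql workbench"], "value": [100]}),
        "rising": None,
    },
}

frames = []
for keyword, parts in rq_demo.items():
    top = parts.get("top")
    if top is not None:  # skip search terms with no 'top' queries
        # tag each row with the search term it belongs to
        frames.append(top.assign(keyword=keyword))

all_top = pd.concat(frames, ignore_index=True)
print(all_top)
```

Swapping rq_demo for rq, and 'top' for 'rising', gives the equivalent tables for the real results.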

This was an interesting exercise to do. I didn’t show all the results, but when you explore the other databases in the list and see the results from those, and then compare them across the five databases you get to see some interesting patterns.

 

Principal Component Analysis (PCA) in Oracle

Posted on Updated on

Principal Component Analysis (PCA) is a statistical process used for feature or dimensionality reduction in data science and machine learning projects. It summarizes the features of a large data set into a smaller set of features by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. There are lots of resources that go into the mathematics behind this approach. I’m not going to go into that detail here; a quick internet search will get you what you need.

PCA can be used to discover important features from large data sets (large as in having a large number of features), while preserving as much information as possible.

Statistically, PCA finds lines, planes and hyper-planes in the K-dimensional space that approximate the data as well as possible in the least squares sense. A line or plane that is the least squares approximation of a set of data points makes the variance of the coordinates on the line or plane as large as possible.

Oracle has implemented PCA using Singular Value Decomposition (SVD) on the covariance and correlations between variables, for feature extraction/reduction. PCA is closely related to SVD. PCA computes a set of orthonormal bases (principal components) that are ranked by their corresponding explained variance. The main difference between SVD and PCA is that the PCA projection is not scaled by the singular values. The extracted features are transformed features consisting of linear combinations of the original features.

When machine learning is performed on this reduced set of transformed features, it can be completed with fewer resources and less time, while still maintaining accuracy.
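To make the link between PCA and SVD a little more concrete, here is a small NumPy sketch. It is not how Oracle implements it internally, just the textbook idea on made-up data: standardise the columns, run an SVD, and read the explained variance off the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: 200 rows, 3 features, where the third is almost a copy of the first
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + 0.05 * rng.normal(size=200)])

# Shift each column by its mean and scale by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardised data: Z = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Share of the total variance explained by each component (largest first)
explained = s**2 / np.sum(s**2)
print(explained.round(4))

# Project the data onto the principal components and keep the first two
pca_scores = Z @ Vt.T
reduced = pca_scores[:, :2]
```

Because the third feature is nearly redundant, the first two components carry almost all of the variance, so two transformed features can stand in for the original three.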

Algorithm Name in Oracle using

Mining Model Function = FEATURE_EXTRACTION

Algorithm = ALGO_SINGULAR_VALUE_DECOMP

(Hyper)-Parameters for algorithms

  • SVDS_U_MATRIX_OUTPUT : SVDS_U_MATRIX_ENABLE or SVDS_U_MATRIX_DISABLE
  • SVDS_SCORING_MODE : SVDS_SCORING_SVD or SVDS_SCORING_PCA
  • SVDS_SOLVER : possible values include SVDS_SOLVER_TSSVD, SVDS_SOLVER_TSEIGEN, SVDS_SOLVER_SSVD, SVDS_SOLVER_STEIGEN
  • SVDS_TOLERANCE : range of 0…1
  • SVDS_RANDOM_SEED : range of 0…4294967296 (!)
  • SVDS_OVER_SAMPLING : range of 1…5000
  • SVDS_POWER_ITERATIONS : Default value 2, with possible range of 0…20

Let’s work through an example using the MINING_DATA_BUILD_V data set that comes with Oracle Data Miner.

First step is to define the parameter settings for the algorithm. No data preparation is needed as the algorithm takes care of this. This means you can disable the Automatic Data Preparation (ADP).

-- create the parameter table
CREATE TABLE svd_settings (
setting_name VARCHAR2(30),
setting_value VARCHAR2(4000));

-- define the settings for SVD algorithm
BEGIN 
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_singular_value_decomp);

   -- turn OFF ADP
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off); 

   -- set PCA scoring mode
   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.svds_scoring_mode, dbms_data_mining.svds_scoring_pca);

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_shift_2dnum, dbms_data_mining.prep_shift_mean); 

   INSERT INTO svd_settings (setting_name, setting_value) 
   VALUES (dbms_data_mining.prep_scale_2dnum, dbms_data_mining.prep_scale_stddev); 
END;
/

You are now ready to create the model.

BEGIN
   DBMS_DATA_MINING.CREATE_MODEL(
      model_name          => 'SVD_MODEL',
      mining_function     => dbms_data_mining.feature_extraction,
      data_table_name     => 'mining_data_build_v',
      case_id_column_name => 'CUST_ID',
      settings_table_name => 'svd_settings');
END;

When created, you can use the mining model data dictionary views to explore the model, and the model specific views to explore the various MxN matrices created by the model. These include:

  • DM$VESVD_Model : Singular Value Decomposition S Matrix
  • DM$VGSVD_Model : Global Name-Value Pairs
  • DM$VNSVD_Model : Normalization and Missing Value Handling
  • DM$VSSVD_Model : Computed Settings
  • DM$VUSVD_Model : Singular Value Decomposition U Matrix
  • DM$VVSVD_Model : Singular Value Decomposition V Matrix
  • DM$VWSVD_Model : Model Build Alerts

Where the S, V and U matrices contain:

  • U matrix : consists of a set of ‘left’ orthonormal bases
  • S matrix : is a diagonal matrix
  • V matrix : consists of set of ‘right’ orthonormal bases

These can be explored using the following

-- S matrix
select feature_id, VALUE, variance, pct_cum_variance 
from DM$VESVD_MODEL;

-- V matrix
select feature_id, attribute_name, value
from DM$VVSVD_MODEL
order by feature_id, attribute_name;

-- U matrix (only populated when SVDS_U_MATRIX_OUTPUT is set to SVDS_U_MATRIX_ENABLE)
select case_id, feature_id, value
from DM$VUSVD_MODEL
order by case_id, feature_id;

To determine the projections to be used for visualizations we can use the FEATURE_VALUE function.

select FEATURE_VALUE(svd_model, 1 USING *) proj1, 
       FEATURE_VALUE(svd_model, 2 USING *) proj2
from   mining_data_build_v 
where  cust_id <= 101510
order by 1, 2;

 

Other algorithms available in Oracle for feature extraction and reduction include:

  • Non-Negative Matrix Factorization (NMF)
  • Explicit Semantic Analysis (ESA)
  • Minimum Description Length (MDL) – this is really feature selection rather than feature extraction

OCI Data Science – Create a Project & Notebook, and Explore the Interface

Posted on Updated on

In my previous blog post I went through the steps of setting up OCI to allow you to access OCI Data Science. Those steps showed the setup and configuration for your Data Science Team.


In this post I will walk through the steps necessary to create an OCI Data Science Project and Notebook, and will then Explore the basic Notebook environment.

1 – Create a Project

From the main menu on the Oracle Cloud home page select Data Science -> Projects from the menu.


Select the appropriate Compartment in the drop-down list on the left hand side of the screen. In my previous blog post I created a separate Compartment for my Data Science work and team. Then click on the Create Projects button.

Enter a name for your project. I called this project ‘DS-Demo-Project’. Click the Create button.


That’s the Project created.

2 – Create a Notebook

After creating a project (see above) you can now create one or more Notebook Sessions.

To create a Notebook Session click on the Create Notebook Session button (see the above image). This will create a VM to contain your notebook and associated work. Just like all VMs in Oracle Cloud, they come in various shapes. These can be adjusted at a later time to scale up and then back down based on the work you will be performing.

The following example creates a Notebook Session using the basic VM shape. I called the Notebook ‘DS-Demo-Notebook’. I also set the Block Storage size to 50GB, which is the minimum value. The VCN details have been defaulted to those assigned to the Compartment. Click the Create button at the bottom of the page.


The Notebook Session VM will be created. This might take a few minutes. When created you will see a screen like the following.


3 – Open the Notebook

After completing the above steps you can now open the Notebook Session in your browser.  Either click on the Open button (see above image), or copy the link and share with your data science team.

Important: There are a few important considerations when using the Notebooks. While the session is running you will be paying for it, even if the session was terminated in the browser or you lost your connection. To manage costs, you may need to stop the Notebook session. More details on this in a later post.

After clicking on the Open button, a new browser tab will open and will ask you to log-in.


After logging in you will see your Notebook.


4 – Explore the Notebook Environment

The Notebook comes pre-loaded with lots of goodies.

The menu on the left-hand side provides a directory with lots of sample Notebooks, access to the block storage and a sample getting started Notebook.


When you are ready to create your own Notebook you can click on the icon for that.


Or if you already have a Notebook, created elsewhere, you can load that into your OCI Data Science environment.


The uploaded Notebook will appear in the list on the left-hand side of the screen.

Data Science (The MIT Press Essential Knowledge series) – available in English, Korean and Chinese

Posted on Updated on

Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. The book was published as part of their Essential Knowledge series.

During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!

2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.

The Japanese and Turkish translations will be available in a few months!

Go get the English version of the book on Amazon in print, Kindle and Audio formats.

https://amzn.to/2qC84KN

This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenges. The goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.

Go check it out.

Amazon.com.          Amazon.co.uk


Scottish Whisky Data Set – Updated

Posted on Updated on

The Scottish Whisky data set consists of tasting notes and evaluations from 86 distilleries around Scotland. This data set has been around a long time and was part of a promotional site for a book, Whisky Classified: Choosing Single Malts by Flavour. Written by David Wishart of the University of Saint Andrews, the book had its most recent printing in February 2012.

I’ve been using this data set in one of my conference presentations (Planning my Summer Vacation), but to use it I needed to add 2 new attributes/features to the data set. All of the attributes are listed below, and the last 2 are the ones I added. These were added to include the LAT and LONG values converted to a format compatible with Google Maps and other similar mapping technology.

Attributes include:

  • RowID
  • Distillery
  • Body
  • Sweetness
  • Smoky
  • Medicinal
  • Tobacco
  • Honey
  • Spicy
  • Winey
  • Nutty
  • Malty
  • Fruity
  • Floral
  • Postcode
  • Latitude
  • Longitude
  • lat  — newly added
  • long  — newly added

Here is the link to download and use this updated Scottish Whisky data set.

The original website is no longer available but if you have a look at the Internet Archive you will find the website.


#GE2020 Analysing Party Manifestos using Python

Posted on

The general election is underway here in Ireland with polling day set for Saturday 8th February. All the politicians are out campaigning and every day the various parties are looking for publicity on whatever the popular topic is for that day. Each day it is a different topic.

Most of the political parties have not released their manifestos for the #GE2020 election (as of the date of this post). I want to use some simple Python code to perform some analysis of their manifestos. As their new manifestos weren’t available (yet), I went looking for their manifestos from the previous general election. Michael Pidgeon has a website with party manifestos dating back to the early 1970s, and also has some from earlier elections. Check out his website.

I decided to look at manifestos from the 4 main political parties from the 2016 general election. Yes, there are other manifestos available, and you can use the Python code given below to analyse those, with only some minor edits required.

The end result of this simple analysis is a WordCloud showing the most commonly used words in their manifestos. This is a graphical way to see what some of the main themes and emphasis are for each party, and it also allows us to see some commonality between the parties.

Let’s begin with the Python code.

1 – Initial Setup

There are a number of Python libraries available for processing PDF files. Not all of them worked on all of the Party Manifesto PDFs! It kind of depends on how these files were generated. In my case I used the pdfminer library, as it worked with all four manifestos. The common library PyPDF2 didn’t work with the Fine Gael manifesto document.

import io
import pdfminer
from pprint import pprint
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#directory where manifestos are located
wkDir = '.../General_Election_Ire/'

#define the names of the Manifesto PDF files & setup party flag
pdfFile = wkDir+'FGManifesto16_2.pdf'
party = 'FG'
#pdfFile = wkDir+'Fianna_Fail_GE_2016.pdf'
#party = 'FF'
#pdfFile = wkDir+'Labour_GE_2016.pdf'
#party = 'LB'
#pdfFile = wkDir+'Sinn_Fein_GE_2016.pdf'
#party = 'SF'

All of the following code will run for a given manifesto. Just comment in or out the manifesto you are interested in. The WordClouds for each are given below.

2 – Load the PDF File into Python

The following code loops through each page in the PDF file and extracts the text from that page.

I added some additional code to ignore pages containing the Irish language. The Sinn Fein manifesto contained a number of pages which were the Irish equivalent of the preceding pages in English. I didn’t want to have a mixture of languages in the final output.

SF_IrishPages = [14,15,16,17,18,19,20,21,22,23,24]
text = ""

pageCounter = 0
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

for page in PDFPage.get_pages(open(pdfFile,'rb'), caching=True, check_extractable=True):
    if (party == 'SF') and (pageCounter in SF_IrishPages):
        print(party+' - Not extracting page - Irish page', pageCounter)
    else:
        print(party+' - Extracting Page text', pageCounter)
        page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    pageCounter += 1

print('Finished processing PDF document')
converter.close()
fake_file_handle.close()
FG - Extracting Page text 0
FG - Extracting Page text 1
FG - Extracting Page text 2
FG - Extracting Page text 3
FG - Extracting Page text 4
FG - Extracting Page text 5
...

3 – Tokenize the Words

The next step is to Tokenize the text. This breaks the text into individual words.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = []

tokens = word_tokenize(text)

print('Number of Pages =', pageCounter)
print('Number of Tokens =',len(tokens))
Number of Pages = 140
Number of Tokens = 66975

4 – Filter words, Remove Numbers & Punctuation

There will be a lot of things in the text that we don’t want included in the analysis. We want the text to only contain words. The following extracts the words and ignores numbers, punctuation, etc.

#converts to lower case, and removes punctuation and numbers
wordsFiltered = [token.lower() for token in tokens if token.isalpha()]
print(len(wordsFiltered))
print(wordsFiltered)
58198
['fine', 'gael', 'general', 'election', 'manifesto', 's', 'keep', 'the', 'recovery', 'going', 'gaelgeneral', 'election', 'manifesto', 'foreward', 'from', 'an', 'taoiseach', 'the', 'long', 'term', 'economic', 'three', 'steps', 'to', 'keep', 'the', 'recovery', 'going', 'agriculture', 'and', 'food', 'generational',
...

As you can see the number of tokens has reduced from 66,975 to 58,198.

5 – Setup Stop Words

Stop words are common words in a language that don’t carry much meaning, and these can be removed from the data set. Python NLTK comes with a set of stop words defined for many languages.

#We initialize the stopwords variable which is a list of words like 
#"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
....

Additional stop words can be added to this list. I added the words listed below. Some of these you might expect to be in the stop word list, others are to remove certain words that appeared in the various manifestos that don’t have a lot of meaning. I also added the name of the parties  and some Irish words to the stop words list.

#some extra stop words are needed after examining the data and word cloud
#these are added
extra_stop_words = ['ireland','irish','ł','need', 'also', 'set', 'within', 'use', 'order', 'would', 'year', 'per', 'time', 'place', 'must', 'years', 'much', 'take','make','making','manifesto','ð','u','part','needs','next','keep','election', 'fine','gael', 'gaelgeneral', 'fianna', 'fáil','fail','labour', 'sinn', 'fein','féin','atá','go','le','ar','agus','na','ár','ag','haghaidh','téarnamh','bplean','page','two','number','cothromfor']
stop_words.extend(extra_stop_words)
print(stop_words)

Now remove these stop words from the list of tokens.

# remove stop words from tokenised data set
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(len(filtered_words))
print(filtered_words)
31038
['general', 'recovery', 'going', 'foreward', 'taoiseach', 'long', 'term', 'economic', 'three', 'steps', 'recovery', 'going', 'agriculture', 'food',

The number of tokens is reduced to 31,038
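One small tweak worth knowing about: checking membership in a Python list gets slower as the list grows, whereas a set lookup stays fast. With tens of thousands of tokens, converting the stop words to a set first can speed up this filter. Here is a minimal sketch on a tiny made-up word list (the _demo names are just for illustration).

```python
# Tiny made-up lists standing in for wordsFiltered and stop_words
words_demo = ["the", "recovery", "and", "jobs", "the", "plan"]
stop_list_demo = ["the", "and", "of"]

# Convert the list to a set once, then filter against the set
stop_set_demo = set(stop_list_demo)
kept_demo = [word for word in words_demo if word not in stop_set_demo]

print(kept_demo)
```

The result is the same as filtering against the list; only the lookup is quicker.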

6 – Word Frequency Counts

Now calculate how frequently these words occur in the list of tokens.

#get the frequency of each word
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
    cnt[word] += 1

print(cnt)
Counter({'new': 340, 'support': 249, 'work': 190, 'public': 186, 'government': 177, 'ensure': 177, 'plan': 176, 'continue': 168, 'local': 150, 
...
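As a side note, Counter can also build the counts directly from a list, and its most_common method returns the top entries in descending order. A quick sketch on a made-up list of words:

```python
from collections import Counter

# Made-up tokens standing in for filtered_words
words_demo = ["plan", "support", "plan", "new", "support", "plan"]

# Passing the list straight to Counter counts every word in one step
cnt_demo = Counter(words_demo)

# The two most frequent words with their counts
top2 = cnt_demo.most_common(2)
print(top2)
```

This is handy for a quick look at the top 10 or 20 words before drawing the WordCloud.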

7 – WordCloud

We can use the word frequency counts to add emphasis to the WordCloud. The more frequently it occurs the larger it will appear in the WordCloud.

#create a word cloud using frequencies for emphasis 
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=100, margin=9, background_color='white',
scale=3, relative_scaling = 0.5, width=500, height=400,
random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20,10))
plt.imshow(wc)
#plt.axis("off")
plt.show()

#Save the image in the same directory as the manifestos:
wc.to_file(wkDir+party+"_2016.png")

The last line of code saves the WordCloud image as a file in the directory where the manifestos are located.

8 – WordClouds for Each Party

Screenshot 2020-01-21 11.10.25

Remember these WordClouds are for the manifestos from the 2016 general election.

When the parties have released their manifestos for the 2020 general election, I’ll run them through this code and produce the WordClouds for 2020. It will be interesting to see the differences between the 2016 and 2020 manifesto WordClouds.

Demographics vs Psychographics for Machine Learning

Posted on Updated on

When preparing data for data science, data mining or machine learning projects you will create a data set that describes the various characteristics of the subject or case record. Each attribute will contain some descriptive information about the subject and is related to the target variable in some way.

In addition to these attributes, the data set will be enriched with various other internal/external data to complete the data set.

Some of the attributes in the data set can be grouped under the heading of Demographics. Demographic data contains attributes that explain or describe the person or event each case record is focused on. For example, if the subject of the case record is based on Customer data, this is the “Who” the demographic data (and features/attributes) will be about. Examples of demographic data include:

  • Age range
  • Marital status
  • Number of children
  • Household income
  • Occupation
  • Educational level

These features/attributes are typically readily available within your data sources and, if they aren’t, they may be available from a purchased data set.

Additional feature engineering methods are used to generate new features/attributes that express meaning in different ways. This can be done by combining features in different ways, binning, dimensionality reduction, discretization, various data transformations, etc. The list can go on.
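As a small illustration of binning and of deriving a new feature from an existing one, here is a pandas sketch on a made-up customer table. The column names, the age bands and the has_children flag are all my own choices for the example, not a recommended scheme.

```python
import pandas as pd

# Made-up demographic data for four customers
customers = pd.DataFrame(
    {
        "age": [23, 37, 45, 61],
        "num_children": [0, 2, 3, 0],
    }
)

# Binning: turn the numeric age into an age range
customers["age_range"] = pd.cut(
    customers["age"],
    bins=[17, 29, 44, 59, 120],  # each bin includes its upper edge
    labels=["18-29", "30-44", "45-59", "60+"],
)

# Derived feature: a simple flag built from an existing attribute
customers["has_children"] = customers["num_children"] > 0

print(customers)
```

Each new column expresses the same underlying information in a different way, which can make some patterns easier for an algorithm to pick up.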

The aim of all of this is to enrich the data set to include more descriptive data about the subject. This enriched data set will then be used by the machine learning algorithms to find the hidden patterns in the data. The richer and more descriptive the data set is, the greater the likelihood of the algorithms detecting the various relationships between the features and their values. These relationships will then be included in the created/generated model.

Another approach to consider when creating and enriching your data set is to move beyond the descriptive features typically associated with Demographic data, to include Psychographic data.

Psychographic data is a variation on demographic data where the features describe the habits of the subject or customer. Demographics focus on the “who” while psychographics focus on the “why”. For example, a common problem with data sets is that they describe subjects/people who have things in common. In such scenarios we want to understand them at a deeper level. Psychographics allows us to do this. Examples of psychographics include:

  • Lifestyle activities
  • Evening activities
  • Purchasing interests – quality over economy,  how environmentally concerned are you
  • How happy are you with work, family, etc
  • Social activities and changes in these
  • What attitudes you have for certain topic areas
  • What are your principles and beliefs

The above gives a far deeper insight into the subject/person and helps to differentiate each subject/person from each other, when there is a high similarity between all subjects in the data set. For example, demographic information might tell you something about a person’s age, but psychographic information will tell you that the person is just starting a family and is in the market for baby products.

I’ll close with this. Consider the various types of data gathering that companies like Google, Facebook, etc. perform. They gather lots of different types of data about individuals. This allows them to build up a complete and extensive profile of all activities for individuals. They can use this to deliver more accurate marketing and advertising. For example, Google gathers data about the places you visit throughout the day, all your searches, and lots of other activities. They can do a lot with this data, but now they also own Fitbit. Think about what they can do with that data, particularly when combined with all the other data they have about you. What if they had access to your medical records too? Go Google this! You will find articles about them now having access to your health records. Again, combine all of the data from these different data sources. How valuable is that data?