HiveML : Using SQL for ML on Big Data


It is widely recognised that SQL is one of the core languages that every data scientist needs to know. Not just know but know really well. If you are going to be working with data (big or small) you are going to use SQL to access the data. You may use some other tools and languages as part of your data science role, but for processing data SQL is king.

During the era of big data and Hadoop, it was all about moving the code to where the data was located. Over time we have seen a number of different languages and approaches put forward to allow us to process the data in these big environments. One of the most common is Spark. As with all languages there can be a steep learning curve, and as newer languages become popular, the need to change and learn new languages is becoming a lot more frequent.

We have seen many of the mainstream database vendors include machine learning in their databases, thereby allowing users to do machine learning using SQL. In the big data world there have been many attempts to do the same, by building SQL interfaces for machine learning in a big data environment.

One such (newer) SQL machine learning engine is called Hivemall. It allows anyone with a basic knowledge of SQL to quickly get started with machine learning. Apache Hivemall is built to be a scalable machine learning library that runs on Apache Hive, Apache Spark, and Apache Pig.


Hivemall is currently at incubator stage under Apache and version 0.6 was released in December 2018.

I have a number of big data/Hadoop environments in my home lab and built on a couple of cloud vendors (Oracle and AWS). I installed Hivemall easily on my Oracle BigDataLite VM and on my own custom-built Hadoop environment on Oracle Cloud. With a few simple commands you will have Hivemall up and running. I initially installed it for just Hive and then updated it to use Spark.

Hivemall expands the analytical functions available in Hive, and also provides data preparation functions and the typical range of machine learning functions needed for the vast majority of machine learning use cases.

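Download the hivemall-core-xxx-with-dependencies.jar file (xxx is the version number), along with the define-all.hive script that is sourced in the setup below.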

# Set up your environment in $HOME/.hiverc (adjust the paths to where you saved the files)
add jar /home/myui/tmp/hivemall-core-xxx-with-dependencies.jar;
source /home/myui/tmp/define-all.hive;

This automatically loads all Hivemall functions every time you start a Hive session.

# Create a directory in HDFS for the JAR
hdfs dfs -mkdir -p /apps/hivemall
hdfs dfs -chmod -R 777 /apps/hivemall
# Copy the versioned JAR to a version-free name (substitute the version you downloaded)
cp hivemall-core-0.4.2-rc.2-with-dependencies.jar hivemall-with-dependencies.jar
# Upload the JAR to HDFS
hdfs dfs -put hivemall-with-dependencies.jar /apps/hivemall/
hdfs dfs -put hivemall-with-dependencies.jar /apps/hive/warehouse

You might want to create a new database in Hive for your Hivemall work.

CREATE DATABASE IF NOT EXISTS hivemall;
USE hivemall;
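
The copy of the JAR in HDFS is what allows the functions to be registered permanently in this database, rather than reloaded for each session. Here is a minimal sketch of that step. It assumes the define-all-as-permanent.hive script from the Hivemall distribution has been saved to /tmp (an illustrative location only) and that the JAR sits at the HDFS path used above.

```sql
-- Assumption: the JAR was uploaded to /apps/hivemall/ in HDFS beforehand.
CREATE DATABASE IF NOT EXISTS hivemall;
USE hivemall;

-- The DDL script refers to the JAR through the hivemall_jar variable.
SET hivevar:hivemall_jar=hdfs:///apps/hivemall/hivemall-with-dependencies.jar;

-- Assumption: /tmp is only an illustrative location for the script.
SOURCE /tmp/define-all-as-permanent.hive;
```

Permanent functions in Hive are namespaced by the database they are created in, so functions registered this way are listed with a hivemall. prefix.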

Then list all the Hivemall functions (only the first few rows of the output are shown here):

show functions "hivemall.*";

| tab_name                                |
| hivemall.add_bias                       |
| hivemall.add_feature_index              |
| hivemall.amplify                        |
| hivemall.angular_distance               |
| hivemall.angular_similarity             |

Hivemall for ML using SQL is now up and running. The next step is to try out the various analytical and ML functions.
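
As a first check, something like the following could be run from a Hive session. It is only a sketch, and assumes the functions were registered in the hivemall database. The feature strings passed to add_bias are made-up sample values in Hivemall's index:value format.

```sql
USE hivemall;

-- Report the installed Hivemall version.
SELECT hivemall_version();

-- add_bias takes an array of feature strings and appends a bias term to it.
SELECT add_bias(array('1:0.5', '2:1.2'));
```

If both queries return without a "function not found" style error, the JAR and the function definitions are being picked up correctly.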