
A selection of Hadoop Docker Images


When it comes to big data platforms, one of the biggest challenges is getting a test environment set up where you can try out the various components. There are a few approaches to doing this. The first is to set up your own virtual machine or some other container with the software, but it can be challenging to get even a handful of big data applications working together on one machine.

But there is an alternative approach. You can use one of the preconfigured environments from the likes of AWS, Google, Azure, Oracle, etc. But in most cases these come with a cost. Maybe not in the beginning, but after a little use you will need to start handing over some dollars. They also require access to the cloud, i.e. a network connection, to run. Again, not always possible!

So what if you want a local big data and Hadoop environment on your own PC or laptop, or in your home or office test lab? There are a lot of virtual machines available, but most of these have sizeable hardware requirements, particularly for memory, with many requiring 16+GB of RAM! In more recent times this might not be a problem, but for many it still is: your machine may not have that amount of memory, or may not allow you to upgrade.

What can you do?

Have you considered using Docker? There are many different Hadoop Docker images available, and these are not as resource or hardware hungry as the virtual machines.

Here are some that I've tried out and that you might find useful.

Cloudera QuickStart image

You may have tried their VM; now go try the Cloudera QuickStart Docker image.

Read about it here.
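If you want to give it a quick try, the image can be pulled and run with something like the following. This is a sketch based on Cloudera's published instructions; the image name, port and init script were correct at the time of writing, but check their documentation for the current details.

# Pull the Cloudera QuickStart image from Docker Hub
docker pull cloudera/quickstart:latest

# Run it interactively; the hostname and init script come from Cloudera's docs,
# and publishing port 8888 gives you access to Hue from your browser
docker run --hostname=quickstart.cloudera --privileged=true -t -i \
    -p 8888:8888 cloudera/quickstart /usr/bin/docker-quickstart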

Check out Docker Hub for lots and lots of images.

Docker Hub is not the only place to get Hadoop Docker images. There are lots on GitHub too. Just do a quick Google search to find the many, many images out there.
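You can even search Docker Hub from the command line:

# List Hadoop-related images published on Docker Hub
docker search hadoop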

These Docker Hadoop images are a great way for you to try out these Big Data platforms and environments with the minimum of resources.


Lesser known Apache Machine Learning Languages


Machine learning is a very popular topic in recent times, and we keep hearing about languages such as R, Python and Spark. In addition to these we have commercially available machine learning languages and tools from SAS, IBM, Microsoft, Oracle, Google, Amazon, etc. Everyone wants a slice of the machine learning market!

The Apache Foundation supports the development of new open source projects in a number of areas. One such area is machine learning. If you have read anything about machine learning you will have come across Spark, and you might believe that everyone is using it. Sadly this isn't true, for lots of reasons, but it is very popular. Spark is one of the projects supported by the Apache Foundation.

But are there any other machine learning projects supported by the Apache Foundation that are an alternative to Spark? The following lists the alternatives and lesser-known projects (most of these are incubator/retired/graduated Apache projects):

Flink Flink is an open source system for expressive, declarative, fast, and efficient data analysis. It combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. Flink was originally known as Stratosphere when it entered the Incubator.

Documentation

(graduated)

HORN HORN is a neuron-centric programming API and execution framework for large-scale deep learning, built on top of Apache Hama.

Wiki Page


(Retired)

Hivemall Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs.

Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta.

Documentation

(incubator)
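Because Hivemall is just a set of Hive functions, a quick way to check it is working is to call one of its UDFs from the hive CLI. A minimal sketch, assuming the Hivemall jar and its function definitions have already been loaded as per the documentation:

# hivemall_version() is one of the UDFs that Hivemall registers
hive -e "SELECT hivemall_version();"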

MADlib Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Key features include: operating on the data locally, in-database, without unnecessarily moving data between multiple runtime environments; utilizing best-of-breed database engines while separating the machine learning logic from database-specific implementation details; and leveraging MPP shared-nothing technology, such as the Greenplum Database and Apache HAWQ (incubating), to provide parallelism and scalability.

Documentation

(graduated)
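Since MADlib runs inside the database, its functions are called from plain SQL. A minimal sketch, assuming MADlib has been installed into the madlib schema of a PostgreSQL or Greenplum database (the database name mydb is just an example):

# madlib.version() reports the installed MADlib version
psql -d mydb -c "SELECT madlib.version();"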

MXNet A flexible and efficient library for deep learning. MXNet provides optimized numerical computation for GPUs and distributed ecosystems, from the comfort of high-level environments like Python and R. MXNet automates common workflows, so standard neural networks can be expressed concisely in just a few lines of code.

Webpage

(incubator)

OpenNLP OpenNLP is a machine learning based toolkit for the processing of natural language text. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.

Documentation

(graduated)

PredictionIO PredictionIO is an open source machine learning server built on top of a state-of-the-art open source stack, which enables developers to manage and deploy production-ready predictive services for various kinds of machine learning tasks.

Documentation

(graduated)

SAMOA SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza.

Documentation

(incubator)

SINGA SINGA is a distributed deep learning platform. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models.

Documentation

(incubator)

Storm Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple and can be used with any programming language.

Documentation

(graduated)

SystemML SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans, ranging from single-node in-memory computations to distributed computations on Apache Hadoop MapReduce and Apache Spark.

Documentation

(graduated)


I will have a closer look at the following SQL-based machine learning languages in a later blog post:

MADlib

Storm


Configuring RStudio Server for Oracle R Enterprise


In this blog post I will show you the configurations that are necessary for RStudio Server to work with Oracle R Enterprise on your Oracle Database server. In theory if you have just installed ORE and then RStudio Server, everything should work, but if you encounter any issues then check out the following.

Before I get started, make sure to check out my previous blog posts on installing RStudio Server. The first blog post covered installing and configuring RStudio Server on the Oracle BigDataLite VM; that is an automated install. The second blog post was a step-by-step guide to installing RStudio Server on your (Oracle) Linux Database Server, and how to open the port on the VM using VirtualBox.

Right. Let's get back to configuring RStudio Server to work with Oracle R Enterprise. The following assumes you have completed the second blog post mentioned above.

1. Edit the rserver.conf file

Add in the values and locations for RHOME and ORACLE_HOME

sudo vi /etc/rstudio/rserver.conf
    rsession-ld-library-path=RHOME/lib:ORACLE_HOME/lib
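As a concrete illustration, on a server where R is installed under /usr/lib64/R and the database home is a typical 12c location, the entry might look like the following. These paths are examples only; use the values from your own server.

    rsession-ld-library-path=/usr/lib64/R/lib:/u01/app/oracle/product/12.1.0/dbhome_1/lib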

2. Edit the .Renviron file.

Add in the values for ORACLE_HOME, ORACLE_HOSTNAME and ORACLE_SID

cd /home/oracle
sudo vi .Renviron
    ORACLE_HOME=ORACLE_HOME
    ORACLE_HOSTNAME=ORACLE_HOSTNAME
    ORACLE_SID=ORACLE_SID
 
export ORACLE_HOME
export ORACLE_HOSTNAME
export ORACLE_SID
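As a concrete illustration, on a typical single-instance 12c server the file might contain something like the following (the values shown are examples only). Note that .Renviron is read by R as simple name=value pairs, so the export lines are only needed if you also set these variables in a shell profile.

    ORACLE_HOME=/u01/app/oracle/product/12.1.0/dbhome_1
    ORACLE_HOSTNAME=localhost.localdomain
    ORACLE_SID=cdb12c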

3. To access the Oracle R Distribution

Add the following to the /usr/lib/rstudio-server/R/modules/SessionHelp.R file for the version of Oracle R Distribution you installed prior to installing Oracle R Enterprise.

.rs.addFunction( "httpdPortIsFunction", function() {
   getRversion() >= "3.2"
})

You are now done with all the installations and configurations.
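If the changes don't seem to take effect, restart RStudio Server and check the installation. Both subcommands are part of the standard rstudio-server admin tooling:

sudo rstudio-server restart
sudo rstudio-server verify-installation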

Installing RStudio Server on an (Oracle) Linux server


In a previous blog post I showed how you can install and get started with using RStudio on a server by using RStudio Server. My previous post showed how you could do that on the Oracle BigDataLite VM. On this VM everything was nicely scripted and set up for you. But when it comes to installing it on a different server, well things can be a bit different.

The purpose of this blog post is to go through the install steps you need to follow on your own server or Oracle Database server. The following is based on a server that is set up with Oracle Linux. (I'm actually using the Oracle DB Developer VM.)

1. Download the latest version of RStudio Server.

Use the following link to download RStudio Server, but do a quick check on the RStudio website first to get the current version number.

wget https://download2.rstudio.org/rstudio-server-rhel-0.99.892-x86_64.rpm

The following shows you what you will see when you run this command.

--2016-03-16 06:22:30--  https://download2.rstudio.org/rstudio-server-rhel-0.99.892-x86_64.rpm
Resolving download2.rstudio.org (download2.rstudio.org)... 54.192.28.107, 54.192.28.54, 54.192.28.12, ...
Connecting to download2.rstudio.org (download2.rstudio.org)|54.192.28.107|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38814908 (37M) [application/x-redhat-package-manager]
Saving to: ‘rstudio-server-rhel-0.99.892-x86_64.rpm’

100%[============================================================>] 38,814,908  6.54MB/s   in 6.0s  

2016-03-16 06:22:37 (6.17 MB/s) - ‘rstudio-server-rhel-0.99.892-x86_64.rpm’ saved [38814908/38814908]

2. Install RStudio Server

sudo yum install --nogpgcheck rstudio-server-rhel-0.99.892-x86_64.rpm

When prompted whether it is OK to install, enter y (see the prompt near the end of the output below).

Loaded plugins: langpacks
Examining rstudio-server-rhel-0.99.892-x86_64.rpm: rstudio-server-0.99.892-1.x86_64
Marking rstudio-server-rhel-0.99.892-x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package rstudio-server.x86_64 0:0.99.892-1 will be installed
--> Finished Dependency Resolution
ol7_UEKR3/x86_64                                                                    | 1.2 kB  00:00:00    
ol7_addons/x86_64                                                                   | 1.2 kB  00:00:00    
ol7_latest/x86_64                                                                   | 1.4 kB  00:00:00    
ol7_optional_latest/x86_64                                                          | 1.2 kB  00:00:00    

Dependencies Resolved

===========================================================================================================
 Package               Arch          Version             Repository                                   Size
===========================================================================================================
Installing:
 rstudio-server        x86_64        0.99.892-1          /rstudio-server-rhel-0.99.892-x86_64        280 M

Transaction Summary
===========================================================================================================
Install  1 Package

Total size: 280 M
Installed size: 280 M

Is this ok [y/d/N]: y

Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : rstudio-server-0.99.892-1.x86_64                                                        1/1
groupadd: group 'rstudio-server' already exists
rsession: no process found
ln -s '/etc/systemd/system/rstudio-server.service' '/etc/systemd/system/multi-user.target.wants/rstudio-server.service'
rstudio-server.service - RStudio Server
   Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled)
   Active: active (running) since Wed 2016-03-16 10:46:00 PDT; 1s ago
  Process: 3191 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=0/SUCCESS)
 Main PID: 3192 (rserver)
   CGroup: /system.slice/rstudio-server.service
           ├─3192 /usr/lib/rstudio-server/bin/rserver
           └─3205 /usr/lib64/R/bin/exec/R --slave --vanilla -e cat(R.Version()$major,R.Version()$minor,~+~sep=".")

Mar 16 10:46:00 localhost.localdomain systemd[1]: Started RStudio Server.
  Verifying  : rstudio-server-0.99.892-1.x86_64                                                        1/1

Installed:
  rstudio-server.x86_64 0:0.99.892-1                                                                      

Complete!

3. Open RStudio using a web browser.

Open your favourite web browser and put in the host name or the IP address of your server. In my example I'm using the Oracle DB Developer VM to demonstrate the install, so I can use localhost, followed by the port number for RStudio Server (8787 by default).


Log in using your Server username and password. This is oracle/oracle on the VM.


4. Use and Enjoy

Once you are logged into RStudio Server you will see the familiar RStudio IDE running in your web browser.

Job Done and Enjoy!

5. An extra step when using the Oracle DB Developer VM

If you want to use RStudio on the Oracle DB Developer VM from your local OS, then you will need to open port 8787 on the VM. To do this, power down the VM if you have it open. Then open the Network section of the VM settings (I'm using VirtualBox) and click on Port Forwarding. Add a rule that forwards host port 8787 to guest port 8787.


Click on OK to save your Port Forwarding setting and then click on the OK button again to close the Network settings for the VM.
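Alternatively, if you prefer the command line, the same rule can be added with VirtualBox's VBoxManage tool while the VM is powered off. A sketch, assuming your VM is named "Oracle DB Developer VM" (substitute your own VM name):

VBoxManage modifyvm "Oracle DB Developer VM" --natpf1 "rstudio,tcp,,8787,,8787"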

Now start up the VM. When it has loaded and you have the desktop displayed in the VM window, you should now be able to connect to RStudio in the VM, from your local machine.

To do this open your web browser on your local machine and type in

http://localhost:8787

You should now get the RStudio login screen described in point 3 above. Go ahead, log in and enjoy.

6. A little warning

Make sure to log out of RStudio when you are finished using it. If you don't, then your R environment may not have been saved and you will get a message when you next log in. We don't want that happening, so just log out of RStudio. You can do that from the top right-hand corner of the RStudio Server application.

I will have one more blog post on how you can configure RStudio Server to work with an Oracle Database server that has Oracle R Enterprise installed.

Spark versus Flink


Spark is an open source Apache project that provides a framework for multi-stage in-memory analytics. Spark is based on the Hadoop platform and can interface with Cassandra, OpenStack Swift, Amazon S3, Kudu and HDFS. Spark comes with a suite of analytic and machine learning algorithms, allowing you to perform a wide variety of analytics on your distributed Hadoop platform. This allows you to generate data insights, data enrichment and data aggregations for storage on Hadoop, to be used by other more mainstream analytics tools as part of your traditional infrastructure. Spark is primarily aimed at batch-type analytics, but it does come with capabilities for streaming data. When data needs to be analysed it is loaded into memory and the results are then written back to Hadoop.


Flink is another open source Apache project, one that provides a platform for distributed stream and batch data processing. Similarly to Spark, Flink comes with a set of APIs that allow for easy integration with Java, Scala and Python. Its machine learning algorithms have been tuned specifically to work with streaming data, but can also work on batch-oriented data. As Flink is focused on being able to process streaming data, it runs on YARN, works with HDFS, can be easily integrated with Kafka and can connect to various other data storage systems.


Although both Spark and Flink can process streaming data, when you examine the underlying architecture of these tools you will find that Flink is more specifically focused on streaming data and can process this data in a more efficient manner.
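For a feel of how similar the two are from the outside, here is roughly how you would submit an application to each when running on YARN. A sketch only; the jar and class names are hypothetical:

# Submit an application to Spark on YARN
spark-submit --master yarn --class com.example.MyApp my-spark-app.jar

# Submit the equivalent application to Flink on YARN
flink run -m yarn-cluster -c com.example.MyApp my-flink-app.jar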

There have been suggestions in recent weeks and months that Spark is no longer the tool of choice for analytics on Hadoop, and that instead everyone should be using Flink or something else. Perhaps it is too early to say this. You need to consider the number of companies that have invested a significant amount of time and resources building and releasing products on top of Spark. These two products provide similar-ish functionality, but each is designed to process data in a different manner. So it really depends on what kind of data you need to process: whether it is batch or streaming will determine which of these products you should use. In some environments it may be suitable to use both.

Will these tools replace the more traditional advanced analytics tools in organisations? The simple answer is no, they won't replace them. Instead they will complement each other, and if you have a Hadoop environment you will probably end up using Spark to process the data on Hadoop. For all other advanced analytics that are part of your more traditional environments, you will use the traditional advanced analytics tools from the more mainstream vendors.