Back in the middle of 2018 MIT Press published my Data Science book, co-written with John Kelleher. It book was published as part of their Essentials Series.
During the few months it was available in 2018 it became a best seller on Amazon, and one of the top best selling books for MIT Press. This happened again in 2019. Yes, two years running it has been a best seller!
2020 kicks off with the book being translated into Korean and Chinese. Here are the covers of these translated books.
The Japanese and Turkish translations will be available in a few months!
Go get the English version of the book on Amazon in print, Kindle and Audio formats.
This book gives a concise introduction to the emerging field of data science, explaining its evolution, relation to machine learning, current uses, data infrastructure issues and ethical challenge the goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us online, which emails are filtered into our spam folders, even how much we pay for health insurance.
Go check it out.
Over the past 18 months or so most of the examples of using machine learning have been on looking at images and identifying objects in them. There are the typical examples of examining pictures looking for a Cat or a Dog, or some famous person, etc. Most of these examples are very noddy, although they do illustrate important examples.
But what if this same technology was used to monitor people going about their daily lives. What if pictures and/or video was captured of you as you walked down the street or on your way to work or to a meeting. These pictures and videos are being taken of you without you knowing.
And this raises a wide range of Ethical concerns. There are the ethics of deploying such solutions in the public domain, but there are also ethical concerns for the data scientists, machine learner, and other people working on these projects. “Just because we can, doesn’t mean we should”. People need to decide, if they are working on one of these projects, if they should be working on it and if not what they can do.
Ethics are the principals of behavior based on ideas of right and wrong. Ethical principles often focus on ideas such as fairness, respect, responsibility, integrity, quality, transparency and trust. There is a lot in that statement on Ethics, but we all need to consider that is right and what is wrong. But instead of wrong, what is grey-ish, borderline scenarios.
Here are some examples that might fall into the grey-ish space between right and wrong. Why they might fall more towards the wrong is because most people are not aware their image is being captured and used, not just for a particular purpose at capture time, but longer term to allow for better machine learning models to be built.
Can you imagine walking down the street with a digital display in front of you. That display is monitoring you, and others, and then presents personalized adverts on the digital display aim specifically at you. A classify example of this is in the film Minority Report. This is no longer science fiction.
This is happening at the Westfield shopping center in London and in other cities across UK and Europe. These digital advertisement screens are monitoring people, identifying their personal characteristics and then customizing the adverts to match in with the profile of the people walking past. This solutions has been developed and rolled out by Ocean Out Door. They are using machine learning to profile the individual people based on gender, age, facial hair, eye wear, mood, engagement, attention time, group size, etc. They then use this information to:
- Optimisation – delivering the appropriate creative to the right audience at the right time.
- Visualise – Gaze recognition to trigger creative or an interactive experience
- AR Enabled – Using the HD cameras to create an augmented reality mirror or window effect, creating deep consumer engagement via the latest technology
- Analytics – Understanding your brand’s audience, post campaign analysis and creative testing
Face Plus Plus can monitor people walking down the street and do similar profiling, and can bring it to another level where by they can identify what clothing you are wearing and what the brand is. Image if you combine this with location based services. An example of this, imagine you are walking down the high street or a major retail district. People approach you trying to entice you into going into a particular store, and they offer certain discounts. But you are with a friend and the store is not interested in them.
The store is using video monitoring, capturing details of every person walking down the street and are about to pass the store. The video is using machine/deep learning to analyze you profile and what brands you are wearing. The store as a team of people who are deployed to stop and engage with certain individuals, just because they make the brands or interests of the store and depending on what brands you are wearing can offer customized discounts and offers to you.
How comfortable would you be with this? How comfortable would you be about going shopping now?
For me, I would not like this at all, but I can understand why store and retail outlets are interested, as they are all working in a very competitive market trying to maximize every dollar or euro they can get.
Along side the ethical concerns, we also have some legal aspects to consider. Some of these are a bit in the grey-ish area, as some aspects of these kind of scenarios are slightly addresses by EU GDPR and the EU Artificial Intelligence guidelines. But what about other countries around the World. Then it comes to training and deploying these facial models, they are dependent on having a good training data set. This means they needs lots and lots of pictures of people and these pictures need to be labelled with descriptive information about the person. For these public deployments of facial recognition systems, then will need more and more training samples/pictures. This will allow the models to improve and evolve over time. But how will these applications get these new pictures? They claim they don’t keep any of the images of people. They only take the picture, use the model on it, and then perform some action. They claim they do not keep the images! But how can they improve and evolve their solution?
I’ll have another blog post giving more examples of how machine/deep learning, video and image captures are being used to monitor people going about their daily lives.
Ethics is one of those topics that everyone has a slightly different definition or view of what it means. The Oxford English dictionary defines ethics as, ‘Moral principles that govern a person’s behaviour or the conducting of an activity‘.
As you can imagine this topic can be difficult to discuss and has many, many different aspects.
In the era of AI, Machine Learning, Data Science, etc the topic of Ethics is finally becoming an important topic. Again there are many perspective on this. I’m not going to get into these in this blog post, because if I did I could end up writing a PhD dissertation on it.
But if you do work in the area of AI, Machine Learning, Data Science, etc you do need to think about the ethical aspects of what you do. For most people, you will be working on topics where ethics doesn’t really apply. For example, examining log data, looking for trends, etc
But when you start working of projects examining individuals and their behaviours then you do need to examine the ethical aspects of such work. Everyday we experience adverts, web sites, marketing, etc that has used AI, Machine Learning and Data Science to delivery certain product offerings to us.
Just because we can do something, doesn’t mean we should do it.
One particular area that I will not work on is Location Based Advertising. Imagine walking down a typical high street with lots and lots of retail stores. Your phone vibrates and on the screen there is a message. The message is a special offer or promotion for one of the shops a short distance ahead of you. You are being analysed. Your previous buying patterns and behaviours are being analysed, Your location and direction of travel is being analysed. Some one, or many AI applications are watching you. This is not anything new and there are lots of examples of this from around the world.
But what if this kind of Location Based Advertising was taken to another level. What if the shops had cameras that monitored the people walking up and down the street. What if those cameras were analysing you, analysing what clothes you are wearing, analysing the brands you are wearing, analysing what accessories you have, analysing your body language, etc. They are trying to analyse if you are the kind of person they want to sell to. They then have staff who will come up to you, as you are walking down the street, and will have customised personalised special offers on products in their store, just for you.
See the segment between 2:00 and 4:00 in this video. This gives you an idea of what is possible.
Are you Ok with this?
As an AI, Machine Learning, Data Science professional, are you Ok with this?
The technology exists to make this kind of Location Based Marketing possible. This will be an increasing ethical consideration over the coming years for those who work in the area of AI, Machine Learning, Data Science, etc
Just because we can, doesn’t mean we should!
Machine learning is a very popular topic in recent times, and we keep hearing about languages such as R, Python and Spark. In addition to these we have commercially available machine learning languages and tools from SAS, IBM, Microsoft, Oracle, Google, Amazon, etc., etc. Everyone want a slice of the machine learning market!
The Apache Foundation supports the development of new open source projects in a number of areas. One such area is machine learning. If you have read anything about machine learning you will have come across Spark, and maybe you might believe that everyone is using it. Sadly this isn’t true for lots of reasons, but it is very popular. Spark is one of the project support by the Apache Foundation.
But are there any other machine learning projects being supported under the Apache Foundation that are an alternative to Spark? The follow lists the alternatives and lessor know projects: (most of these are incubator/retired/graduated Apache projects)
|Flink||Flink is an open source system for expressive, declarative, fast, and efficient data analysis. Stratosphere combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. Flink was originally known as Stratosphere when it entered the Incubator.
|HORN||HORN is a neuron-centric programming APIs and execution framework for large-scale deep learning, built on top of Apache Hama.
|HiveMail||Hivemall is a library for machine learning implemented as Hive UDFs/UDAFs/UDTFs
Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta. Apache Hivemall offers a variety of functionalities: regression, classification, recommendation, anomaly detection, k-nearest neighbor, and feature engineering. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, and AdaDelta.
|MADlib||Apache MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. Key features include: Operate on the data locally in-database. Do not move data between multiple runtime environments unnecessarily; Utilize best of breed database engines, but separate the machine learning logic from database specific implementation details; Leverage MPP shared nothing technology, such as the Greenplum Database and Apache HAWQ (incubating), to provide parallelism and scalability.
|MXNet||A Flexible and Efficient Library for Deep Learning . MXNet provides optimized numerical computation for GPUs and distributed ecosystems, from the comfort of high-level environments like Python and R MXNet automates common workflows, so standard neural networks can be expressed concisely in just a few lines of code.
|OpenNLP||OpenNLP is a machine learning based toolkit for the processing of natural language text. OpenNLP supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution.
|PredictionIO||PredictionIO is an open source Machine Learning Server built on top of state-of-the-art open source stack, that enables developers to manage and deploy production-ready predictive services for various kinds of machine learning tasks.
|SAMOA||SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms that run on top of distributed stream processing engines (DSPEs). It features a pluggable architecture that allows it to run on several DSPEs such as Apache Storm, Apache S4, and Apache Samza.
|SINGA||SINGA is a distributed deep learning platform. An intuitive programming model based on the layer abstraction is provided, which supports a variety of popular deep learning models. SINGA architecture supports both synchronous and asynchronous training frameworks. Hybrid training frameworks can also be customized to achieve good scalability. SINGA provides different neural net partitioning schemes for training large models.
|Storm||Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language.
|SystemML||SystemML provides declarative large-scale machine learning (ML) that aims at flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from single node, in-memory computations, to distributed computations such as Apache Hadoop MapReduce and Apache Spark.
I will have a closer look that the following SQL based machine learning languages in a lager blog post: