Month: May 2018
The call for papers (presentations) for the UKOUG Annual Conferences is open until 9am (UK time) on Monday 4th June.
Me: What are you waiting for? Go and submit a topic! Why not!
You: Hmm, well… (excuse, excuse, …)
You: I couldn’t do that! Present at a conference?
Me: Why not?
You: That is only for experts and I’m not one.
Me: Wrong! If you have a story to tell, then you can present.
You: But I’ve never presented before, it scares me, but one day I’d like to try.
Me: Go for it, do it. If you want you can co-present with me.
You: But, But, But …..
I’m sure you have experienced something like the above conversation before.
You don’t have to be an expert to present. You don’t have to know everything about a product, you don’t have to be using the latest and greatest technologies, and you don’t have to present on something complex (the list goes on and on).
The main thing to remember is: if you have a story to tell, then that is your presentation. It might be simple or complex, it might interest only a few people, it might involve making lots of bits of technology work together, using a particular application in a certain way, finding something interesting, or trying out a new process (the list goes on and on).
I’ve talked to people who have “ranted” for two hours about a certain topic (it was about Dates in Oracle), but when I said they should give a presentation on it, they said, “No, I couldn’t do that!” (If you are that person and you are reading this, then go on and submit that presentation.)
If you don’t want to present alone, then reach out to someone else and ask them if they are interested in co-presenting. Most experienced presenters would be very happy to do this.
You: But the topic area I want to talk about is not listed on the submission page. What do I do?
Me: Good point, just submit it and pick the topic area that is closest.
You: But my topic would be of interest to both the Apps and Tech conferences. What do I do?
Me: Submit it to both, and let the agenda planners work out where it will fit.
I’ve presented at both Apps and Tech over the years, and sometimes my Tech submission has been moved to and accepted for the Apps conference, and vice versa.
(This is the first part of what will probably be a five-part blog series on Twitter analytics using Python. Make sure to check out the other posts; I’ll post a wrap-up blog post that points to all the posts in the series.)
(Yes, there are lots of other examples out there, but I’ve put these notes together as a reminder for myself and for a particular project I’m testing.)
In this first blog post I will look at what you need to do to get yourself set up for analysing tweets, how to harvest tweets, and how to do some basic processing. These are covered in the following five steps.
Step 1 – Setup your Twitter Developer Account & Codes
Before you can start writing code you need to get yourself set up with Twitter to allow you to download their data using the Twitter API.
To do this you need to register with Twitter: go to apps.twitter.com and log in using your Twitter account if you have one. If not, you need to create an account first.
Next click on the Create New App button.
Then give your app a name (Twitter Analytics using Python), a description and a website link (e.g. your blog or something else), click on the ‘add a Callback URL’ button, and finally tick the check box to agree to the Developer Agreement. Then click the ‘Create your Twitter Application’ button.
You will then get a web page like the following that contains lots of very important information. Keep the information on this page safe as you will need it later when creating your connection to Twitter.
The details contained on this web page (and below what is shown in the above image) will allow you to use the Twitter REST APIs to interact with the Twitter service.
Step 2 – Install libraries for processing Twitter Data
As with most languages, there is a bunch of code and libraries available for you to use, and the same is true for Python and Twitter. The Tweepy library is very popular; make sure to check out the Tweepy website for full details of what it will allow you to do.
To install Tweepy, run the following.
pip3 install tweepy
It will download and install tweepy and any dependencies.
Step 3 – Initial Python code and connecting to Twitter
You are all set to start writing Python code to access, process and analyse Tweets.
The first thing you need to do is import the tweepy library. After that, you will need to use the important codes that were shown on the Twitter webpage produced in Step 1 above to create an authorised connection to the Twitter API.
After you have filled in your consumer and access token values and run this code, you will not get any output; that is expected.
Step 4 – Get User Twitter information
The easiest way to start exploring Twitter is to find out information about your own Twitter account. There is an API function called ‘me’ that gathers the user object details from Twitter; from there you can print them to the screen or do other things with them. The following is an example using my Twitter account.
# Get Twitter information about my Twitter account
user = api.me()
print('Name: ' + user.name)
print('Twitter Name: ' + user.screen_name)
print('Location: ' + user.location)
print('Friends: ' + str(user.friends_count))
print('Followers: ' + str(user.followers_count))
print('Listed: ' + str(user.listed_count))
You can also start listing the last X number of tweets from your timeline. The following will take the last 10 tweets.
for tweet in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(tweet.text)
An alternative, shown below, returns only the default 20 records, whereas the example above can return any number of tweets.
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)
Step 5 – Get Tweets based on a condition
Tweepy comes with a Search function that allows you to specify some text you want to search for. This can be hashtags, particular phrases, users, etc. The following is an example of searching for a hashtag.
for tweet in tweepy.Cursor(api.search, q="#machinelearning",
                           lang="en", since="2018-05-01").items(10):
    print(tweet.created_at, tweet.text)
You can apply additional search criteria, including restricting to a date range, the number of tweets to return, etc.
Check out the other blog posts in this series of Twitter Analytics using Python.
Over the past few days I’ve been doing a bit more playing around with Python, creating a word cloud. Yes, there are lots of examples out there that show this, but none of them worked for me. This could be because those examples were using an older version of Python, or libraries/packages that no longer exist, among other possible reasons. So I had to piece it together, and the code given below is what I ended up with. Some steps could probably be skipped, but this is what worked for me.
Step 1 – Read in the data
In my example I wanted to create a word cloud for a website, so I picked my own blog for this exercise/example. The following code is used to read the website (a list of all packages used is given at the end).
import nltk
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.oralytics.com/"
html = urlopen(url).read()
print(html)
The last line above, print(html), isn’t needed, but I used it to inspect what HTML was read from the webpage.
Step 2 – Extract just the Text from the webpage
The BeautifulSoup library has some useful functions for processing HTML. There are many alternative ways of doing this processing, but this is the approach that I liked.
The first step is to convert the downloaded HTML into BeautifulSoup format. When you view this converted data you will notice how everything is nicely laid out.
The second step is to remove some of the scripts from the code.
soup = BeautifulSoup(html)
print(soup)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

print(soup)
Step 3 – Extract plain text and remove whitespace
The first line in the following extracts just the plain text, and the remaining lines remove leading and trailing spaces, compact multi-headlines, and drop blank lines.
text = soup.get_text()
print(text)

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
Step 4 – Remove stop words, tokenise and convert to lower case
As the heading says this code removes standard stop words for the English language, removes numbers and punctuation, tokenises the text into individual words, and then converts all words to lower case.
# download and print the stop words for the English language
from nltk.corpus import stopwords
#nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print(stop_words)

# tokenise the data set
from nltk.tokenize import sent_tokenize, word_tokenize
words = word_tokenize(text)
print(words)

# remove punctuation and numbers
wordsFiltered = [word.lower() for word in words if word.isalpha()]
print(wordsFiltered)

# remove stop words from tokenised data set
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(filtered_words)
Step 5 – Create the Word Cloud
Finally, we can create a word cloud based on the finalised data set of tokenised words. Here we use the WordCloud library to create the word cloud and the matplotlib library to display the image.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=1000, margin=10, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate(' '.join(filtered_words))

plt.figure(figsize=(20, 10))
plt.imshow(wc)
plt.axis("off")
plt.show()
#wc.to_file("/wordcloud.png")
We get the following word cloud.
Step 6 – Word Cloud based on frequency counts
Another alternative when using the WordCloud library is to generate a word cloud based on frequency counts. For this you need to build up a table containing two items: the first is the distinct word/token, and the second is the number of times that word/token appears in the text. The following shows this, along with the code to generate the word cloud from the frequency counts.
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
    cnt[word] += 1
print(cnt)

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=1000, margin=10, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20, 10))
plt.imshow(wc)
#plt.axis("off")
plt.show()
Now we get the following word cloud.
When you examine these word clouds you can easily guess what the main contents of my blog are: machine learning, Oracle SQL and coding.
What Python Packages did I use?
Here is the list of Python libraries that I used in the above code. You can use pip3 to install those that are not part of the standard library.
nltk
urllib (standard library)
BeautifulSoup (bs4)
wordcloud
matplotlib
Counter (from the standard collections library)
I almost forgot, but my 4th book has been published!
It is titled ‘Data Science’ and is published by MIT Press as part of their Essential Knowledge series, and is co-written with John Kelleher.
It is available on Amazon in print, Kindle and Audio formats. Go check it out.
This book gives a concise introduction to the emerging field of data science, explaining its evolution, its relation to machine learning, its current uses, data infrastructure issues, and ethical challenges. The goal of data science is to improve decision making through the analysis of data. Today data science determines the ads we see online, the books and movies that are recommended to us, which emails are filtered into our spam folders, and even how much we pay for health insurance.
One of the new features of the Autonomous Data Warehouse Cloud (ADWC) service is Oracle Machine Learning. This is a Zeppelin-based notebook for your machine learning on ADWC. Check out my previous blog post about this.
In order to be able to use this new product and the in-database machine learning in ADWC, you will need your database user to have certain privileges. The first step in this is to create a typical user for accessing the ADWC and grant it the necessary OML privileges.
To do this open the ADWC console and then open the Service Console.
This will then open a new admin page which contains a link for ‘Manage Oracle ML User’. Click on this.
You can then enter the Username, Password and other details for the user, and then click Create.
This will then create a new user that is specific to Oracle Machine Learning. This new user will be granted the DWROLE, which contains the basic schema privileges and the privileges required to run the in-database machine learning algorithms. For those who are familiar with the Oracle Data Mining/Oracle Advanced Analytics option in the Enterprise Edition of the Oracle Database, you will see that these privileges are very similar.
You can examine the privileges granted to this DWROLE in the database as an administrator. When you do you will see the following:
CREATE ANALYTIC VIEW
CREATE ATTRIBUTE DIMENSION
ALTER SESSION
CREATE HIERARCHY
CREATE JOB
CREATE MINING MODEL
CREATE PROCEDURE
CREATE SEQUENCE
CREATE SESSION
CREATE SYNONYM
CREATE TABLE
CREATE TRIGGER
CREATE TYPE
CREATE VIEW
READ, WRITE ON directory DATA_PUMP_DIR
EXECUTE privilege on the PL/SQL package DBMS_CLOUD
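As a sketch of how to do this examination yourself, a query like the following against the DBA_SYS_PRIVS data dictionary view should list the system privileges in the role, assuming you are connected as an administrator (the directory and package grants are object privileges, so they appear in DBA_TAB_PRIVS instead):

```sql
-- List the system privileges granted to DWROLE (run as an admin user)
SELECT privilege
FROM   dba_sys_privs
WHERE  grantee = 'DWROLE'
ORDER  BY privilege;

-- List the object privileges granted to DWROLE (e.g. DATA_PUMP_DIR, DBMS_CLOUD)
SELECT privilege, owner, table_name
FROM   dba_tab_privs
WHERE  grantee = 'DWROLE';
```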