Ireland

#GE2020 Analysing Party Manifestos using Python

The general election is underway here in Ireland, with polling day set for Saturday 8th February. All the politicians are out campaigning, and every day the various parties are looking for publicity on whatever the popular topic is for that day. Each day it is a different topic.

Most of the political parties have not released their manifestos for the #GE2020 election (as of the date of this post). I want to use some simple Python code to perform some analysis of their manifestos. As their new manifestos weren’t available (yet), I went looking for their manifestos from the previous general election. Michael Pidgeon has a website with party manifestos dating back to the early 1970s, and also has some from earlier elections. Check out his website.

I decided to look at manifestos from the 4 main political parties from the 2016 general election. Yes, there are other manifestos available, and you can use the Python code given below to analyse those, with only some minor edits required.

The end result of this simple analysis is a WordCloud showing the most commonly used words in their manifestos. This is a graphical way to see what the main themes and emphases are for each party, and it also allows us to see some commonality between the parties.

Let’s begin with the Python code.

1 – Initial Setup

There are a number of Python libraries available for processing PDF files. Not all of them worked on all of the party manifesto PDFs! It depends on how the files were generated. In my case I used the pdfminer library, as it worked with all four manifestos. The common library PyPDF2 didn’t work with the Fine Gael manifesto document.

import io
import pdfminer
from pprint import pprint
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

#directory where manifestos are located
wkDir = '.../General_Election_Ire/'

#define the names of the Manifesto PDF files & setup party flag
pdfFile = wkDir+'FGManifesto16_2.pdf'
party = 'FG'
#pdfFile = wkDir+'Fianna_Fail_GE_2016.pdf'
#party = 'FF'
#pdfFile = wkDir+'Labour_GE_2016.pdf'
#party = 'LB'
#pdfFile = wkDir+'Sinn_Fein_GE_2016.pdf'
#party = 'SF'

All of the following code will run for a given manifesto. Just comment in or out the manifesto you are interested in. The WordClouds for each are given below.
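
As an alternative to commenting lines in and out, the four file/party pairs could be kept in a dictionary and looked up by party flag. This is only a sketch: `MANIFESTOS` and `manifesto_path` are names I made up for illustration, and the directory is the same placeholder path used above.

```python
import os

# Hypothetical lookup table: party flag -> manifesto file name
# (file names copied from the setup code above)
MANIFESTOS = {
    'FG': 'FGManifesto16_2.pdf',
    'FF': 'Fianna_Fail_GE_2016.pdf',
    'LB': 'Labour_GE_2016.pdf',
    'SF': 'Sinn_Fein_GE_2016.pdf',
}

def manifesto_path(work_dir, party):
    """Return the full path of the manifesto PDF for a party flag."""
    return os.path.join(work_dir, MANIFESTOS[party])

# placeholder directory, as in the setup code
wkDir = '.../General_Election_Ire/'
for party in MANIFESTOS:
    print(party, manifesto_path(wkDir, party))
```

The rest of the code could then be wrapped in a loop over the party flags instead of being re-run by hand for each manifesto.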

2 – Load the PDF File into Python

The following code loops through each page in the PDF file and extracts the text from that page.

I added some additional code to ignore pages written in Irish. The Sinn Fein manifesto contained a number of pages which were the Irish equivalent of the preceding pages in English. I didn’t want to have a mixture of languages in the final output.

SF_IrishPages = [14,15,16,17,18,19,20,21,22,23,24]
text = ""

pageCounter = 0
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

for page in PDFPage.get_pages(open(pdfFile,'rb'), caching=True, check_extractable=True):
    if (party == 'SF') and (pageCounter in SF_IrishPages):
        print(party+' - Not extracting page - Irish page', pageCounter)
    else:
        print(party+' - Extracting Page text', pageCounter)
        page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    pageCounter += 1

print('Finished processing PDF document')
converter.close()
fake_file_handle.close()
FG - Extracting Page text 0
FG - Extracting Page text 1
FG - Extracting Page text 2
FG - Extracting Page text 3
FG - Extracting Page text 4
FG - Extracting Page text 5
...
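
The page-skipping part of the loop does not depend on pdfminer, so it can be tried out on its own. The sketch below uses a hypothetical helper, `pages_to_process`, and plain strings as stand-ins for PDF pages. As in the loop above, page numbers are zero-based.

```python
def pages_to_process(pages, skip_pages=()):
    """Yield (page_number, page) pairs, leaving out the zero-based
    page numbers listed in skip_pages."""
    skip = set(skip_pages)
    for page_number, page in enumerate(pages):
        if page_number not in skip:
            yield page_number, page

# Stand-in data: strings instead of real PDF page objects
demo_pages = ['cover', 'contents', 'page in Irish', 'housing']
kept = list(pages_to_process(demo_pages, skip_pages=[2]))
print(kept)
```

With the real manifestos, the iterable passed in would be the pages returned by `PDFPage.get_pages`, and `skip_pages` would be `SF_IrishPages` for Sinn Fein and empty for the other parties.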

3 – Tokenize the Words

The next step is to Tokenize the text. This breaks the text into individual words.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = []

tokens = word_tokenize(text)

print('Number of Pages =', pageCounter)
print('Number of Tokens =',len(tokens))
Number of Pages = 140
Number of Tokens = 66975
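
To get a feel for what tokenizing does without installing NLTK, here is a rough stand-in built on a regular expression. It is not equivalent to `word_tokenize` (which handles contractions, abbreviations and so on), and both `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# made-up sample sentence, not taken from a manifesto
sample = "Keep the recovery going: 3 steps, 1 plan."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
print('Number of Tokens =', len(sample_tokens))
```

Notice that punctuation marks and numbers come out as tokens of their own, which is why the next step filters them out.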

4 – Filter words, Remove Numbers & Punctuation

There will be a lot of things in the text that we don’t want included in the analysis. We want the text to only contain words. The following extracts the words and ignores numbers, punctuation, etc.

#keeps only alphabetic tokens (drops punctuation and numbers) and converts them to lower case
wordsFiltered = [token.lower() for token in tokens if token.isalpha()]
print(len(wordsFiltered))
print(wordsFiltered)
58198
['fine', 'gael', 'general', 'election', 'manifesto', 's', 'keep', 'the', 'recovery', 'going', 'gaelgeneral', 'election', 'manifesto', 'foreward', 'from', 'an', 'taoiseach', 'the', 'long', 'term', 'economic', 'three', 'steps', 'to', 'keep', 'the', 'recovery', 'going', 'agriculture', 'and', 'food', 'generational',
...

As you can see the number of tokens has reduced from 66,975 to 58,198.
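
It helps to know exactly what `isalpha()` keeps and drops. The small sketch below uses made-up tokens: anything containing a digit, hyphen, apostrophe or punctuation mark is dropped, while stray single letters such as 'ł' and fused words such as 'gaelgeneral' pass the test. That is why some odd-looking entries have to be added to the stop word list in the next step.

```python
# made-up tokens, chosen to show what isalpha() keeps and drops
sample_tokens = ['Recovery', '2016', ',', 'long-term', "n't", 'ł', 'gaelgeneral']

# same filter as above: alphabetic tokens only, lower-cased
kept = [token.lower() for token in sample_tokens if token.isalpha()]
dropped = [token for token in sample_tokens if not token.isalpha()]

print(kept)
print(dropped)
```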

5 – Setup Stop Words

Stop words are common words in a language that carry little meaning on their own, so they can be removed from the data set. Python NLTK comes with a set of stop words defined for a number of languages.

#We initialize the stopwords variable which is a list of words like 
#"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
....

Additional stop words can be added to this list. I added the words listed below. Some of these you might expect to be in the stop word list; others are there to remove certain words that appeared in the various manifestos but don’t have a lot of meaning. I also added the names of the parties and some Irish words to the stop words list.

#some extra stop words are needed after examining the data and word cloud
#these are added
extra_stop_words = ['ireland','irish','ł','need', 'also', 'set', 'within', 'use', 'order', 'would', 'year', 'per', 'time', 'place', 'must', 'years', 'much', 'take','make','making','manifesto','ð','u','part','needs','next','keep','election', 'fine','gael', 'gaelgeneral', 'fianna', 'fáil','fail','labour', 'sinn', 'fein','féin','atá','go','le','ar','agus','na','ár','ag','haghaidh','téarnamh','bplean','page','two','number','cothromfor']
stop_words.extend(extra_stop_words)
print(stop_words)

Now remove these stop words from the list of tokens.

# remove stop words from tokenised data set
filtered_words = [word for word in wordsFiltered if word not in stop_words]
print(len(filtered_words))
print(filtered_words)
31038
['general', 'recovery', 'going', 'foreward', 'taoiseach', 'long', 'term', 'economic', 'three', 'steps', 'recovery', 'going', 'agriculture', 'food',

The number of tokens is reduced to 31,038.
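
A small design note: checking `word not in stop_words` against a list scans the list for every word. With tens of thousands of tokens it can be worth converting the stop words to a set first. The sketch below shows the idea on made-up data, and `remove_stop_words` is a hypothetical helper name.

```python
def remove_stop_words(words, stop_words):
    """Return the words that are not stop words, preserving their order."""
    stop_set = set(stop_words)   # set membership tests are fast
    return [word for word in words if word not in stop_set]

# made-up data for illustration
demo_words = ['keep', 'the', 'recovery', 'going', 'fine', 'gael', 'plan']
demo_stops = ['the', 'keep', 'fine', 'gael']
demo_filtered = remove_stop_words(demo_words, demo_stops)
print(demo_filtered)
```

The result is the same as with the list version; only the lookup is quicker.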

6 – Word Frequency Counts

Now calculate how frequently these words occur in the list of tokens.

#get the frequency of each word
from collections import Counter

# count frequencies
cnt = Counter()
for word in filtered_words:
    cnt[word] += 1

print(cnt)
Counter({'new': 340, 'support': 249, 'work': 190, 'public': 186, 'government': 177, 'ensure': 177, 'plan': 176, 'continue': 168, 'local': 150, 
...
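
`Counter` can also build the counts directly from a list, and its `most_common()` method returns the top entries in order. A minimal sketch on made-up words:

```python
from collections import Counter

# made-up word list for illustration
demo_words = ['new', 'support', 'new', 'work', 'support', 'new']

# passing the list to Counter counts every word in one step
demo_cnt = Counter(demo_words)

# the two most frequent words with their counts
top_two = demo_cnt.most_common(2)
print(top_two)
```

Calling `most_common(20)` on the manifesto counts is a quick way to check the top words before drawing the WordCloud.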

7 – WordCloud

We can use the word frequency counts to add emphasis to the WordCloud. The more frequently it occurs the larger it will appear in the WordCloud.

#create a word cloud using frequencies for emphasis 
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(max_words=100, margin=9, background_color='white',
               scale=3, relative_scaling=0.5, width=500, height=400,
               random_state=1).generate_from_frequencies(cnt)

plt.figure(figsize=(20,10))
plt.imshow(wc)
#plt.axis("off")
plt.show()

#Save the image in the directory where the manifestos are located
wc.to_file(wkDir+party+"_2016.png")

The last line of code saves the WordCloud image as a file in the directory where the manifestos are located.

8 – WordClouds for Each Party

Screenshot 2020-01-21 11.10.25

Remember these WordClouds are for the manifestos from the 2016 general election.

When the parties have released their manifestos for the 2020 general election, I’ll run them through this code and produce the WordClouds for 2020. It will be interesting to see the differences between the 2016 and 2020 manifesto WordClouds.
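
One possible way to compare two manifestos (two parties, or the same party in 2016 and 2020) is to look at which top words they share and which are unique to each. This is only a sketch with made-up counts; `top_words` and `compare_top_words` are hypothetical helpers, and the default of 100 simply mirrors the `max_words=100` used for the WordCloud.

```python
from collections import Counter

def top_words(cnt, n):
    """Return the set of the n most frequent words in a Counter."""
    return {word for word, _ in cnt.most_common(n)}

def compare_top_words(cnt_a, cnt_b, n=100):
    """Split the top-n words of two Counters into shared and unique sets."""
    a, b = top_words(cnt_a, n), top_words(cnt_b, n)
    return {'shared': a & b, 'only_a': a - b, 'only_b': b - a}

# made-up counts, not real manifesto results
counts_a = Counter({'new': 5, 'support': 4, 'recovery': 3})
counts_b = Counter({'new': 6, 'housing': 5, 'support': 2})
result = compare_top_words(counts_a, counts_b, n=3)
print(result)
```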

OUG Ireland Meetup 11th May

The next OUG Ireland Meetup is happening on 11th May, in the Bank of Ireland, Grand Canal Dock. This is a free event and is open to everyone. You don’t have to be a member to attend.

Following on from a very successful 2 day OUG Ireland Conference with over 250 attendees, we have organised our next meetup. This was mentioned during the opening session of the conference.

NewImage

We typically have 2 presentations at each Meetup and on 11th May we have:

1. Oracle Analytics Cloud Service.

Oracle Analytics Cloud Service was only released a few weeks ago, and we have some local people who have been working with the beta and early adopter releases. They will be giving us some insights on this new product and how it compares with other analytics products like Oracle Data Visualization and OBIEE.

2. Running Oracle DataGuard on RAC on Oracle 12c

The second presentation will be on using Oracle DataGuard on RAC on Oracle 12c. We have a very experienced DBA talking about his experiences of using these products, how to work around some key bugs, and situations to be aware of for administration purposes. Lots of valuable information to be gained.

Check out the full agenda and register for the Meetup by clicking on this link or on the Meetup image above.

There will be some food and refreshments available for you to enjoy.

The Meetup will be in Bank of Ireland, Grand Canal Dock. This venue is a very popular location for Meetups in Dublin.

NewImage

OUG Ireland 2016: APPs Track highlights

Today I was joined by Debra Lilley, who is the APPs track lead and conference chair for OUG Ireland. Debra lets us know what we can look forward to on the APPs track at this year’s conference.

Check out Debra’s video.

Click on the image below to get more details of the agenda and to register for this 2 day conference.

NewImage

Follow the conference and OUG Ireland on twitter using #oug_ire

OUG Ireland 2015 is next week

The annual OUG Ireland conference is on next week on Thursday 19th March.

If you haven’t already signed up for the conference, there are only a few days left to do so. Click here to go to the registration page.

Also don’t forget to sign up for Maria Colgan’s one day seminar on the Oracle 12c In-Memory Option.

As always there is a very full agenda with 7 streams, 47 presentations and several keynote presentations.

I’ll be holding a draw for a copy of my book and I’ll be giving away a few Oracle Press goodies too. Check out this blog post for the details and rules of the book draw.

The following are the presentations I’m planning on attending (so you know where to find me)

Time Presenter Topic
09:10-09:30 Debra Lilley OUG Ireland Welcome, Introduction and Opening
09:30-10:10 Jon Paul (Oracle) Opening Keynote by Jon Paul from Oracle
10:15-11:00 Oracle Presentation Oracle Big Data Strategy

11:00-11:25 Exhibition Hall
11:25-12:10 Antony Heljula Real Business Value Using Predictive BI

(I’ve seen this before but I worked with Antony on some of what he will be talking about)

12:15-13:00 Roel Hartman & Brendan Tierney What Are They Thinking? With Oracle Application Express & Oracle Data Mining.

(we gave this presentation at Oracle Open World back in September 2014)

12:15-13:00 Gurcan Orhan How to handle Dev, Test & Prod with ODI
13:00-14:00 Lunch

(and then freaking out before I give my second presentation)

14:00-14:45 Brendan Tierney Predictive Queries in Oracle 12c Database

(I suppose I have to turn up to my own presentation)

14:50-15:35 Roel Hartman Hidden APEX 5 Gems Revealed

(APEX 5 is due out any day now)

15:35-16:00 Exhibition Hall & Coffee

(and then freaking out before I give my third presentation)

16:00-16:45 Brendan Tierney Running R in your Oracle Database using Oracle R Enterprise

(This presentation generally runs for 50 minutes)

16:50-17:35 Maria Colgan BI, Dev & Tech Closing Keynote: Oracle Database In-Memory-The next big thing
17:35-18:35 Event Social i.e. free drink 🙂

As you can see it is going to be a busy, busy day.

I would love to attend lots of other presentations, but being in multiple places at the same time is not one of my abilities.

NOTE: The User Group has a rule that a presenter can have a maximum of 2 presentations. Unfortunately we had to break this rule a week out from the conference, due to some cancellations. That is why I’ve ended up with 2.5 presentations.

Book give away at OUG Ireland

The annual Oracle User Group in Ireland conference is on the 19th March in Croke Park.

I’ll be giving 2 presentations, with one each on the Development and Business Analytics tracks. Here are the details of these presentations.

Time Room Presentation Title / Topic
14:00-14:45 InterConnect 681 Predictive Queries in Oracle 12c
16:00-16:45 Davin Suite Running R in the Database using Oracle R Enterprise

I will be giving away a copy of my book to one lucky person 🙂

How will this book give away work?

During both of my presentations I will pass around a “hat” for you to put your name or business card into. Then at the end of my last presentation we will draw one name out of the hat.

But you have to be in the room to collect the book. If you are not there then I will draw out another name (and so on) until the winner is in the room.

So by attending both of my presentations you are doubling your chances of winning my book.

(Maybe this is an attempt by me to have a good attendance at my last presentation)

Book Cover

Plus I might have a few other Oracle Press goodies to give away too.

Oracle ACEs at OUG Ireland 2015

The annual Oracle User Group in Ireland Conference will be on Thursday 19th March. This year the conference will be held in the Croke Park conference centre. This conference centre is only a short taxi ride from Dublin Airport and Dublin City Centre.

If you are planning a hotel stay for the conference, I would recommend staying in a hotel in the city centre and getting a taxi to/from the conference venue.

We have a large number of Oracle ACEs presenting at the conference. The following table lists the ACEs, their twitter handle and their website.

Oracle ACE Type of ACE Twitter Name Blog / Web Site
Brendan Tierney ACE Director @brendantierney http://www.oralytics.com
Debra Lilley ACE Director @debralilley http://www.debrasoracle.blogspot.ie/
Jonathan Lewis ACE Director @JLOracle http://jonathanlewis.wordpress.com/
Tim Hall ACE Director @oraclebase http://oracle-base.com/
Alex Nuijten ACE Director @alexnuijten http://nuijten.blogspot.com/
Dhananjay Papde ACE Associate
Stewart Bryson ACE Director @stewartbryson http://www.redpillanalytics.com/
Antony Heljula ACE @aheljula http://www.peakindicators.com/
Gurcan Orhan ACE Director @gurcan_orhan https://gurcanorhan.wordpress.com/
Heli Helskyaho ACE Director @HeliFromFinland https://helifromfinland.wordpress.com/
Marco Gralike ACE Director @mgralike http://www.xmldb.nl
Roel Hartman ACE Director @roelh http://roelhartman.blogspot.com/
Martin Widlake ACE @mdwidlake https://mwidlake.wordpress.com/
Liron Amitzi ACE @amitzil http://www.dbaces.com/
David Kurtz ACE Director @davidmkurtz http://www.go-faster.co.uk/
Marcin Przepiorowski ACE @pioro http://oracleprof.blogspot.com/

Make sure you check out the full agenda for the conference by clicking on the following image. Plus there is a full day session on Friday 20th March with Maria Colgan on the Oracle In-Memory option.

Ougire15 hp cfp v2

OUG Ireland 2015 : Now open for Submissions

OUG Ireland Call for submissions is now open.

The closing date for submissions is 5th January, 2015, and the submission webpage can be found here.

Ougire15 hp cfp v2

The OUG Ireland conference will be on Thursday 19th March. Yes, it is only a one day conference 😦 but we will have 5 or 6 or more streams. So there will be something for everyone and plenty of choice.

On Friday 20th March we will have Maria Colgan, formerly the Optimizer Lady and now the In-Memory Queen (or something like that), giving a full day workshop on the In-Memory option and the Optimizer. She will also be about for the main conference on the 19th, so you can expect a presentation or two from her on the Thursday.

Agenda selection day is the 8th January, 2015. So hopefully you will be getting the acceptance emails soon after that or during the week of 12th January.

There is a committee of about 10 people who are involved in selecting presentations and setting the agenda. If it was up to me then I would accept everything/everyone. So if your presentation is not accepted this time, please don’t blame me 🙂 I said YES to your presentation, I really, really did. I fought so hard to have your presentation included. If your presentation is not accepted then the blame is down to the other committee members 🙂

The conference will be held in Croke Park, and is a 15-20 minute taxi ride from the Airport.

You can follow the Conference and other OUG Ireland events using the twitter tag #oug_ire