Over the past few days I’ve been doing a bit more playing around with Python, and create a word cloud. Yes there are lots of examples out there that show this, but none of them worked for me. This could be due to those examples using the older version of Python, libraries/packages no long exist, etc. There are lots of possible reasons. So I have to piece it together and the code given below is what I ended up with. Some steps could be skipped but this is what I ended up with.
Step 1 – Read in the data
In my example I wanted to create a word cloud for a website, so I picked my own blog for this exercise/example. The following code is used to read the website (a list of all packages used is given at the end).
import nltk from urllib.request import urlopen from bs4 import BeautifulSoup url = "http://www.oralytics.com/" html = urlopen(url).read() print(html)
The last line above, print(html), isn’t needed, but I used to to inspect what html was read from the webpage.
Step 2 – Extract just the Text from the webpage
The Beautiful soup library has some useful functions for processing html. There are many alternative ways of doing this processing but this is the approached that I liked.
The first step is to convert the downloaded html into BeautifulSoup format. When you view this converted data you will notices how everything is nicely laid out.
The second step is to remove some of the scripts from the code.
soup = BeautifulSoup(html) print(soup) # kill all script and style elements for script in soup(["script", "style"]): script.extract() # rip it out print(soup)
Step 3 – Extract plain text and remove whitespacing
The first line in the following extracts just the plain text and the remaining lines removes leading and trailing spaces, compacts multi-headlines and drops blank lines.
text = soup.get_text() print(text) # break into lines and remove leading and trailing space on each lines = (line.strip() for line in text.splitlines()) # break multi-headlines into a line each chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) # drop blank lines text = '\n'.join(chunk for chunk in chunks if chunk) print(text)
Step 4 – Remove stop words, tokenise and convert to lower case
As the heading says this code removes standard stop words for the English language, removes numbers and punctuation, tokenises the text into individual words, and then converts all words to lower case.
#download and print the stop words for the English language from nltk.corpus import stopwords #nltk.download('stopwords') stop_words = set(stopwords.words('english')) print(stop_words) #tokenise the data set from nltk.tokenize import sent_tokenize, word_tokenize words = word_tokenize(text) print(words) # removes punctuation and numbers wordsFiltered = [word.lower() for word in words if word.isalpha()] print(wordsFiltered) # remove stop words from tokenised data set filtered_words = [word for word in wordsFiltered if word not in stopwords.words('english')] print(filtered_words)
Step 5 – Create the Word Cloud
Finally we can create a word cloud backed on the finalised data set of tokenised words. Here we use the WordCloud library to create the word cloud and then the matplotlib library to display the image.
from wordcloud import WordCloud import matplotlib.pyplot as plt wc = WordCloud(max_words=1000, margin=10, background_color='white', scale=3, relative_scaling = 0.5, width=500, height=400, random_state=1).generate(' '.join(filtered_words)) plt.figure(figsize=(20,10)) plt.imshow(wc) plt.axis("off") plt.show() #wc.to_file("/wordcloud.png")
We get the following word cloud.
Step 6 – Word Cloud based on frequency counts
Another alternative when using the WordCloud library is to generate a WordCloud based on the frequency counts. For this you need to build up a table containing two items. The first item is the distinct token and the second column contains the number of times that word/token appears in the text. The following code shows this code and the code to generate the word cloud based on this frequency count.
from collections import Counter # count frequencies cnt = Counter() for word in filtered_words: cnt[word] += 1 print(cnt) from wordcloud import WordCloud import matplotlib.pyplot as plt wc = WordCloud(max_words=1000, margin=10, background_color='white', scale=3, relative_scaling = 0.5, width=500, height=400, random_state=1).generate_from_frequencies(cnt) plt.figure(figsize=(20,10)) plt.imshow(wc) #plt.axis("off") plt.show()
Now we get the following word cloud.
When you examine these word cloud to can easily guess what the main contents of my blog is about. Machine Learning, Oracle SQL and coding.
What Python Packages did I use?
Here are the list of Python libraries that I used in the above code. You can use PIP3 to install these into your environment.
nltk url open BeautifulSoup wordcloud Counter
If you used other languages, including Oracle PL/SQL, more than likely you will have experienced having to play buffering the number of records that are returned from a cursor. Typically this is needed when you are processing more than a few hundred records. The default buffering size is relatively small and by increasing the size of the number of records to be buffered can dramatically improve the performance of your code.
As with all things in coding and IT, the phrase “It Depends” applies here and changing the buffering size may not be what you need and my not help you to gain optimal performance for your code.
There are lots and lots of examples of how to test this in PL/SQL and other languages, but what I’m going to show you here in this blog post is to change the buffering size when using Python to process data in an Oracle Database using the Oracle Python library cx_Oracle.
Let us begin with taking the defaults and seeing what happens. In this first scenario the default buffering is used. Here we execute a query and the process the records in a FOR loop (yes these is a row-by-row, slow-by-slow approach.
import time i = 0 # define a cursor to use with the connection cur2 = con.cursor() # execute a query returning the results to the cursor print("Starting cursor at", time.ctime()) cur2.execute('select * from sh.customers') print("Finished cursor at", time.ctime()) # for each row returned to the cursor, print the record print("Starting for loop", time.ctime()) t0 = time.time() for row in cur2: i = i+1 if (i%10000) == 0: print(i,"records processed", time.ctime()) t1 = time.time() print("Finished for loop at", time.ctime()) print("Number of records counted = ", i) ttime = t1 - t0 print("in ", ttime, "seconds.")
This gives us the following output.
Starting cursor at 10:11:43 Finished cursor at 10:11:43 Starting for loop 10:11:43 10000 records processed 10:11:49 20000 records processed 10:11:54 30000 records processed 10:11:59 40000 records processed 10:12:05 50000 records processed 10:12:09 Finished for loop at 10:12:11 Number of records counted = 55500 in 28.398550033569336 seconds.
Processing the data this way takes approx. 28 seconds and this corresponds to the buffering of approx 50-75 records at a time. This involves many, many, many round trips to the the database to retrieve this data. This default processing might be fine when our query is only retrieving a small number of records, but as our data set or results set from the query increases so does the time it takes to process the query.
But we have a simple way of reducing the time taken, as the number of records in our results set increases. We can do this by increasing the number of records that are buffered. This can be done by changing the size of the ‘arrysize’ for the cursor definition. This reduces the number of “roundtrips” made to the database, often reducing networks load and reducing the number of context switches on the database server.
The following gives an example of same code with one additional line.
cur2.arraysize = 500
Here is the full code example.
# Test : Change the arraysize and see what impact that has import time i = 0 # define a cursor to use with the connection cur2 = con.cursor() cur2.arraysize = 500 # execute a query returning the results to the cursor print("Starting cursor at", time.ctime()) cur2.execute('select * from sh.customers') print("Finished cursor at", time.ctime()) # for each row returned to the cursor, print the record print("Starting for loop", time.ctime()) t0 = time.time() for row in cur2: i = i+1 if (i%10000) == 0: print(i,"records processed", time.ctime()) t1 = time.time() print("Finished for loop at", time.ctime()) print("Number of records counted = ", i) ttime = t1 - t0 print("in ", ttime, "seconds.")
Now the response time to process all the records is.
Starting cursor at 10:13:02
Finished cursor at 10:13:02
Starting for loop 10:13:02
10000 records processed 10:13:04
20000 records processed 10:13:06
30000 records processed 10:13:08
40000 records processed 10:13:10
50000 records processed 10:13:12
Finished for loop at 10:13:13
Number of records counted = 55500
in 11.780734777450562 seconds.
All done in just under 12 seconds, compared to 28 seconds previously.
Here is another alternative way of processing the data and retrieves the entire results set, using the ‘fetchall’ command, and stores it located in ‘res’.
Is Python the new R?
Maybe, maybe not, but that I’m finding in recent months is more companies are asking me to use Python instead of R for some of my work.
In this blog post I will walk through the steps of setting up the Oracle driver for Python, called cx_Oracle. The documentation for this drive is good and detailed with plenty of examples available on GitHub. Hopefully there isn’t anything new in this post, but it is my experiences and what I did.
1. Install Oracle Client
The Python driver requires Oracle Client software to be installed. Go here, download and install. It’s a straightforward install. Make sure the directories are added to the search path.
2. Download and install cx_Oracle
You can use pip3 to do this.
pip3 install cx_Oracle Collecting cx_Oracle Downloading cx_Oracle-6.1.tar.gz (232kB) 100% |████████████████████████████████| 235kB 679kB/s Building wheels for collected packages: cx-Oracle Running setup.py bdist_wheel for cx-Oracle ... done Stored in directory: /Users/brendan.tierney/Library/Caches/pip/wheels/0d/c4/b5/5a4d976432f3b045c3f019cbf6b5ba202b1cc4a36406c6c453 Successfully built cx-Oracle Installing collected packages: cx-Oracle Successfully installed cx-Oracle-6.1
3. Create a connection in Python
Now we can create a connection. When you see some text enclosed in angled brackets <>, you will need to enter your detailed for your schema and database server.
# import the Oracle Python library import cx_Oracle # define the login details p_username = "" p_password = "" p_host = "" p_service = "" p_port = "1521" # create the connection con = cx_Oracle.connect(user=p_username, password=p_password, dsn=p_host+"/"+p_service+":"+p_port) # an alternative way to create the connection # con = cx_Oracle.connect('/@/:1521') # print some details about the connection and the library print("Database version:", con.version) print("Oracle Python version:", cx_Oracle.version) Database version: 126.96.36.199.0 Oracle Python version: 6.1
4. Query some data and return results to Python
In this example the query returns the list of tables in the schema.
# define a cursor to use with the connection cur = con.cursor() # execute a query returning the results to the cursor cur.execute('select table_name from user_tables') # for each row returned to the cursor, print the record for row in cur: print("Table: ", row) Table: ('DECISION_TREE_MODEL_SETTINGS',) Table: ('INSUR_CUST_LTV_SAMPLE',) Table: ('ODMR_CARS_DATA',)
Now list the Views available in the schema.
# define a second cursor cur2 = con.cursor() # return the list of Views in the schema to the cursor cur2.execute('select view_name from user_views') # display the list of Views for result_name in cur2: print("View: ", result_name) View: ('MINING_DATA_APPLY_V',) View: ('MINING_DATA_BUILD_V',) View: ('MINING_DATA_TEST_V',) View: ('MINING_DATA_TEXT_APPLY_V',) View: ('MINING_DATA_TEXT_BUILD_V',) View: ('MINING_DATA_TEXT_TEST_V',)
5. Query some data and return to a Panda in Python
Pandas are commonly used for storing, structuring and processing data in Python, using a data frame format. The following returns the results from a query and stores the results in a panda.
# in this example the results of a query are loaded into a Panda # load the pandas library import pandas as pd # execute the query and return results into the panda called df df = pd.read_sql_query("SELECT * from INSUR_CUST_LTV_SAMPLE", con) # print the records returned by query and stored in panda print(df.head()) CUSTOMER_ID LAST FIRST STATE REGION SEX PROFESSION \ 0 CU13388 LEIF ARNOLD MI Midwest M PROF-2 1 CU13386 ALVA VERNON OK Midwest M PROF-18 2 CU6607 HECTOR SUMMERS MI Midwest M Veterinarian 3 CU7331 PATRICK GARRETT CA West M PROF-46 4 CU2624 CAITLYN LOVE NY NorthEast F Clerical BUY_INSURANCE AGE HAS_CHILDREN ... MONTHLY_CHECKS_WRITTEN \ 0 No 70 0 ... 0 1 No 24 0 ... 9 2 No 30 1 ... 2 3 No 43 0 ... 4 4 No 27 1 ... 4 MORTGAGE_AMOUNT N_TRANS_ATM N_MORTGAGES N_TRANS_TELLER \ 0 0 3 0 0 1 3000 4 1 1 2 980 4 1 3 3 0 2 0 1 4 5000 4 1 2 CREDIT_CARD_LIMITS N_TRANS_KIOSK N_TRANS_WEB_BANK LTV LTV_BIN 0 2500 1 0 17621.00 MEDIUM 1 2500 1 450 22183.00 HIGH 2 500 1 250 18805.25 MEDIUM 3 800 1 0 22574.75 HIGH 4 3000 2 1500 17217.25 MEDIUM [5 rows x 31 columns]
6. Wrapping it up and closing things
Finally we need to wrap thing up and close our cursors and our connection to the database.
# close the cursors cur2.close() cur.close() # close the connection to the database con.close()
Watch out for more blog posts on using Python with Oracle, Oracle Data Mining and Oracle R Enterprise.