

Text Analytics and Visualization

For this post, I want to describe a text analytics and visualization technique using a basic keyword extraction mechanism, nothing but a word counter, to find the top 3 keywords from a corpus of articles that I’ve created from my blog at http://ericbrown.com. To create this corpus, I downloaded all of my blog posts (~1400 of them) and grabbed the text of each post. I then tokenized each post using nltk and various stemming / lemmatization techniques, counted the keywords, and took the top 3 for each post. Finally, I aggregated the keywords from all posts to create a visualization using Gephi.

I’ve uploaded a jupyter notebook with the full code for you to replicate this work. You can also get a subset of my blog articles in a csv file here. You’ll need beautifulsoup and nltk installed. You can install them with:
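
A typical pip invocation looks like this:

```bash
pip install beautifulsoup4 nltk
```

You’ll also need the nltk data for stop words and WordNet; running nltk.download('stopwords') and nltk.download('wordnet') once will fetch those.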

To get started, let’s load our libraries:
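
The notebook’s exact import list isn’t reproduced here, but a minimal set that supports everything below looks like this:

```python
import warnings
warnings.filterwarnings('ignore')  # silence the BeautifulSoup warning mentioned below

import pandas as pd
from collections import Counter
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
```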

I’m loading warnings here because there’s a warning about BeautifulSoup that we can ignore.

Now, let’s set up some things we’ll need for this work.

First, let’s set up our stop words, stemmers and lemmatizers.
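
A sketch of that setup, assuming English stop words, a Snowball stemmer, and the WordNet lemmatizer:

```python
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
```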

Now, let’s set up some functions we’ll need.

The tokenizer function is taken from here. If you want to see some cool topic modeling, jump over and read How to mine newsfeed data and extract interactive insights in Python…it’s a really good article that gets into topic modeling and clustering, which is something I’ll hit on here in a future post as well.
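
The original function isn’t reproduced verbatim; a close approximation that lowercases, drops stop words and very short tokens, and stems the rest looks like this:

```python
def tokenizer(text):
    # split into alphabetic tokens, lowercase, drop stop words
    # and tokens under 3 characters, then stem what's left
    tokens = RegexpTokenizer(r'[a-zA-Z]+').tokenize(text.lower())
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return [stemmer.stem(t) for t in tokens]
```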

Next, I had some HTML in my articles, so I wanted to strip it from my text before doing anything else with it…here’s a class to do that using bs4. I found this code on Stack Overflow.
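
The Stack Overflow snippet isn’t shown here; with bs4 the same job can be done in a few lines (the class name is mine):

```python
class MLStripper:
    """Strip HTML tags from a string using BeautifulSoup."""
    def strip_tags(self, html):
        return BeautifulSoup(html, 'html.parser').get_text(separator=' ')
```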

OK – now to the fun stuff. To get our keywords, we need only 2 lines of code. This function does a count and returns that count of keywords for us.
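
With the tokens in hand, a Counter does all the work; something like:

```python
def get_keywords(tokens, num=3):
    # count token frequencies and return the `num` most common
    return Counter(tokens).most_common(num)
```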

Finally, I created a function to take a pandas dataframe filled with urls/pubdate/author/text and create my keywords from that. This function iterates over a pandas dataframe (each row is an article from my blog), tokenizes the ‘text’ from each row, and returns a pandas dataframe with the keywords, the title of the article, and the publication date of the article.
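
A sketch of that function, assuming the column names ‘text’, ‘title’ and ‘pubdate’ (and the helpers above):

```python
def build_article_df(df):
    articles = []
    for _, row in df.iterrows():
        try:
            top_kws = get_keywords(tokenizer(row['text']))
            # store the top keywords as a single comma-separated string
            kw_list = ','.join(kw for kw, count in top_kws)
            articles.append((kw_list, row['title'], row['pubdate']))
        except Exception:
            continue  # skip rows with empty or malformed text
    return pd.DataFrame(articles, columns=['keywords', 'title', 'pubdate'])
```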

Time to load the data and start analyzing. This bit of code loads in my blog articles (found here), grabs only the interesting columns from the data, renames them, and prepares them for tokenization. Most of this can be done in one line when reading in the csv file, but I already had this written for another project and just used it as is.
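
A sketch, with hypothetical column names standing in for whatever the csv actually uses:

```python
df = pd.read_csv('articles.csv')

# grab just the columns we care about and give them friendly names
data = []
for _, row in df.iterrows():
    data.append((row['title'], row['pubdate'], row['url'], row['text']))
data_df = pd.DataFrame(data, columns=['title', 'pubdate', 'url', 'text'])
```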

Taking the tail() of the dataframe gets us:

tail of article dataframe

Now, we can tokenize and do our word-count by calling our build_article_df function.
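
Using the names from the sketches above:

```python
article_df = build_article_df(data_df)
```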

This gives us a new dataframe with the top 3 keywords for each article (along with the pubdate and title of the article).

top 3 keywords per article

This is quite cool by itself. We’ve generated keywords for each article automatically using a simple counter. It’s not terribly sophisticated, but it works and works well. There are many other ways to do this, but for now we’ll stick with this one. Beyond just having the keywords, it might be interesting to see how these keywords are ‘connected’ with each other and with other keywords. For example, how many times does ‘data’ show up in other articles?

There are multiple ways to answer this question, but one way is to visualize the keywords in a topology / network map to see the connections between them. To do that, we need to do a ‘count’ of our keywords and then build a co-occurrence matrix. This matrix is what we can then import into Gephi to visualize. We could draw the network map using networkx, but it tends to be tough to get something useful out of it without a lot of work…using Gephi is much more user-friendly.

We have our keywords and need a co-occurrence matrix. To get there, we need to take a few steps to break our keywords out individually.
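
One way to do that, given the article_df sketched above, is to pair each individual keyword with the full keyword string of the article it came from:

```python
keywords_array = []
for _, row in article_df.iterrows():
    for kw in row['keywords'].split(','):
        keywords_array.append((kw.strip(), row['keywords']))
kw_df = pd.DataFrame(keywords_array, columns=['keyword', 'keywords'])
```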

We now have a keyword dataframe kw_df that holds two columns: keyword (a single keyword) and keywords (the full keyword string for the article that keyword came from).

keyword dataframe

This doesn’t really make a lot of sense yet, but we need both columns to build a co-occurrence matrix. We do this by iterating over each document’s keyword list (the keywords column) and checking whether each keyword is included. If so, we add it to our occurrence matrix and then build our co-occurrence matrix from that.
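
The notebook does this with explicit loops; an equivalent pandas sketch builds a document-by-keyword indicator matrix and multiplies it by its transpose:

```python
keywords = sorted(set(kw_df['keyword']))
docs = [set(k.strip() for k in kws.split(',')) for kws in article_df['keywords']]

# occurrence matrix: one row per article, one column per keyword
occurrence = pd.DataFrame(
    [[int(kw in doc) for kw in keywords] for doc in docs],
    columns=keywords,
)

# co-occurrence: how often each pair of keywords shares an article
co_occur = occurrence.T.dot(occurrence)
for kw in keywords:
    co_occur.loc[kw, kw] = 0  # zero out self-co-occurrence
```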

Now we have a co-occurrence matrix in the co_occur dataframe, which can be imported into Gephi to view a map of nodes and edges. Save the co_occur dataframe as a CSV file for use in Gephi (you can download a copy of the matrix here).
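
Saving it is one line (the filename is arbitrary):

```python
co_occur.to_csv('keywords_co_occur.csv')
```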

Over to Gephi

Now it’s time to play around in Gephi. I’m a novice with the tool, so I can’t really give you much in the way of a tutorial, but I can tell you the steps you need to take to build a network map. First, import your co-occurrence matrix csv file using File -> Import Spreadsheet and leave everything at the default. Then, in the ‘overview’ tab, you should see a bunch of nodes and connections like the image below.

Network map of a subset of ericbrown.com articles

Next, move down to the ‘layout’ section, select the Fruchterman Reingold layout, and push ‘run’ to get the map to redraw. Once the nodes settle down on the screen, press ‘stop’. You should see something like the image below.

Network map of a subset of ericbrown.com articles, redrawn with the Fruchterman Reingold layout

Cool, huh? Now…let’s get some color into this graph. In the ‘appearance’ section, select ‘nodes’ and then ‘ranking’. Select ‘Degree’ and hit ‘apply’. You should see the network graph change and now have some color associated with it. You can play around with the colors if you want, but the default color scheme should look something like the following:

colored Network map of a subset of ericbrown.com articles

Still not quite interesting, though. Where are the text/keywords? Well…you need to switch over to the ‘preview’ tab to see that. You should see something like the following (after selecting ‘Default Curved’ in the drop-down):

colored Network map of a subset of ericbrown.com articles

Now that’s pretty cool. You can see two very distinct areas of interest here, ‘data’ and ‘canon’…which makes sense since I write a lot about data and share a lot of my photography (taken with a Canon camera).

Here’s a full map of all ~1400 of my articles, if you are interested. Again, there are two main clusters around photography and data, but there’s also another large cluster around ‘business’, ‘people’ and ‘cio’, which fits with what most of my writing has been about over the years.

Full map of ericbrown.com keyword matrix

There are a number of other ways to visualize text analytics.  I’m planning a few additional posts to talk about some of the more interesting approaches that I’ve used and run across recently. Stay tuned.


If you want to learn more about text analytics, check out these books:

Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data 

Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit

Text Mining with R



Text Analytics with Python – A book review

This is a book review of Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data by Dipanjan Sarkar.

One of my go-to books for natural language processing with Python has been Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper. It has been the book for me and was one of my dissertation references. I used it so much that I had to buy a second copy because I wore the first one out. I’ve read many other NLP books but haven’t found any that could match this book – till now.

Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data by Dipanjan Sarkar is a fantastic book and has now taken a permanent place on my bookshelf.

Unlike many books that I run across, this book spends plenty of time talking about the theory behind things rather than just doing some hand-waving and then showing some code. In fact, there isn’t any code (that I saw) until page 41. That’s impressive these days. Here’s a quick overview of the book’s layout:

  • Chapter 1 provides the baseline for Natural Language Processing. This is a very good overview for anyone who’s never worked much with NLP.
  • Chapter 2 is a python ‘refresher’. If you don’t know python at all but know some other language, this should get you started enough to use the rest of the book.
  • Chapters 3 – 7 are where the real fun begins. These chapters cover Text Classification, Summarization, Similarity / Clustering, and Semantic / Sentiment Analysis.

If you have some familiarity with python and NLP, you can jump to Chapter 3 and dive into the details.

What I really like about this book is that it places theory first. I’m a big fan of ‘learning by doing’, but I think that before you can ‘do’ you need to know ‘why’ you are doing what you are doing. The code in the book is really well done as well, and uses the NLTK, sklearn and gensim libraries for most of the work. Additionally, there are multiple ‘build your own’ sections where the author provides a very good overview (and walk-through) of what it takes to build your own functionality for your own NLP work.

This book is highly recommended.


Links in this post:

Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper.

Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data by Dipanjan Sarkar


Python and AWS Lambda – A match made in heaven

In recent months, I’ve begun moving some of my analytics functions to the cloud. Specifically, I’ve been moving many of my python scripts and APIs to AWS’ Lambda platform using the Zappa framework. In this post, I’ll share some basic information about Python and AWS Lambda…hopefully it will get everyone out there thinking about new ways to use platforms like Lambda.

Before we dive into an example of what I’m moving to Lambda, let’s spend some time talking about Lambda itself. When I first heard about it, I was confused…but once I ‘got’ it, I saw the value. Here’s the description of Lambda from AWS’ website:

AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume – there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service – all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app.

Once I realized how easy it is to move code to Lambda to use whenever/wherever I needed it, I jumped at the opportunity. But…it took a while to get a good workflow in place to simplify deploying to Lambda. Then I stumbled across Zappa and couldn’t be happier…it makes deploying to Lambda simple (very simple).

OK.  So. Why would you want to move your code to Lambda?

Lots of reasons. Here’s a few:

  • Rather than host your own server to handle some API endpoints — move to Lambda
  • Rather than build out a complex development environment to support your complex system, move some of that complexity to Lambda and make a call to an API endpoint.
  • If you travel and want to downsize your travel laptop but still need to access your python data analytics stack, move the stack to Lambda.
  • If you have a script that you run very irregularly and don’t want to pay $5 a month at Digital Ocean — move it to Lambda.

There are many other more sophisticated reasons of course, but these’ll do for now.

Let’s get started looking at python and AWS Lambda.  You’ll need an AWS account for this.

First – I’m going to talk a bit about building an API endpoint using Flask. You don’t have to use Flask, but it’s an easy framework to use and you can quickly build an API endpoint with it with very little fuss. In this example, I’m going to use Lambda to host an API endpoint that uses the Newspaper library to scrape a website, pull down the text, and return that text to my local script.

Writing your first Flask + Lambda API

To get started, install Flask, Flask-RESTful and Zappa. You’ll want to do this in a fresh environment using virtualenv (see my previous posts about virtualenv and vagrant) because we’ll be moving this up to Lambda using Zappa.
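
For example:

```bash
pip install flask flask-restful zappa
```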

Our Flask-driven API is going to be extremely simple and exist in less than 20 lines of code:
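
The original api.py isn’t reproduced here; a minimal Flask + Flask-RESTful sketch that matches the /hello endpoint used later in this post:

```python
from flask import Flask
from flask_restful import Resource, Api

app = Flask(__name__)
api = Api(app)

class Hello(Resource):
    def get(self):
        return {'response': 'Hello World'}

api.add_resource(Hello, '/hello')

if __name__ == '__main__':
    # host/port here are only for running locally with vagrant (see note below)
    app.run(host='0.0.0.0', port=5001)
```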

Note: The ‘host = 0.0.0.0’ and ‘port=5001’ are extraneous; they’re just how I use Flask with vagrant. If you keep them in and run the app locally, you’d visit http://0.0.0.0:5001 to view it.

The last thing you need to do is build your requirements.txt file for Zappa to use when building your application files to send to Lambda. For a quick/dirty requirements file, I used the following:
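
The exact file isn’t shown; something minimal like this works (version pins omitted):

```text
flask
flask-restful
zappa
```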

Now…let’s get this up to Lambda. With Zappa, it’s as easy as a couple of command line instructions.

First, run the init command from the command line in your virtualenv:
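
```bash
zappa init
```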

You should see something similar to this:

zappa init screenshot

You’ll be asked a few questions; you can hit ‘enter’ to take the defaults or enter your own. For this example, I used ‘dev’ for the environment name (you can set up multiple environments for dev, staging, production, etc.) and made an S3 bucket for use with this application.

Zappa should realize you are working with a Flask app and automatically set things up for you. It will ask you what the name of your Flask app’s main function is (in this case it is api.app). Lastly, Zappa will ask if you want to deploy to all AWS regions…I chose not to for this example. Once complete, you’ll have a zappa_settings.json file in your directory that will look something like the following:
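
The exact contents vary by Zappa version; mine looked roughly like this, with a placeholder bucket name:

```json
{
    "dev": {
        "app_function": "api.app",
        "profile_name": "default",
        "s3_bucket": "my-zappa-bucket"
    }
}
```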

I’ve found that I need to add more information to this json file before I can successfully deploy. For some reason, Zappa doesn’t add the “region” to the settings file. I also like to add the “runtime” as well. Edit your json file to read (feel free to use whatever region you want):
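
```json
{
    "dev": {
        "app_function": "api.app",
        "profile_name": "default",
        "s3_bucket": "my-zappa-bucket",
        "aws_region": "us-east-1",
        "runtime": "python3.6"
    }
}
```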

Now…you are ready to deploy. You can do that with the following command:
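
```bash
zappa deploy dev
```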

Zappa will set up all the necessary configurations and systems on AWS AND zip up your libraries and code and push it to Lambda.   I’ve not found another framework as easy to use as Zappa when it comes to deploying…if you know of one feel free to leave a comment.

After a minute or two, you should see a “Deployment Complete: …” message that includes the endpoint for your new API. In this case, Zappa built the following endpoint for me:

If you make some changes to your code and need to update Lambda, Zappa makes it easy to do that with the following command:
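
```bash
zappa update dev
```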

Additionally, if you want to add a ‘production’ lambda environment, all you need to do is add that new environment to your settings json file and deploy it. For this example, our settings file would change to:
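
```json
{
    "dev": {
        "app_function": "api.app",
        "profile_name": "default",
        "s3_bucket": "my-zappa-bucket",
        "aws_region": "us-east-1",
        "runtime": "python3.6"
    },
    "production": {
        "app_function": "api.app",
        "profile_name": "default",
        "s3_bucket": "my-zappa-bucket",
        "aws_region": "us-east-1",
        "runtime": "python3.6"
    }
}
```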

Next, do a zappa deploy production and your production environment is ready to go at a new endpoint.

Interfacing with the API

Our code is pushed to Lambda and ready to start accepting requests. In this example’s case, all we are doing is returning ‘hello world’, but you can see the power in this for other functionality. To check out the results, just open a browser, enter your Zappa deployment URL, and append /hello to the end of it, like this:
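
```text
https://<your-api-id>.execute-api.us-east-1.amazonaws.com/dev/hello
```

(The API id and region above are placeholders; use whatever Zappa reported for your deployment.)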

You should see the standard “Hello World” response in your browser window.

You can find the code for the lambda api.py function here.

Note: At some point, I’ll pull this endpoint down…but will leave it up for a bit for users to play around with.



If you want to learn more about Lambda, there are two fairly good books on the topic – check them out (Amazon links):



Eric D. Brown, D.Sc. has a doctorate in Information Systems with a specialization in Data Sciences, Decision Support and Knowledge Management. He writes about utilizing python for data analytics at pythondata.com and the crossroads of technology and strategy at ericbrown.com.