Full map of ericbrown.com keyword matrix

Text Analytics and Visualization

For this post, I want to describe a text analytics and visualization technique using a basic keyword extraction mechanism using nothing but a word counter to find the top 3 keywords from a corpus of articles that I’ve created from my blog at http://ericbrown.com.  To create this corpus, I downloaded all of my blog posts (~1400 of them) and grabbed the text of each post. Then, I tokenize the post using nltk and various stemming / lemmatization techniques, count the keywords and take the top 3 keywords.  I then aggregate all keywords from all posts to create a visualization using Gephi.

I’ve uploaded a jupyter notebook with the full code-set for you to replicate this work. You can also get a subset of my blog articles in a csv file here.   You’ll need beautifulsoup and nltk installed. You can install them with:

pip install bs4 nltk

To get started, let’s load our libraries:

import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from string import punctuation
from collections import Counter
from collections import OrderedDict
import re
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from HTMLParser import HTMLParser
from bs4 import BeautifulSoup

I’m loading warnings here because there’s a warning about “`BeautifulSoup“` that we can ignore.

Now, let’s set up some things we’ll need for this work.

First, let’s set up our stop words, stemmers and lemmatizers.

porter = PorterStemmer()
wnl = WordNetLemmatizer() 
stop = stopwords.words('english')
stop = set(stop)

Now, let’s set up some functions we’ll need.

The tokenizer function is taken from here.  If you want to see some cool topic modeling, jump over and read How to mine newsfeed data and extract interactive insights in Python…its a really good article that gets into topic modeling and clustering…which is something I’ll hit on here as well in a future post.

# From http://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
def tokenizer(text):
    tokens_ = [word_tokenize(sent) for sent in sent_tokenize(text)]
    tokens = []
    for token_by_sent in tokens_:
        tokens += token_by_sent
    tokens = list(filter(lambda t: t.lower() not in stop, tokens))
    tokens = list(filter(lambda t: t not in punctuation, tokens))
    tokens = list(filter(lambda t: t not in [u"'s", u"n't", u"...", u"''", u'``', u'\u2014', u'\u2026', u'\u2013'], tokens))
    filtered_tokens = []
    for token in tokens:
        token = wnl.lemmatize(token)
        if re.search('[a-zA-Z]', token):
    filtered_tokens = list(map(lambda token: token.lower(), filtered_tokens))
    return filtered_tokens

Next, I had some html in my articles, so i wanted to strip it from my text before doing anything else with it…here’s a class to do that using bs4.  I found this code on Stackoverflow.

class MLStripper(HTMLParser):
    def __init__(self):
        self.fed = []
    def handle_data(self, d):
    def get_data(self):
        return ''.join(self.fed)
def strip_tags(html):
    s = MLStripper()
    return s.get_data()

OK – now to the fun stuff. To get our keywords, we need only 2 lines of code. This function does a count and returns said count of keywords for us.

def get_keywords(tokens, num):
    return Counter(tokens).most_common(num)

Finally,  I created a function to take a pandas dataframe filled with urls/pubdate/author/text and then create my keywords from that.  This function  iterates over a pandas dataframe (each row is an article from my blog), tokenizes the ‘text’ from  and returns a pandas dataframe with keywords, the title of the article and the publication data of the article.

def build_article_df(urls):
    articles = []
    for index, row in urls.iterrows():
            data=row['text'].strip().replace("'", "")
            data = strip_tags(data)
            soup = BeautifulSoup(data)
            data = soup.get_text()
            data = data.encode('ascii', 'ignore').decode('ascii')
            document = tokenizer(data)
            top_5 = get_keywords(document, 5)
            unzipped = zip(*top_5)
            kw= list(unzipped[0])
            kw=",".join(str(x) for x in kw)
            articles.append((kw, row['title'], row['pubdate']))
        except Exception as e:
            print e
            #print data
    article_df = pd.DataFrame(articles, columns=['keywords', 'title', 'pubdate'])
    return article_df

Time to load the data and start analyzing. This bit of code loads in my blog articles (found here) and then grabs only the interesting columns from the data, renames them and prepares them for tokenization. Most of this can be done in one line when reading in the csv file, but I already had this written for another project and just it as is.

df = pd.read_csv('../examples/tocsv.csv')
data = []
for index, row in df.iterrows():
    data.append((row['Title'], row['Permalink'], row['Date'], row['Content']))
data_df = pd.DataFrame(data, columns=['title' ,'url', 'pubdate', 'text' ])

Taking the “`tail()“` of the dataframe gets us:

tail of article dataframe

Now, we can tokenize and do our word-count by calling our “`build_article_df“` function.

article_df = build_article_df(data_df)

This gives us a new dataframe with the top 3 keywords for each article (along with the pubdate and title of the article).

top 3 keywords per article






This is quite cool by itself. We’ve generated keywords for each article automatically using a simple counter. Not terribly sophisticated but it works and works well. There are many other ways to do this, but for now we’ll stick with this one. Beyond just having the keywords, it might be interesting to see how these keywords are ‘connected’ with each other and with other keywords. For example, how many times does ‘data’ shows up in other articles?

There are multiple ways to answer this question, but one way is by visualizing the keywords in a topology / network map to see the connections between keywords. we need to do a ‘count’ of our keywords and then build a co-occurrence matrix. This matrix is what we can then import into Gephi to visualize. We could draw the network map using networkx, but it tends to be tough to get something useful from that without a lot of work…using Gephi is much more user friendly.

We have our keywords and need a co-occurance matrix. To get there, we need to take a few steps to get our keywords broken out individually.

for index, row in article_df.iterrows():
    for kw in keywords:
        keywords_array.append((kw.strip(' '), row['keywords']))
kw_df = pd.DataFrame(keywords_array).rename(columns={0:'keyword', 1:'keywords'})

We now have a keyword dataframe “`kw_df“` that holds two columns: keyword and keywords with “`keyword“`

keyword dataframe






This doesn’t really make a lot of sense yet, but we need both columns to build a co-occurance matrix. We do by iterative over each document keyword list (the “`keywords“` column) and seeing if the “`keyword“` is included. If so, we added to our occurance matrix and then build our co-occurance matrix.

document = kw_df.keywords.tolist()
names = kw_df.keyword.tolist()
document_array = []
for item in document:
    items = item.split(',')
occurrences = OrderedDict((name, OrderedDict((name, 0) for name in names)) for name in names)
# Find the co-occurrences:
for l in document_array:
    for i in range(len(l)):
        for item in l[:i] + l[i + 1:]:
            occurrences[l[i]][item] += 1
co_occur = pd.DataFrame.from_dict(occurrences )

Now, we have a co-occurance matrix in the “`co_occur“` dataframe, which can be imported into Gephi to view a map of nodes and edges. Save the “`co_occur“` dataframe as a CSV file for use in Gephi (you can download a copy of the matrix here).


Over to Gephi

Now, its time to play around in Gephi. I’m a novice in the tool so can’t really give you much in the way of a tutorial, but I can tell you the steps you need to take to build a network map. First, import your co-occuance matrix csv file using File -> Import Spreadsheet and just leave everything at the default.  Then, in the ‘overview’ tab, you should see a bunch of nodes and connections like the image below.

network map of a subset of ericbrown.com articles
Network map of a subset of ericbrown.com articles

Next, move down to the ‘layout’ section and select the Fruchterman Reingold layout and push ‘run’ to get the map to redraw. At some point, you’ll need to press ‘stop’ after the nodes settle down on the screen. You should see something like the below.

redrawn nodes and edges
Network map of a subset of ericbrown.com articles


Cool, huh? Now…let’s get some color into this graph.  In the ‘appearance’ section, select ‘nodes’ and then ‘ranking’. Select “Degree’ and hit ‘apply’.  You should see the network graph change and now have some color associated with it.  You can play around with the colors if you want but the default color scheme should look something like the following:

colored Network map of a subset of ericbrown.com articles

Still not quite interesting though. Where’s the text/keywords?  Well…you need to swtich over to the ‘overview’ tab to see that. You should see something like the following (after selecting ‘Default Curved’ in the drop-down.

colored Network map of a subset of ericbrown.com articles

Now that’s pretty cool. You can see two very distinct areas of interest here. “Data’ and “Canon”…which makes sense since I write a lot about data and share a lot of my photography (taken with a Canon camera).

Here’s a full map of all 1400 of my articles if you are interested.  Again, there are two main clusters around photography and data but there’s also another large cluster around ‘business’, ‘people’ and ‘cio’, which fits with what most of my writing has been about over the years.

Full map of ericbrown.com keyword matrix

There are a number of other ways to visualize text analytics.  I’m planning a few additional posts to talk about some of the more interesting approaches that I’ve used and run across recently. Stay tuned.

If you want to learn more about Text analytics, check out these books:

Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data 

Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit

Text Mining with R


0 0 votes
Article Rating
Newest Most Voted
Inline Feedbacks
View all comments

[…] part of a tutorial on Text Analytics and Visualization I just finished  over  on technical blog called Python Data (where I blog about using python for […]


[…] Read the entire article at the source link… […]

Jose Rivera
Jose Rivera
6 years ago

How did you convert your blog post to CSV’s?

6 years ago

Thank you !
If you are using Python 3.x in order for the script to work these changes need to be made:

– from html.parser import HTMLParser (instead of from HTMLParser import HTMLParser)
– add HTMLParser.__init__(self) in MLStripper.__init__
– replace unzipped = zip(*top_5) with : unzipped = list(zip(*top_5)) in build_article_df

Ryan Aga
Ryan Aga
4 years ago


I encountered an error when I tried to replicate your code – particularly on this part: article_df = build_article_df(data_df). It gives me a message ‘list index out of range’.

My data contains text (verbatim data) and id fields from usual SQL data source. The beautifulsoup function is used for reading html/xml files.

Which parts of the ‘build_article_df’ function should I edit/delete?

Your response is highly appreciated.

Thank you.