Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers and other various types of analysis.

In order to do this type of analysis, you first need to have some data to analyze.  You can also scrape websites like Twitter or Facebook using simple web scrapers, but I’ve always found it easier to use the API’s that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but its been a while since I’ve looked at the Search API).   I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your required keywords and then store those tweets somewhere for later analysis.

There are tons of ways to connect up to the Streaming API. There are also quite a few Twitter API wrappers for Python (and most of them work very well).   I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small / short-term project, I tend to reach for MongoDB to store the tweets using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB simply because MySQL fits into my ecosystem of backup scripts, etc better than MongoDB does.  MongoDB is perfectly suited for this type of work for larger projects…I just tend to swing toward MySQL for those projects.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces/parts of this script but I don’t want to discount the help I had from other people / sites in building this script.

Collecting / Storing Tweets with Python and MongoDB

Let’s set up our imports:

Next, set up your mongoDB path:

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases seperated by commas.

Here, I’m listening for words related to maching learning, data science, etc.

Next, let’s set up our Twitter API Access information.  You can set these up here.

Time to build the listener class.

Now that we have the listener class, let’s set everything up to start listening.

Now you are ready to go. The full script is below. You can store this script as “” and run it as “python” and – assuming you set up mongoDB and your twitter API key’s correctly, you should start collecting Tweets.

The Full Script:


Dask – A better way to work with large CSV files in Python

Dask dataframeIn a recent post titled Working with Large CSV files in Python, I shared an approach I use when I have very large CSV files (and other file types) that are too large to load into memory. While the approach I previously highlighted works well, it can be tedious to first load data into sqllite (or any other database) and then access that database to analyze data.   I just found a better approach using Dask.

While looking around the web to learn about some parallel processing capabilities, I ran across a python module named Dask, which describes itself as:

…is a flexible parallel computing library for analytic computing.

When I saw that, I was intrigued. There’s a lot that can be done with that statement  and I’ve got plans to introduce Dask into my various tool sets for data analytics.

While reading the docs, I ran across the ‘dataframe‘ concept and immediately new I’d found a new tool for working with large CSV files.  With Dask’s dataframe concept,  you can do out-of-core analysis (e.g., analyze data in the CSV without loading the entire CSV file into memory). Other than out-of-core manipulation, dask’s dataframe uses the pandas API, which makes things extremely easy for those of us who use and love pandas.

With Dask and its dataframe construct, you set up the dataframe must like you would in pandas but rather than loading the data into pandas, this appraoch keeps the dataframe as a sort of ‘pointer’ to the data file and doesn’t load anything until you specifically tell it to do so.

One note (that I always have to share):  If you are planning on working with your data set over time, its probably best to get the data into a database of some type.

An example using Dask and the Dataframe

First, let’s get everything installed. The documentation claims that you just need to install dask, but I had to install ‘toolz’ and ‘cloudpickle’ to get dask’s dataframe to import.  To install dask and its requirements, open a terminal and type (you need pip for this):

NOTE: I mistakenly had “pip install dask” listed initially. This only installs the base dask system and not the dataframe (and other dependancies). Thanks to Kevin for pointing this out.

Now, let’s write some code to load csv data and and start analyzing it. For this example, I’m using the 311 Service Requests dataset from NYC’s Open Data portal.   You can download the dataset here: 311 Service Requests – 7Gb+ CSV

Set up your dataframe so you can analyze the 311_Service_Requests.csv file. This file is assumed to be stored in the directory that you are working in.

Unlike pandas, the data isn’t read into memory…we’ve just set up the dataframe to be ready to do some compute functions on the data in the csv file using familiar functions from pandas. Note: I used “dtype=’str'” in the read_csv to get around some strange formatting issues in this particular file.

Let’s take a look at the first few rows of the file using pandas’ head() call.  When you run this, the first X rows (however many rows you are looking at with head(X)) and then displays those rows.

Note: a small subset of the columns are shown below for simplicity

Unique Key Created Date Closed Date Agency
25513481 05/09/2013 12:00:00 AM 05/14/2013 12:00:00 AM HPD
25513482 05/09/2013 12:00:00 AM 05/13/2013 12:00:00 AM HPD
25513483 05/09/2013 12:00:00 AM 05/22/2013 12:00:00 AM HPD
25513484 05/09/2013 12:00:00 AM 05/12/2013 12:00:00 AM HPD
25513485 05/09/2013 12:00:00 AM 05/11/2013 12:00:00 AM HPD

We see that there’s some spaces in the column names. Let’s remove those spaces to make things easier to work with.

The cool thing about dask is that you can do things like renaming columns without loading all the data into memory.

There’s a column in this data called ‘Descriptor’ that has the problem types, and “radiator” is one of those problem types. Let’s take a look at how many service requests were because of some problem with a radiator.  To do this, you can filter the dataframe using standard pandas filtering (see below) to create a new dataframe.

Let’s see how many rows we have using the ‘count’ command

You’ll notice that when you run the above command, you don’t actually get count returned. You get a descriptor back similar  like “dd.Scalar<series-…, dtype=int64>

To actually compute the count, you have to call “compute” to get dask to run through the dataframe and count the number of records.

When you run this command, you should get something like the following

The above are just some samples for using dask’s dataframe construct.  Remember, we built a new dataframe using pandas’ filters without loading the entire original data set into memory.  They may not seem like much, but when working with a 7Gb+ file, you can save a great deal of time and effort using dask when compared to using the approach I previously mentioned.

Dask seems to have a ton of other great features that I’ll be diving into at some point in the near future, but for now, the dataframe construct has been an awesome find.



Working with large CSV files in Python

large csv files in pythonI’m currently working on a project that has multiple very large CSV files (6 gigabytes+). Normally when working with CSV data, I read the data in using pandas and then start munging and analyzing the data. With files this large, reading the data into pandas directly can be difficult (or impossible) due to memory constrictions, especially if you’re working on a prosumer computer. In this post, I describe a method that will help you when working with large CSV files in python.

While it would be pretty straightforward to load the data from these CSV files into a database, there might be times when you don’t have access to a database server and/or you don’t want to go through the hassle of setting up a server.  If you are going to be working on a data set long-term, you absolutely should load that data into a database of some type (mySQL, postgreSQL, etc) but if you just need to do some quick checks / tests / analysis of the data, below is one way to get a look at the data in these large files with python, pandas and sqllite.

To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.

Next, set up a variable that points to your csv file.  This isn’t necessary but it does help in re-usability.

With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.

This command uses pandas’ “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.

Before we can actually work with the data, we need to do something with it so we can begin to filter it to work with subsets of the data. This is usually what I would use pandas’ dataframe for but with large data files, we need to store the data somewhere else. In this case, we’ll set up a local sqllite database, read the csv file in chunks and then write those chunks to sqllite.

To do this, we’ll first need to create the sqllite database using the following command.

Next, we need to iterate through the CSV file in chunks and store the data into sqllite.

With this code, we are setting the chunksize at 100,000 to keep the size of the chunks managable, initializing a couple of iterators (i=0, j=0) and then running a through a for loop.  The for loop read a chunk of data from the CSV file, removes space from any of column names, then stores the chunk into the sqllite database (df.to_sql(…)).

This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas ‘sql’ tools to pull data from the database without worrying about memory constraints.

To access the data now, you can run commands like the following:

Of course, using ‘select *…’ will load all data into memory, which is the problem we are trying to get away from so you should throw from filters into your select statements to filter the data. For example: