Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content to analyze networks, sentiment, influencers, and other areas of interest.

In order to do this type of analysis, you first need to have some data to analyze. You could scrape websites like Twitter or Facebook with a simple web scraper, but I’ve always found it easier to use the APIs that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but it’s been a while since I’ve looked at the Search API). I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your required keywords and then stores those tweets somewhere for later analysis.

There are tons of ways to connect to the Streaming API, and quite a few Twitter API wrappers for Python (most of them work very well). I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small / short-term project, I tend to reach for MongoDB to store the tweets, using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB, simply because MySQL fits into my ecosystem of backup scripts, etc. better than MongoDB does. MongoDB is perfectly suited for this type of work even on larger projects…I just tend to swing toward MySQL for those.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them in MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces / parts, but I don’t want to discount the help I had from other people and sites in building this script.

Collecting / Storing Tweets with Python and MongoDB

Let’s set up our imports:
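Assuming Tweepy 3.x (version 4 removed the StreamListener class used below) and PyMongo are installed (pip install tweepy pymongo), the imports look something like this:

import json

import tweepy
from pymongo import MongoClient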

Next, set up your MongoDB path:
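For a local MongoDB instance, a connection string like the one below is enough. The twitterdb database name is just a placeholder; use whatever you like:

MONGO_HOST = 'mongodb://localhost/twitterdb'  # assumes mongod is running locally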

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases separated by commas.

Here, I’m listening for words related to machine learning, data science, etc.
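The Streaming API’s track parameter takes a list of terms, so a plain Python list works:

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']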

Next, let’s set up our Twitter API access information. You can set these up here.
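These are placeholders; drop in the consumer key / secret and access token / secret from your own Twitter app:

CONSUMER_KEY = "your consumer key"
CONSUMER_SECRET = "your consumer secret"
ACCESS_TOKEN = "your access token"
ACCESS_TOKEN_SECRET = "your access token secret"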

Time to build the listener class.
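A minimal version subclasses tweepy.StreamListener (Tweepy 3.x) and overrides three methods: on_connect to confirm the connection, on_error to report problems, and on_data to decode each tweet and insert it into MongoDB. The twitter_search collection name is arbitrary; MongoDB creates the database and collection automatically on first insert:

class StreamListener(tweepy.StreamListener):
    """Listens to the Twitter Streaming API and stores each incoming tweet in MongoDB."""

    def on_connect(self):
        # Called once, when the connection to the Streaming API is established
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # Print the HTTP status code; returning False disconnects the stream
        print('An error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # Decode the raw JSON payload from Twitter and insert it into MongoDB
        try:
            client = MongoClient(MONGO_HOST)
            db = client.twitterdb  # created automatically if it doesn't exist
            datajson = json.loads(data)
            created_at = datajson['created_at']
            print("Tweet collected at " + str(created_at))
            db.twitter_search.insert_one(datajson)  # collection also auto-created
        except Exception as e:
            print(e)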

Now that we have the listener class, let’s set everything up to start listening.
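Roughly: authenticate with OAuth, hand the listener to a Stream, and filter on the keyword list. Setting wait_on_rate_limit=True tells Tweepy to back off when Twitter rate-limits the connection:

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Attach the listener and start filtering the stream on our keywords
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)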

Now you are ready to go. The full script is below. You can store this script as “streaming_API.py” and run it as “python streaming_API.py” and, assuming you set up MongoDB and your Twitter API keys correctly, you should start collecting Tweets.

The Full Script:
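As assembled above, this sketch assumes Tweepy 3.x, PyMongo, and a local MongoDB instance, with twitterdb / twitter_search as placeholder names you can change:

import json

import tweepy
from pymongo import MongoClient

MONGO_HOST = 'mongodb://localhost/twitterdb'  # assumes mongod is running locally

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

CONSUMER_KEY = "your consumer key"
CONSUMER_SECRET = "your consumer secret"
ACCESS_TOKEN = "your access token"
ACCESS_TOKEN_SECRET = "your access token secret"


class StreamListener(tweepy.StreamListener):
    """Listens to the Twitter Streaming API and stores each incoming tweet in MongoDB."""

    def on_connect(self):
        # Called once, when the connection to the Streaming API is established
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # Print the HTTP status code; returning False disconnects the stream
        print('An error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # Decode the raw JSON payload from Twitter and insert it into MongoDB
        try:
            client = MongoClient(MONGO_HOST)
            db = client.twitterdb  # created automatically if it doesn't exist
            datajson = json.loads(data)
            created_at = datajson['created_at']
            print("Tweet collected at " + str(created_at))
            db.twitter_search.insert_one(datajson)  # collection also auto-created
        except Exception as e:
            print(e)


auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Attach the listener and start filtering the stream on our keywords
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)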

 

Eric D. Brown, D.Sc. has a doctorate in Information Systems with a specialization in Data Sciences, Decision Support and Knowledge Management. He writes about utilizing Python for data analytics at pythondata.com and the crossroads of technology and strategy at ericbrown.com.

19 thoughts on “Collecting / Storing Tweets with Python and MongoDB”

  1. Hi, was just wondering if you wanted to look at the data collected, how would you query it with the find() method? Or is there another way entirely?
    Great post btw, very helpful!

    1. Hi Rachel –

      You would use the find() method to query the database.

      For example, to grab the last 200 records sorted by the ‘created_at’ value, you can use the following code:

      cursor = mongo_coll.find().limit(200).sort([("created_at",-1)])

  2. Hi,
    I did collect and insert the data into MongoDB. I then transferred the data from MongoDB to a JSON file. However, the file seems corrupt, since it can’t be opened in a Python notebook. The error is as follows:

    JSONDecodeError: Extra data: line 2 column 1 (char 3656)

  3. When I run the program, it says “an integer is required”! I can’t fix this problem. Any help?

    Tracking: [‘#bigdata’, ‘#AI’, ‘#datascience’, ‘#machinelearning’, ‘#ml’, ‘#iot’]
    You are now connected to the streaming API.
    Tweet collected at 2017-04-29 12:56:22+00:00
    an integer is required
    Tweet collected at 2017-04-29 12:56:23+00:00
    an integer is required

  4. Hi. I’d like to know if you’ve tried continuously querying data from the db while the stream is still running.

    I’m currently working on a trend detection model and I’m also working with mongodb as my database and the python module tweepy.

  5. Hi Eric, thank you, this is very helpful. I’m trying my hardest to get a Twitter API stream going into a MongoDB. In regards to setting up the Mongo db, do I have to tell it how to collate things – for example, the Twitter columns? I thought that as it was dragging in JSON files it would automatically put time, date, user name, IP, tweet, geolocation, etc. into a string and then would columnise automatically? Is that the case?
    Thank you in advance for your advice.
    best wishes
    Rebecca

    1. Hi Rebecca – I’m sorry…I missed your comment.

      You should be able to just start storing in mongo (in a new data store/db) and it will collate things for you based on the data you are sending it. You do need to tell it that you want ‘date’ to be stored in a field titled ‘date’, etc., though.

  6. Hi Eric, great post. I have a question. If I am going to adapt this methodology to create a huge project that seeks to use big data (i.e., tweets, blog posts, and co.) in a data lake – where I will want to continually store the metadata from a Twitter search – would you advise saving the search results directly into MongoDB? Or should I use a JSON file format to save the data initially and pick out the information I care about per-time into a database, probably MongoDB too, or MySQL? I am also thinking that if I am going to use JSON storage, then the JSON data can continually be stored in an object-oriented database, indexed maybe per date of acquisition.

    1. This is a good question.

      I guess it really depends on whether you are going to be doing this via streaming (e.g., real-time) or batch (e.g., every few minutes).

      Ultimately, the data would want to go into some form of data storage (MongoDB, MySQL, etc.) for easier access for analysis. If you are grabbing streaming data, throw it into a real-time store, access it later for analysis, and then store it in your longer-term store. If batch, skip the intermediate storage step: just grab the data, analyze it, and store what you need.

      If you are really looking at streaming data, it’s probably best to look at something like AWS’ Kinesis or similar products (if you have budget for it).

  7. Why does this code only collect the data of the past 8 hours or so?
    What if I want to collect the tweets going backward, in the manner in which we see them on Twitter?

    1. This code only collects tweets for the time it is running, using the streaming API. If you want to grab historical tweets, you’ll need to use the search API, but that will only return a certain number (it used to be 1,000 tweets) instead of the entire history. To get full histories from Twitter, you’ll need to scrape Twitter and/or pay for firehose access (which is cost-prohibitive for anyone but the largest companies).
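      If you want to experiment with the search API, a rough Tweepy (3.x) sketch would look something like this – the query and item count here are just examples:

      auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
      auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
      api = tweepy.API(auth, wait_on_rate_limit=True)

      # Page through recent results for a query (bounded by what the search API returns)
      for tweet in tweepy.Cursor(api.search, q='#bigdata').items(500):
          print(tweet.created_at, tweet.text)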

  8. Hey, I am trying to run this code and getting an error: “mongo client module not found”. Could you please help me out with this? I have installed the latest version of pymongo using pip.
