Collecting / Storing Tweets with Python and MongoDB

A good amount of the work I do involves using social media content to analyze networks, sentiment, influencers, and other aspects of online activity.

In order to do this type of analysis, you first need some data to analyze. You can scrape websites like Twitter or Facebook using simple web scrapers, but I've always found it easier to use the APIs that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real time and storing it for analysis. Twitter also has a Search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets, but it's been a while since I've looked at the Search API). I'm a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that 'listens' to the API for your keywords and then stores those tweets somewhere for later analysis.

There are tons of ways to connect to the Streaming API, and there are quite a few Twitter API wrappers for Python (most of them work very well). I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I'm working on a small / short-term project, I tend to reach for MongoDB to store the tweets, using the PyMongo module. For larger / longer-term projects I usually connect the streaming script to MySQL instead of MongoDB, simply because MySQL fits into my ecosystem of backup scripts, etc. better than MongoDB does. MongoDB is perfectly suited to this type of work on larger projects too; I just tend to swing toward MySQL for those.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I've found on the web over the years. I don't recall where I found the pieces and parts of it, but I don't want to discount the help I've had from other people and sites in building it.


Let’s set up our imports:
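The original import block didn't survive here; a minimal set, assuming Tweepy 3.x and PyMongo (in Tweepy 4.x, `StreamListener` was merged into `tweepy.Stream`), might look like:

```python
import json

from pymongo import MongoClient
import tweepy
```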

Next, set up your MongoDB path:
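The original snippet is missing; a sketch for a local MongoDB instance on the default port follows. The `twitterdb` database name matches what a reader reports seeing in Compass in the comments below; the `twitter_search` collection name is my placeholder.

```python
from pymongo import MongoClient

# Connection string for a local MongoDB instance (default port 27017).
MONGO_HOST = 'mongodb://localhost:27017'

client = MongoClient(MONGO_HOST)
db = client.twitterdb           # database is created lazily on first insert
collection = db.twitter_search  # collection that will hold the tweets
```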

Next, set up the words that you want to 'listen' for on Twitter. You can use words or phrases separated by commas.

Here, I'm listening for words related to machine learning, data science, etc.
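The keyword list itself is missing from this copy; the hashtags below match the `Tracking:` console output quoted in the comments further down the page:

```python
# Terms to 'listen' for on the Twitter stream.
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
```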

Next, let's set up our Twitter API access information. You can set these credentials up on Twitter's developer site.
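The credentials block is missing from this copy; it would simply hold placeholders for your own keys and tokens (variable names here are my assumption):

```python
# Twitter API credentials -- replace the placeholders with the keys and
# tokens from your own app on Twitter's developer site.
CONSUMER_KEY = "XXXXX"
CONSUMER_SECRET = "XXXXX"
ACCESS_TOKEN = "XXXXX"
ACCESS_TOKEN_SECRET = "XXXXX"
```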

Time to build the listener class.
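The listener class itself was lost in extraction. The sketch below is a reconstruction consistent with the console output quoted in the comments ("You are now connected to the streaming API.", "Tweet collected at …"), assuming Tweepy 3.x and the `twitterdb` database; the class and collection names are my placeholders, not necessarily the author's originals.

```python
import json

from pymongo import MongoClient
import tweepy

MONGO_HOST = 'mongodb://localhost:27017'


class TweetListener(tweepy.streaming.StreamListener):
    """Listens to the Streaming API and stores each tweet in MongoDB."""

    def on_connect(self):
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        print('An error has occurred: ' + repr(status_code))
        return False  # returning False disconnects the stream

    def on_data(self, data):
        try:
            client = MongoClient(MONGO_HOST)
            db = client.twitterdb          # created lazily on first insert
            datajson = json.loads(data)    # each tweet arrives as a JSON string
            print('Tweet collected at ' + str(datajson['created_at']))
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)
```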

Now that we have the listener class, let’s set everything up to start listening.
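This step is also missing from the extracted copy; a sketch that authenticates and starts filtering, assuming the listener class and the credential / `WORDS` variables from the earlier steps (Tweepy 3.x API), might look like:

```python
# Authenticate with Twitter using the credentials set up earlier.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Wire the listener to a Stream and start filtering on our keywords.
listener = TweetListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print('Tracking: ' + str(WORDS))
streamer.filter(track=WORDS)  # blocks and runs until interrupted
```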

Now you are ready to go. The full script is below. Save it to a file and run it with python; assuming you've set up MongoDB and your Twitter API keys correctly, you should start collecting tweets.

The Full Script:
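The full script didn't survive extraction either. Below is a hedged reconstruction assembled from the steps above, assuming Tweepy 3.x and PyMongo; the database name `twitterdb` appears in the comments, while the class name, collection name, and credential placeholders are mine.

```python
import json

from pymongo import MongoClient
import tweepy

# MongoDB connection string and the words to track (the hashtag list matches
# the console output quoted in the comments below).
MONGO_HOST = 'mongodb://localhost:27017'
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

# Twitter API credentials -- fill in with your own keys and tokens.
CONSUMER_KEY = "XXXXX"
CONSUMER_SECRET = "XXXXX"
ACCESS_TOKEN = "XXXXX"
ACCESS_TOKEN_SECRET = "XXXXX"


class TweetListener(tweepy.streaming.StreamListener):
    """Listens to the Streaming API and stores each tweet in MongoDB."""

    def on_connect(self):
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        print('An error has occurred: ' + repr(status_code))
        return False  # disconnect the stream on error

    def on_data(self, data):
        try:
            client = MongoClient(MONGO_HOST)
            db = client.twitterdb          # created lazily on first insert
            datajson = json.loads(data)    # each tweet arrives as a JSON string
            print('Tweet collected at ' + str(datajson['created_at']))
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)


auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

listener = TweetListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print('Tracking: ' + str(WORDS))
streamer.filter(track=WORDS)  # runs until interrupted
```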


Eric D. Brown, D.Sc. has a doctorate in Information Systems with a specialization in Data Sciences, Decision Support and Knowledge Management. He writes about using Python for data analytics and about the crossroads of technology and strategy.

34 thoughts on “Collecting / Storing Tweets with Python and MongoDB”

  1. Hi, was just wondering if you wanted to look at the data collected, how would you query it with the find() method? Or is there another way entirely?
    Great post btw, very helpful!

    1. Hi Rachel –

You would use the find() method to query the database.

For example, to grab the last 200 records sorted by the 'created_at' value, you can use the following code:

cursor = mongo_coll.find().limit(200).sort([("created_at", -1)])

  2. Hi,
I did collect and insert the data into MongoDB. I then transferred the data from MongoDB to a JSON file. However, the file seems corrupt, since it can't be opened in a Python notebook. The error is as follows:

    JSONDecodeError: Extra data: line 2 column 1 (char 3656)

When I run the program, it says "an integer is required"! I can't fix this problem. Any help?

Tracking: ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
    You are now connected to the streaming API.
    Tweet collected at 2017-04-29 12:56:22+00:00
    an integer is required
    Tweet collected at 2017-04-29 12:56:23+00:00
    an integer is required

Hi. I'd like to know if you've tried continuously querying data from the db while the stream is still running.

    I’m currently working on a trend detection model and I’m also working with mongodb as my database and the python module tweepy.

Hi Eric, thank you, this is very helpful. I'm trying my hardest to get a Twitter API stream flowing into MongoDB. In regards to setting up the Mongo db, do I have to tell it how to collate things, for example the Twitter columns? I thought that, since it's dragging in JSON, it would automatically put time, date, user name, IP, tweet, geolocation, etc. into fields automatically? Is that the case?
    Thank you in advance for your advice.
    best wishes

    1. Hi Rebecca – I’m sorry…I missed your comment.

      You should be able to just start storing in mongo (in a new data store/db) and it will collate things for you based on the data you are sending it. You do need to tell it that you want ‘date’ to be stored in a field titled ‘date’, etc though.

Hi Eric, great post. I have a question. If I adapt this methodology for a huge project that uses big data (tweets, blog posts, and so on) in a data lake, and I want to continually store the metadata for a Twitter search, would you advise saving the search results directly into MongoDB? Or should I save the data initially in JSON files and pick out the information I care about, per run, into a database (probably MongoDB too, or MySQL)? I'm also thinking that if I go with JSON storage, the JSON data could be continually stored in an object-oriented database, indexed maybe per date of acquisition.

    1. This is a good question.

      I guess it really depends on whether you are going to be doing this via streaming (e.g., real-time) or batch (e.g., every few minutes).

Ultimately, the data should go into some form of data storage (MongoDB, MySQL, etc.) for easier access during analysis. If you are grabbing streaming data, throw it into a real-time store, access it later for analysis, and then move it into your longer-term store. If batch, skip the intermediate storage step: just grab the data, analyze it, and store what you need.

      If you are really looking at streaming data, probably best to look at something like AWS’ Kinesis or similar products (if you have budget for it).

Why does this code only collect data from the past 8 hours or so?
    What if I want to collect the tweets going backward, in the manner we see them on Twitter?

This code only collects tweets for the time it is running, using the Streaming API. If you want to grab historical tweets, you'll need to use the Search API, but that will only return a certain number (it used to be 1,000 tweets) instead of the entire history. To get full histories from Twitter, you'll need to scrape Twitter and/or pay for firehose access (which is cost-prohibitive for anyone but the largest companies).

Hey, I am trying to run this code and getting an error: MongoClient module not found. Could you please help me out with this? I have installed the latest version of pymongo using pip.

  9. Eric,
    An excellent tutorial. The code works as it is. I am new to MongoDB and after running the code, I added two lines to the above script:
    #See what databases are available under MongoDB
    dbs = MongoClient().database_names()
    Later, I fired up Compass tool of MongoDB and it listed the twitterdb database. Thank you.

  10. Eric
    Thanks for this tutorial. I am a lecturer at a UK uni and we are looking to start a new unit next year that will include this type of data analysis and we will be using mongodb. So perfect timing as we have to plan the unit now!

  11. Hey! Thanks a lot for this. I am having a little issue. When I run it this is what it shows:

Tracking: ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
    You are now connected to the streaming API.
    Tweet collected at Thu Apr 12 06:09:36 +0000 2018
    localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
    Tweet collected at Thu Apr 12 06:09:36 +0000 2018
    localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
    Tweet collected at Thu Apr 12 06:09:36 +0000 2018
    localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
    Tweet collected at Thu Apr 12 06:09:38 +0000 2018
    localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
    Tweet collected at Thu Apr 12 06:09:39 +0000 2018
    localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
    Tweet collected at Thu Apr 12 06:09:40 +0000 2018
    localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
    Tweet collected at Thu Apr 12 06:09:40 +0000 2018

  12. Dr. Brown, I’m using the tweet text to do sentiment analysis.

# grab the tweet 'text' data from the tweet to use for the VADER sentiment analyzer
    text = datajson['text']
    analyzer = SentimentIntensityAnalyzer()
    vs = analyzer.polarity_scores(text)

I want to insert this additional text into the same document as each tweet, but I can't seem to figure out how to do that. Can you help out?

  13. Hey Eric,

    Nice tutorial. I just realized that “created_at” is stored as a string in MongoDB.
    Seems to have following format: %a %b %d %H:%M:%S +0000 %Y

    I was thinking about strptime in python, but isn’t there a way to do it in MongoDB as well?

    Best Wuff

When you insert by column you can set a format. For example, you could use something like this: db['created_at'].insert({"date": d}). That would require you to switch from inserting the full JSON data to inserting by column.
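For reference, converting Twitter's created_at string (the format noted above: %a %b %d %H:%M:%S +0000 %Y) into a real Python datetime before inserting can be done with strptime; the helper name here is illustrative:

```python
from datetime import datetime

# Twitter's 'created_at' format, e.g. 'Thu Apr 12 06:09:36 +0000 2018'
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'

def parse_created_at(created_at):
    """Convert Twitter's created_at string to a datetime object."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)

# Storing a real datetime (instead of the raw string) lets MongoDB
# sort and range-query on actual dates, e.g.:
#   datajson['created_at'] = parse_created_at(datajson['created_at'])
```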

Hi, I'm working on a project doing sentiment analysis on tweets, and I'm looking for the right set of tools to build an online dashboard accessible to my clients. For now I just store daily batches on AWS S3 and access them with Python for the pre-processing & modelling.
    Would you say MongoDB storage with Elasticsearch + Kibana running on top of it is a good approach?
    If so (I have no experience with MongoDB), should I store the raw data in a first collection, just as you did in this demo, and create a second collection of processed data to apply my models to? Thanks for your help!

    1. Hi JB –

      My approach to this is to handle all the modeling / processing in the back-end with python and store the outputs of the processing in a database (I use MySQL). I then present the data via Flask / JQuery.

      I don’t know exactly what you are doing and/or if you need real-time presentation on the front-end. If you don’t need real-time, you could take a similar approach.

      Feel free to ping with more questions…happy to help if I can.
