Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content to analyze networks, sentiment, influencers, and other topics.

In order to do this type of analysis, you first need to have some data to analyze. You can scrape websites like Twitter or Facebook using simple web scrapers, but I’ve always found it easier to use the APIs that these companies provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real time and storing it for analysis. Twitter also has a Search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but it’s been a while since I’ve looked at the Search API). I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your keywords and then stores those tweets somewhere for later analysis.
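For comparison, here’s a minimal sketch of what pulling historical tweets through the Search API looks like with Tweepy (assuming tweepy 3.x, which matches the streaming code below); the query and count values are just placeholders, and the credentials are the same ones we set up later in this post:

import tweepy

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Pull one page of recent historical tweets matching a query
for tweet in api.search(q='#bigdata', count=100):
    print(tweet.created_at, tweet.text)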

There are tons of ways to connect to the Streaming API, and quite a few Twitter API wrappers for Python (most of them work very well). I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small or short-term project, I tend to reach for MongoDB to store the tweets, using the PyMongo module. For larger or longer-term projects I usually connect the streaming script to MySQL instead of MongoDB, simply because MySQL fits into my ecosystem of backup scripts, etc. better than MongoDB does. MongoDB is perfectly suited to this type of work for larger projects as well…I just tend to swing toward MySQL for those projects.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces and parts of this script, but I don’t want to discount the help I had from other people and sites in building it.


Let’s set up our imports:

from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient

Next, set up your MongoDB connection string:

MONGO_HOST = 'mongodb://localhost/twitterdb'  # assuming you have MongoDB installed locally
                                              # and a database called 'twitterdb'
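
If you want to verify the connection string before starting the stream, a quick sanity check might look like this (the timeout value is an arbitrary short one):

from pymongo import MongoClient

# Fail fast if MongoDB isn't reachable at MONGO_HOST
client = MongoClient(MONGO_HOST, serverSelectionTimeoutMS=3000)
client.admin.command('ping')  # raises an exception if the server is down
print("MongoDB connection OK")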

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases separated by commas. (In the Streaming API’s track parameter, a comma acts as a logical OR, while a space within a phrase acts as a logical AND.)

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

Here, I’m listening for words related to machine learning, data science, etc.

Next, let’s set up our Twitter API access information. You can create these credentials through Twitter’s application management pages.

CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"
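
As a side note, rather than hardcoding credentials you may prefer to read them from environment variables. A minimal sketch (the TWITTER_* variable names below are just my own convention, not anything Twitter requires):

import os

# Assumes you've exported these in your shell first, e.g.:
#   export TWITTER_CONSUMER_KEY="KEY"
CONSUMER_KEY = os.environ['TWITTER_CONSUMER_KEY']
CONSUMER_SECRET = os.environ['TWITTER_CONSUMER_SECRET']
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']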

Time to build the listener class.

class StreamListener(tweepy.StreamListener):
    # This is a class provided by tweepy to access the Twitter Streaming API.

    def on_connect(self):
        # Called once the connection to the Streaming API is established
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # If an error occurs, display the status code.
        # Returning False disconnects the stream.
        print('An error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # This is the meat of the script: it connects to your MongoDB and stores the tweet.
        try:
            # Opening a new client per tweet keeps the example simple; a single
            # client created at module level would be more efficient.
            client = MongoClient(MONGO_HOST)

            # Use the twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb

            # Decode the JSON from Twitter
            datajson = json.loads(data)

            # Grab the 'created_at' field from the tweet to use for display
            created_at = datajson['created_at']

            # Print a message to the screen confirming that we collected a tweet
            print("Tweet collected at " + str(created_at))

            # Insert the tweet into a collection called twitter_search.
            # If twitter_search doesn't exist, it will be created.
            # (insert_one replaces the deprecated insert() method in PyMongo 3.)
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)

Now that we have the listener class, let’s set everything up to start listening.

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Set up the listener. 'wait_on_rate_limit=True' helps with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)
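
One practical note: streamer.filter() blocks until the stream disconnects. If you want to be able to stop the collector cleanly with Ctrl-C, you could replace that last line with something like this sketch:

try:
    streamer.filter(track=WORDS)
except KeyboardInterrupt:
    # Disconnect cleanly when the user presses Ctrl-C
    streamer.disconnect()
    print("Stream stopped.")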

Now you are ready to go. The full script is below. Save it as “streaming_API.py” and run it with “python streaming_API.py”. Assuming you’ve set up MongoDB and your Twitter API keys correctly, you should start collecting tweets.
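
Once you have some tweets stored, you can inspect them with PyMongo’s standard query methods. A minimal sketch (count_documents assumes PyMongo 3.7 or later; MONGO_HOST is the connection string defined above):

from pymongo import MongoClient

client = MongoClient(MONGO_HOST)
db = client.twitterdb

# How many tweets have been collected so far?
print(db.twitter_search.count_documents({}))

# Peek at five stored tweets
for tweet in db.twitter_search.find().limit(5):
    print(tweet['created_at'], tweet.get('text'))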

The Full Script:

from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient

MONGO_HOST = 'mongodb://localhost/twitterdb'  # assuming you have MongoDB installed locally
                                              # and a database called 'twitterdb'
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"

class StreamListener(tweepy.StreamListener):
    # This is a class provided by tweepy to access the Twitter Streaming API.

    def on_connect(self):
        # Called once the connection to the Streaming API is established
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # If an error occurs, display the status code.
        # Returning False disconnects the stream.
        print('An error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # This is the meat of the script: it connects to your MongoDB and stores the tweet.
        try:
            # Opening a new client per tweet keeps the example simple; a single
            # client created at module level would be more efficient.
            client = MongoClient(MONGO_HOST)

            # Use the twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb

            # Decode the JSON from Twitter
            datajson = json.loads(data)

            # Grab the 'created_at' field from the tweet to use for display
            created_at = datajson['created_at']

            # Print a message to the screen confirming that we collected a tweet
            print("Tweet collected at " + str(created_at))

            # Insert the tweet into a collection called twitter_search.
            # If twitter_search doesn't exist, it will be created.
            # (insert_one replaces the deprecated insert() method in PyMongo 3.)
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Set up the listener. 'wait_on_rate_limit=True' helps with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)

 
