A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers, and other types of analysis.
To do this type of analysis, you first need some data to analyze. You can scrape websites like Twitter or Facebook with simple web scrapers, but I've always found it easier to use the APIs that these companies provide to pull down data.
The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but it's been a while since I've looked at the Search API). I'm a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that 'listens' to the API for your required keywords and then stores those tweets somewhere for later analysis.
There are tons of ways to connect to the Streaming API, and there are quite a few Twitter API wrappers for Python (most of them work very well). I tend to use Tweepy more than the others because of its ease of use and simple structure. Additionally, if I'm working on a small / short-term project, I tend to reach for MongoDB to store the tweets, using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB, simply because MySQL fits into my ecosystem of backup scripts, etc. better than MongoDB does. MongoDB is perfectly well suited for this type of work on larger projects too…I just tend to swing toward MySQL for those.
For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.
Note: This script is a mashup of many other scripts I've found on the web over the years. I don't recall where I found the pieces/parts of this script, but I don't want to discount the help I had from other people / sites in building it.
Collecting / Storing Tweets with Python and MongoDB
Let’s set up our imports:
from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient
Next, set up your MongoDB path:
MONGO_HOST = 'mongodb://localhost/twitterdb'  # assuming you have MongoDB installed locally
                                              # and a database called 'twitterdb'
Next, set up the words that you want to 'listen' for on Twitter. You can use words or phrases separated by commas.
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
Here, I'm listening for words related to machine learning, data science, etc.
Next, let's set up our Twitter API access information. You can create these credentials on Twitter's developer site.
CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"
Time to build the listener class.
class StreamListener(tweepy.StreamListener):
    # This is a class provided by tweepy to access the Twitter Streaming API.

    def on_connect(self):
        # Called once connected to the Streaming API
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # This is the meat of the script...it connects to your MongoDB and stores the tweet
        try:
            client = MongoClient(MONGO_HOST)

            # Use the twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb

            # Decode the JSON from Twitter
            datajson = json.loads(data)

            # Grab the 'created_at' field from the tweet to use for display
            created_at = datajson['created_at']

            # Print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))

            # Insert the data into a collection called twitter_search.
            # If twitter_search doesn't exist, it will be created.
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)
Now that we have the listener class, let’s set everything up to start listening.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)
Now you are ready to go. The full script is below. You can save it as "streaming_API.py" and run it with "python streaming_API.py" and, assuming you've set up MongoDB and your Twitter API keys correctly, you should start collecting tweets.
The Full Script:
from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient

MONGO_HOST = 'mongodb://localhost/twitterdb'  # assuming you have MongoDB installed locally
                                              # and a database called 'twitterdb'

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"

class StreamListener(tweepy.StreamListener):
    # This is a class provided by tweepy to access the Twitter Streaming API.

    def on_connect(self):
        # Called once connected to the Streaming API
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # This is the meat of the script...it connects to your MongoDB and stores the tweet
        try:
            client = MongoClient(MONGO_HOST)

            # Use the twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb

            # Decode the JSON from Twitter
            datajson = json.loads(data)

            # Grab the 'created_at' field from the tweet to use for display
            created_at = datajson['created_at']

            # Print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))

            # Insert the data into a collection called twitter_search.
            # If twitter_search doesn't exist, it will be created.
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)
Can you share a MySQL implementation of the same flow please?
The script would basically remain the same except for the insert statement. Instead of
db.twitter_search.insert_one(datajson)
you'd need to split the JSON up into the individual fields that you are interested in and then store those fields using the Python MySQL connector. I posted a walk-through here: http://pythondata.wpengine.com/collecting-storing-tweets-python-mysql/
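For anyone who wants a rough idea of what that might look like, here is a minimal sketch using the mysql-connector-python package. The table name, column names, and connection credentials below are purely illustrative and aren't defined anywhere in this post:

import json
import mysql.connector

# Hypothetical local MySQL instance and a pre-created 'tweets' table
cnx = mysql.connector.connect(user='root', password='password',
                              host='localhost', database='twitterdb')
cursor = cnx.cursor()

def store_tweet(data):
    # Pull only the fields we care about out of the raw tweet JSON
    datajson = json.loads(data)
    created_at = datajson['created_at']
    screen_name = datajson['user']['screen_name']
    text = datajson['text']

    query = ("INSERT INTO tweets (created_at, user_screen_name, text) "
             "VALUES (%s, %s, %s)")
    cursor.execute(query, (created_at, screen_name, text))
    cnx.commit()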
oh, great. Thanks
Hi, I was just wondering: if you wanted to look at the collected data, how would you query it with the find() method? Or is there another way entirely?
Great post btw, very helpful!
Hi Rachel –
You would use the find() method to query the database.
For example, to grab 200 of the last records sorted by the 'created_at' value, you can use the following code:
cursor = mongo_coll.find().limit(200).sort([("created_at",-1)])
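In case it helps, a slightly fuller sketch with the connection included (using the database and collection names from the post):

from pymongo import MongoClient

# Connect to the same local MongoDB that the collection script writes to
client = MongoClient('mongodb://localhost/twitterdb')
mongo_coll = client.twitterdb.twitter_search

# Grab 200 of the most recent records, newest first, and print a couple of fields
cursor = mongo_coll.find().limit(200).sort([("created_at", -1)])
for tweet in cursor:
    print(tweet['created_at'], tweet.get('text'))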
Hi,
I collected and inserted the data into MongoDB. I then transferred the data from MongoDB to a JSON file. However, the file seems corrupt, since it can't be opened in a Python notebook. The error is as follows:
JSONDecodeError: Extra data: line 2 column 1 (char 3656)
When I run the program, it says "an integer is required"! I can't fix this problem. Any help?
Tracking: ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
You are now connected to the streaming API.
Tweet collected at 2017-04-29 12:56:22+00:00
an integer is required
Tweet collected at 2017-04-29 12:56:23+00:00
an integer is required
Hi. I'd like to know if you've tried continuously querying data from the db while the stream is still running.
I’m currently working on a trend detection model and I’m also working with mongodb as my database and the python module tweepy.
I have, but for real-time or near-real-time streaming analysis I tend to go with something like Kinesis rather than the approach outlined here.
Hi Eric, thank you, this is very helpful. I'm trying my hardest to get a Twitter API stream going into MongoDB. In regards to setting up the Mongo DB, do I have to tell it how to collate things, for example the Twitter columns? I thought that since it was pulling in JSON files it would automatically put time, date, user name, IP, tweet, geolocation, etc. into a string and then would columnise automatically? Is that the case?
Thank you in advance for your advice.
Best wishes,
Rebecca
Hi Rebecca – I’m sorry…I missed your comment.
You should be able to just start storing in Mongo (in a new data store/db) and it will collate things for you based on the data you are sending it. You do need to tell it that you want 'date' to be stored in a field titled 'date', etc., though.
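To make that a little more concrete, here is a minimal sketch of pulling named fields out of the raw tweet before handing Mongo the document. The field names ('date', 'user_name', 'tweet', 'geo') are just examples, not anything defined in the post, and the datajson / db objects come from the script above:

# Inside on_data(), after datajson = json.loads(data)
doc = {
    'date': datajson.get('created_at'),
    'user_name': datajson.get('user', {}).get('screen_name'),
    'tweet': datajson.get('text'),
    'geo': datajson.get('geo'),
}
db.twitter_search.insert_one(doc)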
Hi Eric, great post. I have a question. If I am going to adapt this methodology to create a huge project that seeks to use big data (i.e., tweets, blog posts, and so on) in a data lake, and I want to continually store the metadata for a Twitter search, would you advise saving the search results directly into MongoDB? Or should I use a JSON file format to save the data initially and pick out the information I care about per-time into a database, probably MongoDB too, or MySQL? I am also thinking that if I am going…
This is a good question. I guess it really depends on whether you are going to be doing this via streaming (e.g., real-time) or batch (e.g., every few minutes). Ultimately, the data would want to go into some form of data storage (MongoDB, MySQL, etc.) for easier access for analysis. If you are grabbing streaming data, throw it into a real-time store, access it later for analysis, and then store it in your longer-term store. If batch, skip the initial storage step and just grab the data, analyze it, store what you need, and skip the…
Why does this code only collect the data of the past 8 hours or so?
What if I want to collect the tweets in a backward fashion, that is, in the manner in which we see them on Twitter?
This code only collects tweets for the time it is running, using the streaming API. If you want to grab historical tweets, you'll need to use the search API, but that will only return a certain number (it used to be 1,000 tweets) instead of the entire history. To get full histories from Twitter, you'll need to scrape Twitter and/or pay for firehose access (which is cost-prohibitive to anyone but the largest companies).
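For reference, a rough sketch of what a search API pull might look like, using the same tweepy 3.x style as the script above (it reuses the auth object defined there; the query and counts are just examples, and newer tweepy versions renamed this endpoint to search_tweets):

# Reuse the auth object defined earlier in the script
api = tweepy.API(auth, wait_on_rate_limit=True)

# Page through recent tweets matching a query; the search API only reaches
# back over a limited window of recent tweets, not the full history
for tweet in tweepy.Cursor(api.search, q='#datascience', count=100).items(500):
    print(tweet.created_at, tweet.text)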
Hi,
For how long will the streaming API run, and how can I stop it and view my data?
It will run as long as the script is running. You can view your data at any time.
Hey, I am trying to run this code and getting an error: MongoClient module not found. Could you please help me out with this? I have installed the latest version of pymongo using pip.
Do you have Mongo installed?
Eric,
An excellent tutorial. The code works as it is. I am new to MongoDB and after running the code, I added two lines to the above script:
#See what databases are available under MongoDB
dbs = MongoClient().database_names()
dbs
Later, I fired up MongoDB's Compass tool and it listed the twitterdb database. Thank you.
Excellent. Glad to see it worked.
Hi,
this is simply superb!
Thank you.
Eric
Thanks for this tutorial. I am a lecturer at a UK uni and we are looking to start a new unit next year that will include this type of data analysis, and we will be using MongoDB. So perfect timing, as we have to plan the unit now!
Glad to be of some service.
Hey! Thanks a lot for this. I am having a little issue. When I run it, this is what it shows: Tracking: ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot'] You are now connected to the streaming API. Tweet collected at Thu Apr 12 06:09:36 +0000 2018 localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it Tweet collected at Thu Apr 12 06:09:36 +0000 2018 localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it Tweet collected at Thu Apr 12 06:09:36 +0000 2018 localhost:27017: [WinError 10061] No connection could be…
That most likely means that your MongoDB isn't running or isn't configured correctly.
Dr. Brown, I’m using the tweet text to do sentiment analysis.
# Grab the tweet 'text' data from the tweet to use for the VADER sentiment analyzer
text = datajson['text']
analyzer = SentimentIntensityAnalyzer()
vs = analyzer.polarity_scores(text)
I want to insert this additional data into the same document as each tweet, but I can't seem to figure out how to insert it into the same document. Can you help out?
I’m not sure I understand the question.
What additional text are you trying to insert?
Hey Eric,
Nice tutorial. I just realized that “created_at” is stored as a string in MongoDB.
It seems to have the following format: %a %b %d %H:%M:%S +0000 %Y
I was thinking about strptime in python, but isn’t there a way to do it in MongoDB as well?
Best Wuff
When you insert by column you can set a format. For example, you could use something like this: db['created_at'].insert({"date": d}). That would require you to switch from inserting the full JSON data to inserting by column.
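For anyone wanting to do the conversion on the Python side before the insert, a rough sketch (assuming the datajson and db objects from the script above, and the date format quoted in this thread) might look like this:

from datetime import datetime

# Convert Twitter's created_at string into a real datetime object;
# PyMongo stores datetime values as BSON dates, which sort and query chronologically
datajson['created_at'] = datetime.strptime(
    datajson['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
db.twitter_search.insert_one(datajson)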
Hi, I'm working on a project of sentiment analysis based on tweet analysis, and I'm looking for the right set of tools to build an online dashboard accessible to my clients. For now I just store daily batches on AWS S3 and access them with Python to do the pre-processing and modelling. Would you say MongoDB storage with Elasticsearch + Kibana running on top of it is a good approach? If so (I have no experience with MongoDB), should I store the raw data in a first collection just as you did in this demo and create…
Hi JB –
My approach to this is to handle all the modeling / processing in the back-end with Python and store the outputs of the processing in a database (I use MySQL). I then present the data via Flask / jQuery.
I don’t know exactly what you are doing and/or if you need real-time presentation on the front-end. If you don’t need real-time, you could take a similar approach.
Feel free to ping with more questions…happy to help if I can.
Hi all, how do I incorporate language translation using goslate for 'text' and a new field called text1 if the lang field is not 'en'?
I'd collect the tweet, store it, then run goslate on the text field and translate it (and store the result in a new field).
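A rough sketch of that inside on_data(), assuming the goslate package is installed and using the datajson and db objects from the script above (goslate relies on Google's public translation endpoints, so treat this as illustrative rather than production-ready):

import goslate

gs = goslate.Goslate()

# After datajson = json.loads(data): translate non-English tweets into English
# and store the translation in a new 'text1' field alongside the original text
if datajson.get('lang') != 'en':
    datajson['text1'] = gs.translate(datajson.get('text', ''), 'en')
db.twitter_search.insert_one(datajson)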
First of all, I would like to appreciate your effort in providing such simple but elegant code. But I have a question. I stored tweets successfully in the twitter_search collection in the twitterdb database. Then, when I try to collect only those tweets whose "created_at" lies within a range, for example from_date="Fri Mar 29 10:03:37 +0000 2019" to to_date="Fri Mar 29 08:26:08 +0000 2019", I could not get these tweets. Can you please explain?
I really don't know what to say here. If the data is stored in MongoDB, you just need to run some queries against the correct collection to get the data back out. I really can't help with specific queries to get select data from the database.
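One thing worth noting (see the created_at discussion above): the script stores created_at as Twitter's raw string, which doesn't compare chronologically, so range queries on it won't behave as expected. A rough sketch of a date-range query, assuming created_at has instead been stored as a real datetime and reusing the db object from the script:

from datetime import datetime

from_date = datetime(2019, 3, 29, 8, 26, 8)
to_date = datetime(2019, 3, 29, 10, 3, 37)

# Find tweets whose created_at falls inside the range (requires datetime-typed values)
cursor = db.twitter_search.find(
    {'created_at': {'$gte': from_date, '$lte': to_date}})
for tweet in cursor:
    print(tweet['created_at'], tweet.get('text'))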
Hi Mr. Brown, many thanks for your post. After many attempts I connected MongoDB with your script in Jupyter. Now I have data stored and I'm ready to analyze it.
The point is, if I need to execute the script, for example every 12 hours, to collect tweets (in my project I need to get hashtags and keywords on a specific topic), what is the second step to create a specific rule to do it? I'm reading other comments and I saw "Flask" or a web app?
Thanks for the clue and your help.
Cheers
marcusRB
If you are asking how to run the Twitter collection script regularly, you would just run a cron job (if on Linux) to have the collection script kick off and run. You could also build some functionality into the script to 'always run' and then just collect the tweets every X hours.
Where does it store the tweets? I cannot find them. Sorry, I'm totally new at this.
The tweets are stored in MongoDB. You'll need to access them using standard MongoDB query capabilities.