Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers and various other types of analysis.

In order to do this type of analysis, you first need some data to analyze. You can scrape websites like Twitter or Facebook with simple web scrapers, but I’ve always found it easier to use the APIs that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real time and storing it for analysis. Twitter also has a Search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but it’s been a while since I’ve looked at the Search API). I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your chosen keywords and then stores those tweets somewhere for later analysis.

There are tons of ways to connect to the Streaming API, and quite a few Twitter API wrappers for Python (most of them work very well). I tend to use Tweepy more than the others due to its ease of use and simple structure. For a small / short-term project, I usually reach for MongoDB to store the tweets, via the PyMongo module. For larger / longer-term projects I connect the streaming script to MySQL instead, simply because MySQL fits into my ecosystem of backup scripts better than MongoDB does. MongoDB is perfectly suited to this type of work for larger projects too…I just tend to swing toward MySQL for them.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces and parts, but I don’t want to discount the help I had from other people / sites in building it.


Let’s set up our imports:

Next, set up your MongoDB path:
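A minimal sketch — the variable name `MONGO_HOST` is my choice here, and the trailing `twitterdb` is the database the tweets will land in:

```python
# Connection string for a local MongoDB instance. The 'twitterdb' database
# doesn't need to exist yet -- MongoDB creates it on the first insert.
MONGO_HOST = 'mongodb://localhost/twitterdb'
```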

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases separated by commas.

Here, I’m listening for words related to machine learning, data science, etc.
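The tracking list from my run looked like this (the `WORDS` name is just a convention — any list of strings works):

```python
# Keywords/hashtags to 'listen' for; the Streaming API returns any tweet
# that matches at least one entry in the list.
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
```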

Next, let’s set up our Twitter API access information. You can set these up here.
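Roughly like so — replace the placeholders with your own keys and tokens (and don’t commit the real values to version control):

```python
# Twitter application credentials -- fill these in with your own values.
CONSUMER_KEY = "XXXXXXXXXXXXXXXXXXX"
CONSUMER_SECRET = "XXXXXXXXXXXXXXXXXXX"
ACCESS_TOKEN = "XXXXXXXXXXXXXXXXXXX"
ACCESS_TOKEN_SECRET = "XXXXXXXXXXXXXXXXXXX"
```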

Time to build the listener class.

Now that we have the listener class, let’s set everything up to start listening.

Now you are ready to go. The full script is below. Save it as “streaming_API.py” and run it with “python streaming_API.py” – assuming you’ve set up MongoDB and your Twitter API keys correctly, you should start collecting tweets.

The Full Script:

 

Comments
Haytham Amin

Can you share a MYSQL implementation of the same flow please?

Rachel Solomon

Hi, was just wondering if you wanted to look at the data collected, how would you query it with the find() method? Or is there another way entirely?
Great post btw, very helpful!

gunner

Hi,
I did collect and insert the data into MongoDB. I then transferred the data from MongoDB to a JSON file. Now, the file seems corrupt since it can’t be opened in a Python notebook. The error is as follows:

JSONDecodeError: Extra data: line 2 column 1 (char 3656)

sylvain

When I run the program, it says “an integer is required”! I can’t fix this problem. Any help?

Tracking: [‘#bigdata’, ‘#AI’, ‘#datascience’, ‘#machinelearning’, ‘#ml’, ‘#iot’]
You are now connected to the streaming API.
Tweet collected at 2017-04-29 12:56:22+00:00
an integer is required
Tweet collected at 2017-04-29 12:56:23+00:00
an integer is required

paks

Hi. I’d like to know if you’ve tried continuously querying data from the DB while the stream is still running.

I’m currently working on a trend detection model, and I’m also working with MongoDB as my database and the Python module Tweepy.

Rebecca Cunningham

Hi Eric, thank you, this is very helpful. I’m trying my hardest to get a Twitter API stream flowing into MongoDB. In regards to setting up the MongoDB, do I have to tell it how to collate things – for example the Twitter columns? I thought that, since it’s pulling in JSON documents, it would automatically put time, date, user name, IP, tweet, geolocation etc. into strings and columnise them automatically? Is that the case?
Thank you in advance for your advice.
best wishes
Rebecca

Adekunle Babatunde

Hi Eric, great post. I have a question. If I am going to adapt this methodology to create a huge project, that seeks to use big data i.e tweets, blog posts and co. in a data lake, – I will want to continually store the metadata as regards a twitter search – would you advise saving the search directly into MongoDB or I should use a JSON file format to save the data initially and pick out the information I care about per-time into a Database probably MongoDB too or Mysql. I am also thinking that if I am going… Read more »

Divya

Why does this code only collect the data of the past 8 hours or so?
What if I want to collect the tweets backward in time, i.e. in the order in which we see them on Twitter?

Murtaza

Hi,
For how long will the streaming API run, and how can I stop it and view my data?

hurriat

Hey, I am trying to run this code and getting a “MongoClient module not found” error. Could you please help me out with this? I have installed the latest version of pymongo using pip.

Nadkalpur Manjunath

Eric,
An excellent tutorial. The code works as it is. I am new to MongoDB and after running the code, I added two lines to the above script:
#See what databases are available under MongoDB
dbs = MongoClient().database_names()
dbs
Later, I fired up Compass tool of MongoDB and it listed the twitterdb database. Thank you.

Jude Kelvin

Hi,
this is simply superb!
Thank you.

Mark Venn

Eric
Thanks for this tutorial. I am a lecturer at a UK uni and we are looking to start a new unit next year that will include this type of data analysis, and we will be using MongoDB. So perfect timing, as we have to plan the unit now!

trackback

[…] just used a script that I found here. Indeed, the code is already working as it is, and I just needed to fill it with my information to […]

Jainil

Hey! Thanks a lot for this. I am having a little issue. When I run it, this is what it shows:

Tracking: [‘#bigdata’, ‘#AI’, ‘#datascience’, ‘#machinelearning’, ‘#ml’, ‘#iot’]
You are now connected to the streaming API.
Tweet collected at Thu Apr 12 06:09:36 +0000 2018
localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
Tweet collected at Thu Apr 12 06:09:36 +0000 2018
localhost:27017: [WinError 10061] No connection could be made because the target machine actively refused it
Tweet collected at Thu Apr 12 06:09:36 +0000 2018
localhost:27017: [WinError 10061] No connection could be… Read more »

Miguel

Dr. Brown, I’m using the tweet text to do sentiment analysis.

#grab the tweet text data from the tweet to use for the VADER sentiment analyser
text = datajson['text']
analyzer = SentimentIntensityAnalyzer()
vs = analyzer.polarity_scores(text)

I want to insert this additional data into the same document as each tweet, but I can’t seem to understand how to do the insert. Can you help out?

Bbo Wuff

Hey Eric,

Nice tutorial. I just realized that “created_at” is stored as a string in MongoDB.
Seems to have following format: %a %b %d %H:%M:%S +0000 %Y

I was thinking about strptime in python, but isn’t there a way to do it in MongoDB as well?

Best Wuff

JB

Hi, I’m working on a sentiment analysis project based on tweet analysis, and I’m looking for the right set of tools to build an online dashboard accessible to my clients. For now I just store daily batches on AWS S3 and access them with Python to do the pre-processing & modelling. Would you say MongoDB storage with Elasticsearch + Kibana running on top of it is a good approach? If so (I have no experience with MongoDB), should I store the raw data in a first collection just as you did in this demo and create… Read more »

Tony

Hi all, how do I incorporate language translation using goslate for ‘text’, adding a new field called text1 if the lang field is not ‘en’?

Saswata Roy

First of all, I would like to appreciate your effort to provide such a simple but elegant piece of code. But I have a question. I stored tweets successfully in the twitter_search collection in the twitterdb database. Then when I try to collect only those tweets whose “created_at” lies between a range, for example from_date=”Fri Mar 29 10:03:37 +0000 2019″ to to_date=”Fri Mar 29 08:26:08 +0000 2019″, I could not get these tweets. Can you please explain?

marcusRB

Hi Mr. Brown, many thanks for your post. After many attempts I connected MongoDB with your script in Jupyter. Now I have the data stored and I’m ready to analyze it.
The point is, if I need to execute the script every 12h, say, to collect tweets (in my project I need to get hashtags and keywords on a specific topic), what is the second step to create a specific rule to do it? I’m reading other comments and I saw “Flask” or a web app?

Thanks for the clue and your help.
Cheers
marcusRB

Pablo Augusto Correa Causa

Where does it store the tweets? I cannot find them. Sorry, I’m totally new to this.