A few days ago, I published Collecting / Storing Tweets with Python and MongoDB. In that post, I describe the steps needed to collect and store tweets gathered via the Twitter Streaming API.
I received a comment on that post asking how to store data into MySQL instead of MongoDB. Here’s what you’d need to do to make that change.
Collecting / Storing Tweets with Python and MySQL
In the previous post, we stored the Twitter data into MongoDB with one line of code:
db.twitter_search.insert(datajson)
Unlike MongoDB, where we can insert the entire JSON object, MySQL requires some additional work on the ‘datajson’ object before we can store it.
Let’s assume that we are interested in just capturing the username, date, Tweet and Tweet ID from twitter. This is most likely the bare minimum information you’d want to capture from the API…there are many (many) more fields available but for now, we’ll go with these four.
Note: I’ll hold off on the MySQL specific changes for now and touch on them shortly.
Once you capture the tweet (line 38 in my original script) and have it stored in your datajson object, create a few variables to store the date, username, Tweet and ID.
text = datajson['text']
screen_name = datajson['user']['screen_name']
tweet_id = datajson['id']
created_at = parser.parse(datajson['created_at'])
Note in the above that we are using the parser.parse() function from the dateutil module to parse the created_at date for storage into MySQL.
Now that we have our variables ready to go, we’ll need to store them in MySQL. Before we can do that, we need to set up a MySQL connection. I use the python-mysql connector, but you are free to use whatever you prefer. You’ll need an import MySQLdb to get the connector into your script (assuming you installed the connector with pip install mysql-python).
You’ll need to create a table to store this data. If you need assistance / guidance, you can use the SQL statement below.
CREATE TABLE `twitter` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `tweet_id` varchar(250) DEFAULT NULL,
  `screen_name` varchar(128) DEFAULT NULL,
  `created_at` timestamp NULL DEFAULT NULL,
  `text` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=56 DEFAULT CHARSET=utf8;
Now, let’s set up our MySQL connection, query and execute/commit for the script. I’ll use a function for this to be able to re-use it for each tweet captured.
def store_data(created_at, text, screen_name, tweet_id):
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO twitter (tweet_id, screen_name, created_at, text) VALUES (%s, %s, %s, %s)"
    cursor.execute(insert_query, (tweet_id, screen_name, created_at, text))
    db.commit()
    cursor.close()
    db.close()
    return
That’s it. You are now collecting tweets and storing those tweets into a MySQL database.
Full Script
from __future__ import print_function
import tweepy
import json
import MySQLdb
from dateutil import parser

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"

HOST = "YOUR_DATABASE_HOST"
USER = "YOUR_DATABASE_USER"
PASSWD = "YOUR_DATABASE_PASSWORD"
DATABASE = "YOUR_DATABASE"

# This function takes the 'created_at', 'text', 'screen_name' and 'tweet_id'
# and stores them into a MySQL database
def store_data(created_at, text, screen_name, tweet_id):
    db = MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO twitter (tweet_id, screen_name, created_at, text) VALUES (%s, %s, %s, %s)"
    cursor.execute(insert_query, (tweet_id, screen_name, created_at, text))
    db.commit()
    cursor.close()
    db.close()
    return


class StreamListener(tweepy.StreamListener):
    # This is a class provided by tweepy to access the Twitter Streaming API.

    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")

    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False

    def on_data(self, data):
        # This is the meat of the script...it connects to your MySQL database and stores the tweet
        try:
            # Decode the JSON from Twitter
            datajson = json.loads(data)

            # grab the wanted data from the Tweet
            text = datajson['text']
            screen_name = datajson['user']['screen_name']
            tweet_id = datajson['id']
            created_at = parser.parse(datajson['created_at'])

            # print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))

            # insert the data into the MySQL database
            store_data(created_at, text, screen_name, tweet_id)
        except Exception as e:
            print(e)


auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True))
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)
Want more information on working with MySQL and Python? Check out the book titled MySQL for Python by Albert Lukaszewski.
File “tweet.py”, line 60
except Exception as e:
^
IndentationError: unindent does not match any outer indentation level
Just add a space on line 60; other than that, it works like magic.
Thanks
Thanks for the edit…always tough when you copy/paste into blogs. I’ve edited the code.
anytime, thanks for spreading the knowledge.
Also, can you tell me how to handle the geo tag? I want to store coordinates whenever available.
I added a geo tag, but I got:
(1064, 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near \': "\'Point\'", u\'coordinates\': \'(28.5383355,-81.3792365)\'})\' at line 1')
Thanks
You’ll need/want to escape some of the data.
Take a look at escape_string() in the mysql-python module. See here http://mysql-python.sourceforge.net/MySQLdb.html
I see the function mysql_escape_string(), but there is no explanation or examples for it. : ( sorry don’t mean to bother you
It can be used in a few ways, but the best way would be to pass your SQL query through it before executing.
For example:
query = MySQLdb.escape_string(query)
db.execute(query)
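That said, the parameterized cursor.execute() style used in the post handles escaping for you; the 1064 error above usually comes from building the SQL string by hand. A rough sketch of pulling coordinates out of the tweet, assuming the ‘coordinates’ field follows Twitter’s GeoJSON Point layout (extract_coordinates is a hypothetical helper, not part of the original script):

```python
def extract_coordinates(datajson):
    # Twitter's 'coordinates' field is GeoJSON: {'type': 'Point',
    # 'coordinates': [longitude, latitude]}, or None when no geotag exists
    geo = datajson.get('coordinates')
    if geo and geo.get('type') == 'Point':
        lon, lat = geo['coordinates']
        return lon, lat
    return None, None

# The values can then be passed as query parameters, e.g.:
# cursor.execute("INSERT INTO twitter (tweet_id, coordinates) VALUES (%s, %s)",
#                (tweet_id, str(extract_coordinates(datajson))))
```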
Couldn’t figure it out 🙁
So where/when does this script actually stop running? I imagine it isn’t filling up your database with a tonne of live tweets until your database can no longer run. Is there a certain number of tweets it stores?
thanks
This script is set to run until you stop it manually.
expected an indented block
db=MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
I’m not quite sure what you are asking.
Did you mean an “expected an indented block” error? If so, you always need to make sure your code has the proper spacing (indentation) for Python. When copy/pasting code, this can be an issue.
I have the same problem, did you solve it?
It’s spacing. Python is very strict with spacing / tabs. Make sure everything has the right indentation.
ModuleNotFoundError: No module named 'MySQLdb'
Awais – you need to install python-mysql. Please do a little bit of work on your own (google can be your friend).
Hi Eric,
Helpful blog, thanks.
I keep getting this error – tweet collected at 2017-02-20 19:16:36+00:00
(1292, "Incorrect datetime value: '2017-02-20 19:16:36+00:00' for column 'created_at' at row 1")
Would you be able to help me?
Hi Hans –
Looks like the column you are trying to save into on the MySQL side isn’t set up correctly to store the date/time. What does your database table look like?
Hi Eric,
Thanks for your reply. MySQL DDL:
CREATE TABLE `twitter` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tweet_id` varchar(250) DEFAULT NULL,
`screen_name` varchar(128) DEFAULT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`text` text,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=73 DEFAULT CHARSET=utf8
I experimented with changing the ‘timestamp’ to ‘varchar(250)’. Then the script works. I think the problem is that the script gives me back the following: tweet collected at 2017-03-08 10:11:11+00:00. The ‘+00:00’, which indicates the timezone, is probably the problem for the SQL database? I can’t find a way to adjust this string.
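One way to work around this without changing the column type is to convert the parsed date to naive UTC before inserting it. A sketch using only the standard library (to_mysql_datetime is a hypothetical helper; Twitter’s created_at strings look like "Wed Aug 27 13:08:45 +0000 2008"):

```python
from datetime import datetime, timezone

def to_mysql_datetime(created_at):
    # Parse Twitter's created_at format, including the +0000 offset
    dt = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    # Normalize to UTC and drop tzinfo so MySQL's TIMESTAMP column accepts it
    return dt.astimezone(timezone.utc).replace(tzinfo=None)

print(to_mysql_datetime("Wed Aug 27 13:08:45 +0000 2008"))
# 2008-08-27 13:08:45
```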
What datatype would I use for the MySql create query if I wanted to use the coordinates, user, retweet, and retweet_count parameters? Would everything essentially be returned as a string that gets processed into JSON? Thanks in advance!
I would store coordinates, user and retweet as strings and retweet_count as an int. You could store everything as a string if you wanted and then handle the content in your script when you read it out if you wanted to.
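Following that advice, the existing table could be extended with something along these lines (a sketch only; the column names here are hypothetical, matching the approach of storing everything except the count as strings):

```sql
ALTER TABLE `twitter`
  ADD COLUMN `coordinates` varchar(250) DEFAULT NULL,
  ADD COLUMN `user_json` text,
  ADD COLUMN `retweet_json` text,
  ADD COLUMN `retweet_count` int(11) DEFAULT NULL;
```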
I have a problem, I can’t install the module MySQLdb. I tried to install it with the following command: “pip install mysql-python”.
Traceback (most recent call last):
File “tweet_sql.py”, line 7, in
import MySQLdb
ImportError: No module named MySQLdb
I resolved this on Centos6 using: yum install MySQL-python
Might be different install method depending on OS
Hi Eric,
I have used this as a base for a Twitter scraper; I am collecting a lot more information from Twitter than the four columns mentioned in this post.
For some reason when I run the script it just keeps printing ‘verified’ instead of printing the time the tweet was collected. Also, none of the data is being inserted into the MySQL database I created to house the information. Any insight you have would be invaluable; thanks for this post regardless.
I just want to give a really big thanks to you, Dr. Brown – using your code, I was able to convert this to my own SQL database environment and get streaming tweets stored in SQL Server. Thanks so much!
Great to hear Kate. Glad to be of some assistance.
Love this. Started to use it again and just realised it only captures 140 characters; am wondering if there is a tweak to make the script retrieve 280-character tweets?
Great question, Gort. I’m not sure about the new 280-character limit. That might be due to a few things: it could be a limitation of the streaming API itself, the library being used to call the API, or there may be a different API call that needs to be made.
I did some research – something about setting “tweet_extended” and fetching “full_text” in the usual API, but in streaming mode this applies differently: “tweet_mode=extended will have no effect in this code, since the Streaming API does not support that parameter. If a Tweet contains longer text, it will contain an additional object in the JSON response called extended_tweet, which will in turn contain a field called full_text.” From: https://stackoverflow.com/questions/48319243/tweepy-streaming-api-full-text I did try hacking for a few hours, but it became obvious that how to extract the “full_text” field is beyond my coding knowledge… I also understand that “full_text” applies differently…
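For what it’s worth, pulling the longer text out of a streaming payload can be sketched roughly like this, assuming the extended_tweet/full_text structure described in that Stack Overflow answer (get_full_text is a hypothetical helper, not part of the original script):

```python
def get_full_text(datajson):
    # Tweets over 140 characters carry an 'extended_tweet' object
    # holding the untruncated text in 'full_text'
    if 'extended_tweet' in datajson:
        return datajson['extended_tweet']['full_text']
    # Retweets keep the extended text on the original tweet
    rt = datajson.get('retweeted_status')
    if rt and 'extended_tweet' in rt:
        return rt['extended_tweet']['full_text']
    # Fall back to the classic (possibly truncated) text field
    return datajson.get('text', '')
```

In the script, this would replace the plain text = datajson['text'] line inside on_data().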
Hi, thanks a lot for your tutorial!
I have a question: when I run the file, I get a “You are now connected to the streaming API”. How much time does it take for the database to begin filling up? Or should I do something else?
Thank you in advance!
If everything is working, you should start seeing data come in immediately when someone on Twitter uses one of your keywords.
Hi. Could you please suggest how to pass a limit for the tweet data?
You can pay for it. That’s the only way.
Is it possible to add to the script so it gives me only the latest tweets in the last X amount of time (and so it won’t bring duplicates)?
Last time I looked, there wasn’t a feature in the twitter API to allow you to do that. Handling duplicates should be easy enough when storing into the database (with the proper table structures, etc)
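On the database side, one way to make duplicates a non-issue is a UNIQUE index on tweet_id combined with INSERT IGNORE. In the script itself you could also keep a simple in-memory check; a rough sketch (is_new_tweet is a hypothetical helper, and the set resets whenever the script restarts):

```python
seen_ids = set()

def is_new_tweet(tweet_id):
    # Returns True the first time a tweet_id is seen, False afterwards
    if tweet_id in seen_ids:
        return False
    seen_ids.add(tweet_id)
    return True

# In on_data(), call store_data() only when is_new_tweet(tweet_id) is True;
# a UNIQUE KEY on twitter.tweet_id (with INSERT IGNORE) catches the rest.
```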
Hi all, I wanted to thank Dr. Brown for this really useful and informative script. I’m pretty much a total novice at Python and SQL, so dissecting something like this is very helpful in trying to figure out ‘how stuff works’. It seems mysqldb has not been updated in some time and does not run on Python 3 (according to my own experience and some posts on Stack Overflow). I used a package called ‘mysql-connector’, which has very similar methods to the mysqldb ones used in this script, in order to get this to work as intended. Just wanted to share that…
Hi Jules –
Good point. With Python 3, you’ll need to use mysql-connector. It requires very minor tweaks when setting up the connection, but after the connection is set up, it works the same.