Tag: mysql

Quick Tip: SQLAlchemy for MySQL and Pandas

SQLAlchemy LogoFor years I’ve used the mysql-python library for connecting to mysql databases.  It’s worked well for me over the years but there are times when you need speed and/or better connection management that what you get with mysql-python.  That’s where SQLAlchemy comes in.

Before diving into this, if you are doing things that aren’t dependent on speed (e.g., it doesn’t matter if it takes 1 second to connect to the database and grab your data and close the database) then you can easily ignore this tip. That said, if you have multiple connections, that connect time can add up.

For example, I recently had an issue where it was taking 4.5+ seconds to connect to a database, run analysis and spit out the results. That’s not terrible if its something for you only but if its a production system and speed is a requirement, that might be too long (and it IS too long).

When I did some analysis using python’s timer() I found that more than 50% of that 4.5 seconds time was in establishing database connections so I grabbed my trusty SQLAlchemy toolkit and went to work.

For those of you that don’t know, SQLAlchemy is a ‘python SQL toolkit and Object Relational Mapper’ (ORM) that is supposed to make things easier when working with SQL databases. For me, the ORM aspect tends to make things more difficult so I tend to stick with plain SQL queries but the SQL toolkit aspect of SQLAlchemy makes a lot of sense and add some time savings when connecting to a SQL database.

Before we get into the SQLAlchemy aspects, let’s take a second to look at how to connect to a SQL database with the mysql-python connector (or at least take a look at how I do it).

First, let’s setup our import statements. For this, we will import MySQLdb, pandas and pandas.io.sql in order to read SQL data directly into a pandas dataframe.

import pandas as pd
import MySQLdb
import pandas.io.sql as psql

Next, let’s create a database connection, create a query, execute that query and close that database.

# setup the database connection.  There's no need to setup cursors with pandas psql.
db=MySQLdb.connect(host=HOST, user=USER, passwd=PW, db=DBNAME)
# create the query
query = "select * from TABLENAME"
# execute the query and assign it to a pandas dataframe
df = psql.read_sql(query, con=db)
# close the database connection

This is a fairly standard approach to reading data into a pandas dataframe from mysql using mysql-python.  This approach is what I had been using before when I was getting 4.5+ seconds as discussed above. Note – there were multiple database calls and some analysis included in that 4.5+ seconds. A basic database call like the above ran in approximately 0.45 seconds in my code that I was trying to improve performance on and establishing the database connection was the majority of that time.

 To improve performance – especially if you will have multiple calls to multiple tables, you can use SQLAlchemy with pandas.   You’ll need to pip install sqlalchemy if you don’t have it installed already. Now, let’s setup our imports:

import pandas as pd
import sqlalchemy as sql

Now you can setup your connection string to your database for SQLAlchemy, you’d put everything together like the following:

connect_string = 'mysql://USER:PW@DBHOST/DB'

where USER is your username, PW is your password, DBHOST is the database host and  DB is the database you want to connect to.

To setup the persistent connection, you do the following:

sql_engine = sql.create_engine(connect_string)

Now, you have a connection to your database and you’re ready to go. No need to worry about cursors or opening/closing database connections. SQLAlchemy keeps the connection management aspects in for you.

Now all you need to do is focus on your SQL queries and loading the results into a pandas dataframe.

query =query = "select * from TABLENAME"
df = pd.read_sql_query(query, sql_engine)

That’s all it takes.  AND…it’s faster.  In the example above, my database setup / connection / query / closing times dropped from 0.45 seconds to 0.15 seconds.  Times will vary based on what data you are querying and where the database is of course but in this case, all things were the same except for mysql-python being replaced with SQLAlchemy and using the new(ish) read_sql_query function in pandas.

Using this approach, the 4.5+ seconds it took to grab data, analyze the data and return the data was reduced to about 1.5 seconds. Impressive gains for just switching out the connection/management method.

Collecting / Storing Tweets with Python and MySQL

A few days ago, I published Collecting / Storing Tweets with Python and MongoDB. In that post, I describe the steps needed to collect and store tweets gathered via the Twitter Streaming API.

I received a comment on that post asking how to store data into MySQL instead of MongoDB. Here’s what you’d need to do to make that change.

Collecting / Storing Tweets with Python and MySQL

In the previous post, we store the twitter data into MongoDB with one line of code:


Unlike MongoDB where we can insert the entire json object, if we want to use MySQL instead of MongoDB, we need to do some additional work with the ‘datajson’ object before storing it.

Let’s assume that we are interested in just capturing the username, date, Tweet and Tweet ID from twitter.  This is most likely the bare minimum information you’d want to capture from the API…there are many (many) more fields available but for now, we’ll go with these four.

Note: I’ll hold off on the MySQL specific changes for now and touch on them shortly.

Once you capture the tweet (line 38 in my original script) and have it stored in your datajson object, create a few variables to store the date, username, Tweet and ID.

text = datajson['text']
screen_name = datajson['user']['screen_name']
tweet_id = datajson['id']
created_at = parser.parse(datajson['created_at'])

Note in the above that we are using the parser.parse() command from the dateutil  module to parse the created_at date for storage into Mysql.

Now that we have our variables ready to go, we’ll need to store those variables into MySQL.  Before we can do that, we need to set up a MySQL connection. I use the python-mysql connector but you are free to use what you need to.  You’ll need to do a import MySQLdb  to get the connector imported into your script (and assuming you installed the connector with pip install mysql-python.

You’ll need to create a table to store this data. You can use the sql statement below to do so if you need assistance / guidance.

CREATE TABLE `twitter` (
 `tweet_id` varchar(250) DEFAULT NULL,
 `screen_name` varchar(128) DEFAULT NULL,
 `created_at` timestamp NULL DEFAULT NULL,
 `text` text,

Now, let’s set up our MySQL connection, query and execute/commit for the script. I’ll use a function for this to be able to re-use it for each tweet captured.

def store_data(created_at, text, screen_name, tweet_id):
    db=MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO twitter (tweet_id, screen_name, created_at, text) VALUES (%s, %s, %s, %s)"
    cursor.execute(insert_query, (tweet_id, screen_name, created_at, text))

That’s it.  You are now collecting tweets and storing those tweets into a MySQL database.

Full Script

from __future__ import print_function
import tweepy
import json
import MySQLdb 
from dateutil import parser
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
# This function takes the 'created_at', 'text', 'screen_name' and 'tweet_id' and stores it
# into a MySQL database
def store_data(created_at, text, screen_name, tweet_id):
    db=MySQLdb.connect(host=HOST, user=USER, passwd=PASSWD, db=DATABASE, charset="utf8")
    cursor = db.cursor()
    insert_query = "INSERT INTO twitter (tweet_id, screen_name, created_at, text) VALUES (%s, %s, %s, %s)"
    cursor.execute(insert_query, (tweet_id, screen_name, created_at, text))
class StreamListener(tweepy.StreamListener):    
    #This is a class provided by tweepy to access the Twitter Streaming API. 
    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")
    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occured: ' + repr(status_code))
        return False
    def on_data(self, data):
        #This is the meat of the script...it connects to your mongoDB and stores the tweet
           # Decode the JSON from Twitter
            datajson = json.loads(data)
            #grab the wanted data from the Tweet
            text = datajson['text']
            screen_name = datajson['user']['screen_name']
            tweet_id = datajson['id']
            created_at = parser.parse(datajson['created_at']) 
            #print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))
            #insert the data into the MySQL database
            store_data(created_at, text, screen_name, tweet_id)
        except Exception as e:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
#Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))


MySQL for PythonWant more information on working with MySQL and Python? Check out the book titled MySQL for Python by Albert Lukaszewski.