For years I’ve used the mysql-python library for connecting to MySQL databases. It’s worked well for me, but there are times when you need speed and/or better connection management than what you get with mysql-python. That’s where SQLAlchemy comes in.
Before diving into this: if what you’re doing isn’t dependent on speed (e.g., it doesn’t matter if it takes 1 second to connect to the database, grab your data and close the connection), then you can easily ignore this tip. That said, if you have multiple connections, that connect time can add up.
For example, I recently had an issue where it was taking 4.5+ seconds to connect to a database, run the analysis and spit out the results. That’s not terrible if it’s something only you will use, but if it’s a production system where speed is a requirement, that might be too long (and it IS too long).
When I did some analysis using python’s timer(), I found that more than 50% of that 4.5 seconds was spent establishing database connections, so I grabbed my trusty SQLAlchemy toolkit and went to work.
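As a rough illustration, here’s a minimal sketch of that kind of measurement using time.time(); the credentials and table name are placeholders, not the exact timing code from my analysis:

```python
import time

import MySQLdb
import pandas.io.sql as psql

# placeholder connection details – substitute your own
HOST, USER, PW, DBNAME = 'DBHOST', 'USER', 'PW', 'DB'

# time establishing the connection separately from running the query
start = time.time()
db = MySQLdb.connect(host=HOST, user=USER, passwd=PW, db=DBNAME)
connected = time.time()

df = psql.read_sql("select * from TABLENAME", con=db)
db.close()
finished = time.time()

print("connect: %.3fs, query + close: %.3fs" % (connected - start, finished - connected))
```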
For those of you that don’t know, SQLAlchemy is a ‘python SQL toolkit and Object Relational Mapper’ (ORM) that is supposed to make things easier when working with SQL databases. For me, the ORM aspect tends to make things more difficult, so I tend to stick with plain SQL queries, but the SQL toolkit aspect of SQLAlchemy makes a lot of sense and adds some time savings when connecting to a SQL database.
Before we get into the SQLAlchemy aspects, let’s take a second to look at how to connect to a SQL database with the mysql-python connector (or at least how I do it).
First, let’s set up our import statements. For this, we will import MySQLdb, pandas and pandas.io.sql in order to read SQL data directly into a pandas dataframe.
```python
import pandas as pd
import MySQLdb
import pandas.io.sql as psql
```
Next, let’s create a database connection, create a query, execute that query and close the connection.
```python
# set up the database connection. There's no need to set up cursors with pandas psql.
db = MySQLdb.connect(host=HOST, user=USER, passwd=PW, db=DBNAME)

# create the query
query = "select * from TABLENAME"

# execute the query and assign it to a pandas dataframe
df = psql.read_sql(query, con=db)

# close the database connection
db.close()
```
This is a fairly standard approach to reading data from MySQL into a pandas dataframe using mysql-python, and it’s the approach I had been using when I was seeing the 4.5+ seconds discussed above. Note – there were multiple database calls and some analysis included in that 4.5+ seconds. A basic database call like the one above ran in approximately 0.45 seconds in the code I was trying to speed up, and establishing the database connection was the majority of that time.
To improve performance – especially if you will be making multiple calls to multiple tables – you can use SQLAlchemy with pandas. You’ll need to pip install sqlalchemy if you don’t have it installed already. Now, let’s set up our imports:
```python
import pandas as pd
import sqlalchemy as sql
```
Now you can set up your connection string for SQLAlchemy. Putting everything together, it looks like the following:
```python
connect_string = 'mysql://USER:PW@DBHOST/DB'
```
where USER is your username, PW is your password, DBHOST is the database host and DB is the database you want to connect to.
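A related caveat: if your password contains URL-special characters such as @ or /, a connection string built this way will be parsed incorrectly. One way to handle that is to URL-escape the password first – a small sketch (the password below is made up):

```python
from urllib.parse import quote_plus  # on Python 2: from urllib import quote_plus

# URL-escape a password containing special characters before building the string
pw = quote_plus('p@ss/word!')
connect_string = 'mysql://USER:{}@DBHOST/DB'.format(pw)
```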
To set up the persistent connection, you do the following:
```python
sql_engine = sql.create_engine(connect_string)
```
Now you have a connection to your database and you’re ready to go. No need to worry about cursors or opening/closing database connections – SQLAlchemy handles the connection management for you.
Now all you need to do is focus on your SQL queries and loading the results into a pandas dataframe.
```python
query = "select * from TABLENAME"
df = pd.read_sql_query(query, sql_engine)
```
That’s all it takes. AND…it’s faster. In the example above, my database setup / connection / query / closing times dropped from 0.45 seconds to 0.15 seconds. Times will vary based on what data you are querying and where the database lives, of course, but in this case all things were the same except for mysql-python being replaced with SQLAlchemy and using the new(ish) read_sql_query function in pandas.
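The savings come from the engine’s connection pool: repeated calls check connections out of the pool and return them when done, rather than paying the connect cost each time. A quick sketch of that pattern (the table names here are hypothetical):

```python
# each query reuses a pooled connection instead of reconnecting to MySQL
df_orders = pd.read_sql_query("select * from orders", sql_engine)
df_users = pd.read_sql_query("select * from users", sql_engine)
df_joined = pd.read_sql_query(
    "select o.*, u.name from orders o join users u on o.user_id = u.id",
    sql_engine,
)
```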
Using this approach, the 4.5+ seconds it took to grab data, analyze the data and return the data was reduced to about 1.5 seconds. Impressive gains for just switching out the connection/management method.
you may simplify the manual creation of the sqlalchemy engine by just passing the connect_string to pd, for example:

pd.read_sql("select * from TABLENAME", 'mysql://USER:PW@DBHOST/DB')

btw, when reading a whole table, you may also just give the table name, for example:

pd.read_sql("TABLENAME", 'mysql://USER:PW@DBHOST/DB')
I tend to set variables up rather than do things like this as the variables make things easier to re-use. This is a good tip though. Appreciate it.
I’ve found even greater performance gains can be realized by passing a SQLAlchemy Connection object to pd.read_sql() rather than passing in the engine.
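A sketch of what that might look like, reusing the sql_engine created earlier:

```python
# explicitly check a Connection out of the engine's pool and hand it to pandas;
# the with-block returns the connection to the pool when it exits
with sql_engine.connect() as connection:
    df = pd.read_sql_query("select * from TABLENAME", connection)
```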
Thanks Isaac. Good information.
Hello Eric, thanks for the Quicktip.
What would be your approach to upsert (update+insert) a table in MySQL from a pandas data frame?
Can you set the relationships of a table and, if required, composite keys as a schema for the `to_sql` function?
Thanks in advance!
You just hit on the biggest issue I have with pandas… its inability to do much with SQL databases. You can do the `to_sql()` to insert data but you can really only append, replace or fail. I would have hoped by now there would be better methods for that in pandas but I’ve not run across them. Note to others – if you know of any changes that allow more than append, replace or fail, please do let me know. What I do is convert the pandas dataframe to a list and then insert the list and manage the upsert…
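A rough sketch of that workaround, assuming a MySQL table with a primary key and using INSERT … ON DUPLICATE KEY UPDATE; the table and column names here are made up:

```python
# dump the dataframe to a list of plain tuples
rows = [tuple(x) for x in df.values]

upsert = (
    "INSERT INTO TABLENAME (id, value) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE value = VALUES(value)"
)

# use the engine's raw DBAPI connection for an executemany-style upsert
raw_conn = sql_engine.raw_connection()
try:
    cursor = raw_conn.cursor()
    cursor.executemany(upsert, rows)
    raw_conn.commit()
finally:
    raw_conn.close()
```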
Hello Eric, I am using the sqlalchemy method .connect().execute() to execute the required SQL operations for the tables in the MySQL database. Unfortunately, this process seems to be quite slow, even if you are dealing with small datasets. I assume it depends on the number of connection calls. You need at least 3: pushing the data from pandas to MySQL, upserting and pulling the data back into a pandas data frame. Is the 'schema' parameter a possibility to create more customized tables? I have not been able to find an example for now… to_sql(self, name, con, schema=None, if_exists='fail', index=True, index_label=None,…
It’s my understanding that ‘schema’ is strictly to set the flavor of your database (e.g., mysql, postgres, etc). I don’t know of any way to use it to customize things.
why is everyone still using Python 2.x?
If you want these examples to work under Python 3 you need to make some modifications, since for some – to me unknown – reasons, MySQLdb is not available under Python 3.
Stackexchange came to my rescue as usual; so far I’ve had some success connecting to a MySQL server with this tip: https://stackoverflow.com/questions/14164183/python-3-and-mysql-through-sqlalchemy
“I was successful in getting Oracle’s MySQL connector for python working with SQLAlchemy on Python 3.3. Your connection string needs to start with 'mysql+mysqlconnector://…'. After I changed my connection string everything (well, simple things) started working.”
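In other words, the only change to the earlier example would be the driver prefix in the connection string – a sketch, assuming mysql-connector-python is installed (credentials are placeholders as before):

```python
import pandas as pd
import sqlalchemy as sql

# Python 3: route SQLAlchemy through the mysql-connector-python driver
connect_string = 'mysql+mysqlconnector://USER:PW@DBHOST/DB'
sql_engine = sql.create_engine(connect_string)

df = pd.read_sql_query("select * from TABLENAME", sql_engine)
```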
Simple answer: Because many places are still using Python 2.x in production. End of life is coming up but at least half the companies I work with are still using 2.7.x in production in some form. Sure, in an ideal world we’d always run on the latest/greatest but technical debt exists and we aren’t in an ideal world.
Very helpful post! Thank you!