
Quick Tip: SQLAlchemy for MySQL and Pandas

For years I’ve used the mysql-python library for connecting to MySQL databases. It’s worked well for me over the years, but there are times when you need speed and/or better connection management than what you get with mysql-python. That’s where SQLAlchemy comes in.

Before diving into this, if you are doing things that aren’t dependent on speed (e.g., it doesn’t matter if it takes 1 second to connect to the database and grab your data and close the database) then you can easily ignore this tip. That said, if you have multiple connections, that connect time can add up.

For example, I recently had an issue where it was taking 4.5+ seconds to connect to a database, run the analysis and spit out the results. That’s not terrible if it’s something only you use, but if it’s a production system where speed is a requirement, that might be too long (and it IS too long).

When I did some analysis using Python’s timer I found that more than 50% of that 4.5 seconds was spent establishing database connections, so I grabbed my trusty SQLAlchemy toolkit and went to work.

For those of you who don’t know, SQLAlchemy is a ‘python SQL toolkit and Object Relational Mapper’ (ORM) that is supposed to make things easier when working with SQL databases. For me, the ORM aspect tends to make things more difficult, so I tend to stick with plain SQL queries, but the SQL toolkit aspect of SQLAlchemy makes a lot of sense and adds some time savings when connecting to a SQL database.

Before we get into the SQLAlchemy aspects, let’s take a second to look at how to connect to a SQL database with the mysql-python connector (or at least take a look at how I do it).

First, let’s set up our import statements. For this, we will import MySQLdb, pandas and pandas.io.sql in order to read SQL data directly into a pandas dataframe.

import pandas as pd
import MySQLdb
import pandas.io.sql as psql

Next, let’s create a database connection, create a query, execute that query and close the database connection.

# setup the database connection.  There's no need to setup cursors with pandas psql.
db=MySQLdb.connect(host=HOST, user=USER, passwd=PW, db=DBNAME)
# create the query
query = "select * from TABLENAME"
# execute the query and assign it to a pandas dataframe
df = psql.read_sql(query, con=db)
# close the database connection
db.close()

This is a fairly standard approach to reading data into a pandas dataframe from MySQL using mysql-python. It’s the approach I had been using when I was seeing the 4.5+ second times discussed above. Note – there were multiple database calls and some analysis included in that 4.5+ seconds. A basic database call like the one above ran in approximately 0.45 seconds in the code I was trying to speed up, and establishing the database connection was the majority of that time.
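If you want to see that breakdown for yourself, a rough timing sketch along the lines below (not my original measurement code) separates the connection time from the query time:

import timeit

# rough sketch: time the connection separately from the query itself
start = timeit.default_timer()
db = MySQLdb.connect(host=HOST, user=USER, passwd=PW, db=DBNAME)
connected = timeit.default_timer()
df = psql.read_sql("select * from TABLENAME", con=db)
db.close()
done = timeit.default_timer()

print("connect: %.3fs, query + close: %.3fs" % (connected - start, done - connected))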

To improve performance – especially if you will have multiple calls to multiple tables – you can use SQLAlchemy with pandas. You’ll need to pip install sqlalchemy if you don’t have it installed already. Now, let’s set up our imports:

import pandas as pd
import sqlalchemy as sql

Now you can set up your connection string to your database. For SQLAlchemy, you’d put everything together like the following:

connect_string = 'mysql://USER:PW@DBHOST/DB'

where USER is your username, PW is your password, DBHOST is the database host and DB is the database you want to connect to.
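For example, with made-up credentials (the username, password, host and database name below are purely illustrative):

# illustrative values only -- substitute your own credentials
connect_string = 'mysql://analyst:s3cret@db.example.com/sales'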

To set up the persistent connection, you do the following:

sql_engine = sql.create_engine(connect_string)

Now, you have a connection to your database and you’re ready to go. No need to worry about cursors or opening/closing database connections – SQLAlchemy handles the connection management for you.

Now all you need to do is focus on your SQL queries and loading the results into a pandas dataframe.

query = "select * from TABLENAME"
df = pd.read_sql_query(query, sql_engine)

That’s all it takes. AND…it’s faster. In the example above, my database setup / connection / query / closing times dropped from 0.45 seconds to 0.15 seconds. Times will vary based on what data you are querying and where the database is, of course, but in this case everything was the same except that mysql-python was replaced with SQLAlchemy and the new(ish) read_sql_query function in pandas was used.

Using this approach, the 4.5+ seconds it took to grab data, analyze the data and return the data was reduced to about 1.5 seconds. Impressive gains for just switching out the connection/management method.
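Part of why this pays off when you have multiple calls is that the engine maintains a pool of connections you can keep reusing; a minimal sketch, with hypothetical table names:

# reuse the same engine for several queries -- connections come from the engine's pool
# 'trades' and 'positions' are hypothetical table names
trades = pd.read_sql_query("select * from trades", sql_engine)
positions = pd.read_sql_query("select * from positions", sql_engine)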

pandas Cheat Sheet (via yhat)

The folks over at yhat just released a cheat sheet for pandas. You can download the cheat sheet in PDF form here.

There are a couple of important functions that I use all the time missing from their cheat sheet (actually…there are a lot of things missing, but it’s a great starter cheat sheet).

A few things that I use all the time with pandas dataframes, and that are worth collecting in one place, are provided below.

Renaming columns in a pandas dataframe:

df.rename(columns={'col1': 'Column_1', 'col2': 'Column_2'}, inplace=True)

Iterating over a pandas dataframe:

for index, row in df.iterrows():
    # do stuff with index (the row label) and row (a Series of that row's values)

Splitting pandas dataframe into chunks:

The function below, plus the function call, will split a pandas dataframe (or a list, for that matter) into NUMCHUNKS chunks. I use this often when working with the multiprocessing library.

# This function creates chunks and returns them
def chunkify(lst, n):
    return [lst[i::n] for i in xrange(n)]

# split the dataframe into NUMCHUNKS chunks
chunks = chunkify(df, NUMCHUNKS)
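As a rough illustration of how the chunks might feed into multiprocessing (process_chunk below is a stand-in for whatever per-chunk work you actually need to do):

import multiprocessing

# stand-in for the real per-chunk analysis
def process_chunk(chunk):
    return len(chunk)

pool = multiprocessing.Pool(processes=NUMCHUNKS)
results = pool.map(process_chunk, chunks)
pool.close()
pool.join()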

Accessing the value of a specific cell:

This will give you the value of the last row’s “COLUMN” cell. This may not be the ‘best’ way to do it, but it gets the value.

df.COLUMN.tail(1).iloc[0]

Getting rows matching a condition:

The below will get all the rows in a pandas dataframe that match the criteria. In addition to testing for equality, you can use any of the other comparison operators.

df[df.COLUMN == Criteria]
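The same pattern works with the other comparison operators, for example:

# rows where COLUMN is greater than the criteria
df[df.COLUMN > Criteria]
# rows where COLUMN does not equal the criteria
df[df.COLUMN != Criteria]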

Getting rows matching multiple conditions:

This gets the rows that match a criteria in COLUMN1 and also match another criteria in COLUMN2.

df[(df.COLUMN1 == Criteria) & (df.COLUMN2 == Criteria_2)]

Getting the ‘next’ row of data in a pandas dataframe

I’m currently working with stock market trade data that is output from a backtesting engine (I’m working with backtrader currently) in a pandas dataframe. The format of the ‘transactions’ data that comes out of the backtesting engine is shown below.

                              amount   price          value
date
2016-01-07 00:00:00+00:00   79.017119  195.33  -15434.413883
2016-09-07 00:00:00+00:00  -79.017119  218.84   17292.106354
2016-09-20 00:00:00+00:00   82.217609  214.41  -17628.277649
2016-11-16 00:00:00+00:00  -82.217609  217.56   17887.263119

The data provided gives four crucial pieces of information:

  • Date – the date of the transaction.
  • Amount – the number of shares purchased (positive number) or sold (negative number) during the transaction.
  • Price – the price received or paid for the transaction.
  • Value – the cash value of the transaction.

Backtrader’s transactions dataframe uses two rows for each round-trip transaction (the first is the ‘buy’, the second is the ‘sell’). For example, in the data above, the first two rows (Jan 7 2016 and Sept 7 2016) are the ‘buy’ data and ‘sell’ data for one transaction. What I need to do with this data is transform it (using that term loosely) into one row of data per transaction to store in a database for use in another analysis.

I could leave it in its current form, but I prefer to store transactions in one row when dealing with market backtests.

There are a few ways to attack this particular problem.  You could iterate over the dataframe and manually pick each row. That would be pretty straightforward, but not necessarily the best way.

While looking around the web for some pointers, I stumbled across this answer that does exactly what I need to do.   I added the following code to my script and — voila — I have my transactions transformed from two rows per transaction to one row.

from itertools import tee, izip

# pair each row with the row that follows it (the itertools 'pairwise' recipe)
def next_row(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

trade_output = []
for (buy_index, buy_data), (sell_index, sell_data) in next_row(transactions.iterrows()):
    # a positive amount marks the 'buy' half of a transaction
    if buy_data['amount'] > 0:
        trade_output.append((buy_index.date(), buy_data['amount'], buy_data['price'],
                             sell_index.date(), sell_data['amount'], sell_data['price']))

Note: In the above, I only want to build rows that start with a positive amount in the ‘amount’ column because the amount in the first row of a transaction is always positive. I then append each transformed transaction to a list to be used for more analysis later in the script.