Quick Tip – Speed up Pandas using Modin

I ran across a neat little library called Modin recently that claims to run pandas faster. The one line sentence that they use to describe the project is:

Speed up your Pandas workflows by changing a single line of code

Interesting…and important if true.

Using modin only requires importing modin instead of pandas and thats it…no other changes required to your existing code.

One caveat – modin currently uses pandas 0.20.3 (at least it installs pandas 0.20. when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to wait on checking out modin – or play around with trying to get it to work with the latest version of pandas (I haven’t done that yet).

To install modin:

To use modin:

That’s it.  Rather than import pandas as pd you import modin.pandas as pd and you get all the advantages of additional speed.

read_csv_benchmark from Modin
A Read CSV Benchmark provided by Modin

According to the documentation, modin takes advantage of multi-cores on modern machines, which pandas does not do. From their website:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.

Let’s give is a shot and see how it works.

For this test, I’m going to try out their read_csv method since its something they highlight. For this test, I have a 105 MB csv file. Lets time both pandas and modin and see how things work.

We’ll start with pandas.

With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.

Now, lets take a look at modin.

Before continuing, I should share that I had to do a couple extra steps to get modin to work beyond just pip install modin. I had to install typing and dask as well.

Using the exact same code as above (except one minor change to import modin — import modin.pandas as pd.

With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.

Using modin – in this example – I was able to shave off 0.3 seconds from the average read time for reading in that 105MB csv file. That may not seem like a lot of time, but it is a savings of around 27%. Just imagine if you’ve got 5000 csv files to read in that are of similar size, that’s a savings of 1500 seconds on average…that’s 25 minutes of time saved in just reading files.

Modin uses Ray to speed pandas up, so there could be even more savings if you get in and play around with some of the settings of Ray.

I’ll be looking at modin more in the future to use in some of my projects to help gain some efficiencies.  Take a look at it and let me know what you think.

Quick Tip: Comparing two pandas dataframes and getting the differences

There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e.,g Comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently its still a popular ‘question’ for many people and is my most popular question on stackoverflow.

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

dataframe 1 (named df1):

dataframe 2 (named df2):

The answer, it seems, is quite simple – but I couldn’t figure it out at the time.  Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.

Here’s the code (as provided by user alko on stackoverlow):

This simple approach leads to the correct answer:

There are most likely more ‘pythonic’ answers (one suggestion is here) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is  fast enough for my needs.

Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.

Quick Tip: SQLAlchemy for MySQL and Pandas

SQLAlchemy LogoFor years I’ve used the mysql-python library for connecting to mysql databases.  It’s worked well for me over the years but there are times when you need speed and/or better connection management that what you get with mysql-python.  That’s where SQLAlchemy comes in.

Before diving into this, if you are doing things that aren’t dependent on speed (e.g., it doesn’t matter if it takes 1 second to connect to the database and grab your data and close the database) then you can easily ignore this tip. That said, if you have multiple connections, that connect time can add up.

For example, I recently had an issue where it was taking 4.5+ seconds to connect to a database, run analysis and spit out the results. That’s not terrible if its something for you only but if its a production system and speed is a requirement, that might be too long (and it IS too long).

When I did some analysis using python’s timer() I found that more than 50% of that 4.5 seconds time was in establishing database connections so I grabbed my trusty SQLAlchemy toolkit and went to work.

For those of you that don’t know, SQLAlchemy is a ‘python SQL toolkit and Object Relational Mapper’ (ORM) that is supposed to make things easier when working with SQL databases. For me, the ORM aspect tends to make things more difficult so I tend to stick with plain SQL queries but the SQL toolkit aspect of SQLAlchemy makes a lot of sense and add some time savings when connecting to a SQL database.

Before we get into the SQLAlchemy aspects, let’s take a second to look at how to connect to a SQL database with the mysql-python connector (or at least take a look at how I do it).

First, let’s setup our import statements. For this, we will import MySQLdb, pandas and pandas.io.sql in order to read SQL data directly into a pandas dataframe.

Next, let’s create a database connection, create a query, execute that query and close that database.

This is a fairly standard approach to reading data into a pandas dataframe from mysql using mysql-python.  This approach is what I had been using before when I was getting 4.5+ seconds as discussed above. Note – there were multiple database calls and some analysis included in that 4.5+ seconds. A basic database call like the above ran in approximately 0.45 seconds in my code that I was trying to improve performance on and establishing the database connection was the majority of that time.

 To improve performance – especially if you will have multiple calls to multiple tables, you can use SQLAlchemy with pandas.   You’ll need to pip install sqlalchemy if you don’t have it installed already. Now, let’s setup our imports:

Now you can setup your connection string to your database for SQLAlchemy, you’d put everything together like the following:

where USER is your username, PW is your password, DBHOST is the database host and  DB is the database you want to connect to.

To setup the persistent connection, you do the following:

Now, you have a connection to your database and you’re ready to go. No need to worry about cursors or opening/closing database connections. SQLAlchemy keeps the connection management aspects in for you.

Now all you need to do is focus on your SQL queries and loading the results into a pandas dataframe.

That’s all it takes.  AND…it’s faster.  In the example above, my database setup / connection / query / closing times dropped from 0.45 seconds to 0.15 seconds.  Times will vary based on what data you are querying and where the database is of course but in this case, all things were the same except for mysql-python being replaced with SQLAlchemy and using the new(ish) read_sql_query function in pandas.

Using this approach, the 4.5+ seconds it took to grab data, analyze the data and return the data was reduced to about 1.5 seconds. Impressive gains for just switching out the connection/management method.