Month: February 2019

Quick Tip – Speed up Pandas using Modin

I ran across a neat little library called Modin recently that claims to run pandas faster. The one line sentence that they use to describe the project is:

Speed up your Pandas workflows by changing a single line of code

Interesting…and important if true.

Using modin only requires importing modin instead of pandas and thats it…no other changes required to your existing code.

One caveat – modin currently uses pandas 0.20.3 (at least it installs pandas 0.20. when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to wait on checking out modin – or play around with trying to get it to work with the latest version of pandas (I haven’t done that yet).

To install modin:

To use modin:

That’s it.  Rather than import pandas as pd you import modin.pandas as pd and you get all the advantages of additional speed.

read_csv_benchmark from Modin
A Read CSV Benchmark provided by Modin

According to the documentation, modin takes advantage of multi-cores on modern machines, which pandas does not do. From their website:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.

Let’s give is a shot and see how it works.

For this test, I’m going to try out their read_csv method since its something they highlight. For this test, I have a 105 MB csv file. Lets time both pandas and modin and see how things work.

We’ll start with pandas.

With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.

Now, lets take a look at modin.

Before continuing, I should share that I had to do a couple extra steps to get modin to work beyond just pip install modin. I had to install typing and dask as well.

Using the exact same code as above (except one minor change to import modin — import modin.pandas as pd.

With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.

Using modin – in this example – I was able to shave off 0.3 seconds from the average read time for reading in that 105MB csv file. That may not seem like a lot of time, but it is a savings of around 27%. Just imagine if you’ve got 5000 csv files to read in that are of similar size, that’s a savings of 1500 seconds on average…that’s 25 minutes of time saved in just reading files.

Modin uses Ray to speed pandas up, so there could be even more savings if you get in and play around with some of the settings of Ray.

I’ll be looking at modin more in the future to use in some of my projects to help gain some efficiencies.  Take a look at it and let me know what you think.