Quick Tip – Speed up Pandas using Modin

I recently ran across a neat little library called Modin that claims to run pandas faster. The one-sentence description the project uses is:

Speed up your Pandas workflows by changing a single line of code

Interesting…and important if true.

Using modin only requires importing modin instead of pandas, and that’s it…no other changes are required to your existing code.

One caveat: modin currently uses pandas 0.20.3 (at least, a 0.20 release is what gets installed when modin is installed with pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in earlier versions, you might want to hold off on checking out modin, or experiment with getting it to work with the latest version of pandas (I haven’t done that yet).

To install modin:

pip install modin

To use modin:

import modin.pandas as pd

That’s it. Rather than import pandas as pd, you import modin.pandas as pd and you get the advantage of the additional speed.
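If the same script has to run on machines where modin may not be installed, one option is to fall back to plain pandas. The helper below is a minimal sketch of that idea; import_first_available is a hypothetical name introduced here, not part of modin or pandas.

```python
import importlib


def import_first_available(names=("modin.pandas", "pandas")):
    """Return the first module in `names` that imports cleanly.

    Hypothetical helper: tries each module name in order and falls
    through to the next one on ImportError.
    """
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these could be imported: " + ", ".join(names))


# Typical use (assumes at least pandas is installed):
# pd = import_first_available()
# df = pd.read_csv('OSMV-20190206.csv')
```

The rest of the script keeps using pd as before, whichever library ends up behind that name.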

[Figure: A read_csv benchmark provided by Modin]

According to the documentation, modin takes advantage of the multiple cores on modern machines, which pandas does not. From their website:

In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.
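To make the idea of distributing work across cores a bit more concrete, here is a toy sketch that splits a table’s rows into one contiguous slice per core. It only illustrates the general idea; it is not Modin’s actual partitioning code, and split_rows and the row count are made up for the example.

```python
import os


def split_rows(n_rows, n_parts):
    """Split range(n_rows) into n_parts contiguous (start, stop) slices.

    Slice sizes differ by at most one row; earlier slices get the extras.
    """
    base, extra = divmod(n_rows, n_parts)
    bounds = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


cores = os.cpu_count() or 1  # cpu_count() can return None
# One slice of a hypothetical 1,000,000-row table per available core
partitions = split_rows(1000000, cores)
```

Each slice could then be handed to a separate worker, which is where a library like modin would save time over processing all rows on a single core.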

Let’s give it a shot and see how it works.

For this test, I’m going to try out their read_csv method, since it’s something they highlight. I have a 105 MB csv file, so let’s time both pandas and modin and see how they compare.

We’ll start with pandas.

from functools import reduce  # reduce is not a builtin in Python 3
from timeit import default_timer as timer
import pandas as pd

# run 25 iterations of read_csv to get an average
times = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    times.append(end - start)

# print out the average time taken
# I *think* I got this little trick
# from https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, times) / len(times))

With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.
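The timing loop above can also be wrapped in a small reusable helper that reports the spread of the runs as well as the average. This is just a sketch using the standard library; time_call is a hypothetical name, and the 25 repeats simply mirror the loop above.

```python
from statistics import mean, stdev
from timeit import default_timer as timer


def time_call(func, repeats=25):
    """Call func() `repeats` times; return (mean, stdev) in seconds."""
    samples = []
    for _ in range(repeats):
        start = timer()
        func()
        samples.append(timer() - start)
    # stdev needs at least two samples
    spread = stdev(samples) if len(samples) > 1 else 0.0
    return mean(samples), spread


# Typical use, with pd being either pandas or modin.pandas:
# avg, spread = time_call(lambda: pd.read_csv('OSMV-20190206.csv'))
# print(avg, spread)
```

Looking at the spread next to the average helps show whether a difference of a few tenths of a second is larger than the run-to-run noise.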

Now, let’s take a look at modin.

Before continuing, I should share that I had to do a couple of extra steps to get modin to work beyond just pip install modin. I had to install typing and dask as well.

pip install "modin[dask]"
pip install typing

The code is exactly the same as above, except for one minor change: the import becomes import modin.pandas as pd.

from functools import reduce  # reduce is not a builtin in Python 3
from timeit import default_timer as timer
import modin.pandas as pd

# run 25 iterations of read_csv to get an average
times = []
for i in range(25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    times.append(end - start)

# print out the average time taken
# I *think* I got this little trick
# from https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, times) / len(times))

With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.

Using modin, in this example, I was able to shave 0.3 seconds off the average read time for that 105MB csv file. That may not seem like a lot of time, but it is a savings of around 24% (0.3 of 1.26 seconds). Just imagine you’ve got 5000 csv files of similar size to read in: that’s a savings of 1500 seconds on average…that’s 25 minutes saved just reading files.
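For anyone who wants to check the arithmetic, here it is worked through with the rounded averages reported above (1.26 s and 0.96 s) and the 5000-file scenario. With those rounded figures the relative saving comes out at about 24%.

```python
pandas_avg = 1.26  # average seconds per read with pandas (from the test above)
modin_avg = 0.96   # average seconds per read with modin (from the test above)
n_files = 5000     # the hypothetical batch of similar-sized files

saved_per_file = pandas_avg - modin_avg
percent_saved = saved_per_file / pandas_avg * 100
total_saved = saved_per_file * n_files

print("saved per file: %.2f s" % saved_per_file)     # saved per file: 0.30 s
print("relative saving: %.1f%%" % percent_saved)     # relative saving: 23.8%
print("total for %d files: %.0f s (%.0f min)"
      % (n_files, total_saved, total_saved / 60))    # total for 5000 files: 1500 s (25 min)
```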

Modin uses Ray to speed pandas up, so there could be even more savings if you get in and play around with some of the settings of Ray.
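As a rough sketch of what playing with the settings can look like: recent Modin releases read a MODIN_ENGINE environment variable to choose between Ray and Dask, and a MODIN_CPUS variable to cap the number of cores used. The Modin version used in this post may behave differently, so treat both names, and the values picked here, as assumptions to check against the documentation for your installed version.

```python
import os

# Assumption: these variables are honored by the installed Modin version.
# They must be set before modin is imported for the first time.
os.environ["MODIN_ENGINE"] = "dask"  # or "ray"
os.environ["MODIN_CPUS"] = "4"       # arbitrary example value

try:
    import modin.pandas as pd
except ImportError:
    pd = None  # modin is not installed; nothing to configure
```

Choosing Dask this way may also be worth a try on machines where Ray cannot be installed.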

I’ll be looking at modin more in the future to use in some of my projects to help gain some efficiencies.  Take a look at it and let me know what you think.

Miguel Alberto Plazas Wadynski
1 year ago

Excellent work!

Hossein Amini
1 year ago

Thank you for sharing

Hossein Amini
1 year ago
Reply to  Hossein Amini

Have you seen this error? Do you know how to fix it?
No matching distribution found for ray==0.6.2 (from modin)

Rey
1 year ago
Reply to  Hossein Amini

I have the same problem…
“Could not find a version that satisfies the requirement ray==0.6.2 (from modin) (from versions: ) No matching distribution found for ray==0.6.2 (from modin)”
I use Python 3.6.8 :: Anaconda, Inc.

Aaron Smyth
1 year ago
Reply to  Hossein Amini

Ray does not run on Windows! You would need to install a bridge to Linux from Windows and then run your Python instance from a bash shell. Out of the question.

Peter Andrews
1 year ago

Modin relies on Ray and Ray apparently does not exist for Windows: https://github.com/ray-project/ray/issues/631

Maria Dubyaga
11 months ago

Excellent, I see a 15 times speed increase on my data, thank you!