I ran across a neat little library called Modin recently that claims to run pandas faster. The one-line description they use for the project is:
Speed up your Pandas workflows by changing a single line of code
Interesting…and important if true.
Using modin only requires importing modin instead of pandas and that’s it…no other changes are required to your existing code.
One caveat: modin currently uses pandas 0.20.3 (at least, it installs pandas 0.20.3 when modin is installed with
pip install modin). If you’re using the latest version of pandas and need functionality that doesn’t exist in previous versions, you might need to wait on checking out modin, or play around with trying to get it to work with the latest version of pandas (I haven’t done that yet).
To install modin:
pip install modin
To use modin:
import modin.pandas as pd
That’s it. Instead of
import pandas as pd you write
import modin.pandas as pd and you get the benefit of the additional speed.
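If you want the same script to keep working on machines where modin isn’t installed, one option is to fall back to plain pandas. This is just a minimal sketch: the helper name first_importable is my own invention and not part of modin or pandas.

```python
import importlib


def first_importable(candidates=("modin.pandas", "pandas")):
    """Return the first module in `candidates` that imports cleanly, else None.

    Hypothetical helper for illustration only; it is not a modin API.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or a dependency is missing); try the next one.
            continue
    return None


# Typical use at the top of a script:
#   pd = first_importable()
#   df = pd.read_csv('OSMV-20190206.csv')
```

Because modin is meant to mirror the pandas API, the rest of the script should not need to know which of the two it got.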
According to the documentation, modin takes advantage of the multiple cores on modern machines, which pandas does not do. From their website:
In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in
read_csv, we see large gains by efficiently distributing the work across your entire machine.
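To make the idea of “distributing the work” a bit more concrete, here is a toy sketch using only the Python standard library. To be clear, this is not how modin is implemented; it only illustrates the general pattern of splitting a CSV into chunks and parsing the chunks in separate workers. The function names are my own, and the line-based split assumes no quoted field contains a newline.

```python
import csv
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_chunk(lines):
    """Parse a list of CSV lines (without the header) into rows."""
    return list(csv.reader(lines))


def split_into_chunks(lines, n_chunks):
    """Split `lines` into at most `n_chunks` contiguous pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_parse(text, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Parse CSV text by handing chunks of lines to a pool of workers.

    Assumes the first line is a header and that no quoted field spans lines.
    Pass executor_cls=ThreadPoolExecutor for a quick single-process trial.
    """
    lines = text.splitlines()
    header, body = lines[0], lines[1:]
    chunks = split_into_chunks(body, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks
        parts = list(pool.map(parse_chunk, chunks))
    rows = [row for part in parts for row in part]
    return next(csv.reader([header])), rows
```

When using the default process pool in a script, call parallel_parse from under an if __name__ == "__main__": guard so that worker processes can import the module safely.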
Let’s give it a shot and see how it works.
For this test, I’m going to try out their
read_csv method since it’s something they highlight. I have a 105 MB csv file. Let’s time both pandas and modin and see how things work.
We’ll start with pandas.
from functools import reduce  # built in on Python 2, must be imported on Python 3
from timeit import default_timer as timer

import pandas as pd

# run 25 iterations of read_csv to get an average
time = []
for i in range(0, 25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    time.append(end - start)

# print out the average time taken
# I *think* I got this little trick from
# https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, time) / len(time))
With pandas, it seems to take – on average – 1.26 seconds to read a 105MB csv file.
Now, let’s take a look at modin.
Before continuing, I should share that I had to do a couple of extra steps to get modin to work beyond just
pip install modin. I had to install typing and dask as well.
pip install "modin[dask]"
pip install typing
The code is exactly the same as above, except for one minor change to import modin:
import modin.pandas as pd
from functools import reduce  # built in on Python 2, must be imported on Python 3
from timeit import default_timer as timer

import modin.pandas as pd

# run 25 iterations of read_csv to get an average
time = []
for i in range(0, 25):
    start = timer()
    df = pd.read_csv('OSMV-20190206.csv')
    end = timer()
    time.append(end - start)

# print out the average time taken
# I *think* I got this little trick from
# https://stackoverflow.com/a/9039992/2887031
print(reduce(lambda x, y: x + y, time) / len(time))
With modin, it seems to take – on average – 0.96 seconds to read a 105MB csv file.
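Since both runs repeat the same timing loop, it could be pulled into a small helper that works for either library. This is only a sketch; average_runtime is a name I made up, and it simply wraps the same default_timer approach used above.

```python
from timeit import default_timer as timer


def average_runtime(func, iterations=25):
    """Call `func` `iterations` times and return the mean wall-clock seconds."""
    durations = []
    for _ in range(iterations):
        start = timer()
        func()
        durations.append(timer() - start)
    return sum(durations) / len(durations)


# Example (assumes `pd` is pandas or modin.pandas and the file exists):
#   avg = average_runtime(lambda: pd.read_csv('OSMV-20190206.csv'))
#   print(avg)
```

Passing the work in as a function means the only thing that changes between the pandas and modin runs is the import.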
In this example, using modin shaved 0.3 seconds off the average time to read that 105MB csv file. That may not seem like a lot of time, but it is a savings of around 24%. Just imagine you’ve got 5000 csv files of similar size to read in: that’s a savings of 1500 seconds on average, or 25 minutes of time saved in just reading files.
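The arithmetic is easy to redo for your own numbers. Here it is recomputed from the rounded averages quoted above (1.26 and 0.96 seconds); unrounded timings may give a slightly different percentage. The function name is just for illustration.

```python
def read_time_savings(baseline_seconds, improved_seconds, n_files):
    """Return (seconds saved per file, percent saved, total minutes saved)."""
    saved_per_file = baseline_seconds - improved_seconds
    percent_saved = saved_per_file / baseline_seconds * 100
    total_minutes = saved_per_file * n_files / 60
    return saved_per_file, percent_saved, total_minutes


per_file, percent, minutes = read_time_savings(1.26, 0.96, 5000)
print(round(per_file, 2), round(percent, 1), round(minutes, 1))
# → 0.3 23.8 25.0
```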
Modin uses Ray to speed pandas up, so there could be even more savings if you get in and play around with some of the settings of Ray.
I’ll be looking at modin more in the future to use in some of my projects to help gain some efficiencies. Take a look at it and let me know what you think.
Thank you for sharing
Have you seen this error? Do you know how to fix it?
No matching distribution found for ray==0.6.2 (from modin)
Are you using Python 3.7? I’m not sure if they have a wheel for 3.7 yet.
I have the same problem…
“Could not find a version that satisfies the requirement ray==0.6.2 (from modin) (from versions: ) No matching distribution found for ray==0.6.2 (from modin)”
I use Python 3.6.8 :: Anaconda, Inc.
I would suggest you stop by the modin and/or Ray GitHub and raise an issue or ask over there.
Ray does not run on Windows! You would need to install a bridge to Linux from Windows and then run your Python instance from a bash shell. Out of the question.
Not “out of the question”. There are plenty of ways to get Ray to run when using a Windows machine.
For one, install Vagrant. Then, run Vagrant using Ubuntu. Then, install Ray. I run it like this often and it runs just fine. Here’s how to install Vagrant on Windows: https://pythondata.com/vagrant-on-windows/
Another is using a virtual machine system like VMWare, etc. If these steps are “out of the question” for you, then Ray won’t work for you.
Modin relies on Ray and Ray apparently does not exist for Windows: https://github.com/ray-project/ray/issues/631
Thanks Peter. Good to know.
Excellent, I see a 15 times speed increase on my data, thank you!