Working with large CSV files in Python

I’m currently working on a project that has multiple very large CSV files (6+ gigabytes). Normally when working with CSV data, I read the data in using pandas and then start munging and analyzing it. With files this large, reading the data into pandas directly can be difficult (or impossible) due to memory constraints, especially if you’re working on a prosumer computer. In this post, I describe a method that will help you when working with large CSV files in Python.

While it would be pretty straightforward to load the data from these CSV files into a database, there might be times when you don’t have access to a database server and/or you don’t want to go through the hassle of setting one up. If you are going to be working on a data set long-term, you absolutely should load that data into a database of some type (MySQL, PostgreSQL, etc.), but if you just need to do some quick checks / tests / analysis of the data, below is one way to get a look at the data in these large files with Python, pandas and SQLite.

To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.
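Something along these lines should do it (pandas for reading the CSV, sqlalchemy for talking to the database):

```python
# Import pandas for CSV handling and sqlalchemy for the database engine
import pandas as pd
from sqlalchemy import create_engine
```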

Next, set up a variable that points to your CSV file. This isn’t strictly necessary, but it does help with reusability.
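For example (the path here is just a placeholder, so point it at your own file):

```python
# Placeholder path; replace with the location of your large CSV file
file = '/path/to/my_large_file.csv'
```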

With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the CSV file to see what the contents might look like.
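A minimal version of that check, assuming the `file` variable from above, looks like this:

```python
# Read just the first 5 rows and print them to inspect the structure
print(pd.read_csv(file, nrows=5))
```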

This command uses pandas’ read_csv function to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the CSV file and make sure the data is formatted in a way that makes sense for your work.

Before we can actually work with the data, we need to get it somewhere that lets us filter it and work with subsets of it. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we’ll set up a local SQLite database, read the CSV file in chunks and then write those chunks to SQLite.

To do this, we’ll first need to create the SQLite database using the following command.
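A sketch of that step, assuming a database file named csv_database.db (the name is arbitrary):

```python
# Create (or open) a local SQLite database via sqlalchemy
csv_database = create_engine('sqlite:///csv_database.db')
```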

Next, we need to iterate through the CSV file in chunks and store the data into SQLite.
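The loop below is a sketch of that process; the table name 'data' and the running-index bookkeeping are my own choices, so adjust them to fit your data:

```python
chunksize = 100000  # number of rows to read per chunk
i = 0               # counts how many chunks have been processed
j = 0               # tracks a running row index across chunks

for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    # Remove spaces from column names so they are easier to query later
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j          # shift the index so rows keep unique ids across chunks
    i += 1
    df.to_sql('data', csv_database, if_exists='append')  # append this chunk to SQLite
    j = df.index[-1] + 1
```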

With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of iterators (i=0, j=0) and then running through a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from any of the column names, then stores the chunk into the SQLite database (df.to_sql(…)).

This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas’ SQL tools to pull data from the database without worrying about memory constraints.

To access the data now, you can run commands like the following:
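For example, assuming the 'data' table created above:

```python
# Pull the entire table back into a dataframe
# (fine for a quick look, but this loads everything into memory)
df = pd.read_sql_query('SELECT * FROM data', csv_database)
```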

Of course, using ‘select *…’ will load all the data into memory, which is the problem we are trying to get away from, so you should throw some filters into your select statements to limit the data returned. For example:
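Something like this (the column names and value are hypothetical, so substitute your own):

```python
# Only pull the columns and rows you actually need
df = pd.read_sql_query(
    "SELECT col1, col2 FROM data WHERE col1 = 'some_value'",
    csv_database,
)
```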


Data Analytics & Python

So you want (or need) to analyze some data. You’ve got some data in an Excel spreadsheet or database somewhere and you’ve been asked to take that data and do something useful with it. Maybe it’s time for data analytics & Python?

Maybe you’ve been asked to build some models for predictive analytics. Maybe you’ve been asked to better understand your customer base based on their previous purchases and activity.  Perhaps you’ve been asked to build a new business model to generate new revenue.

Where do you start?

You could go out and spend a great deal of money on systems to help you in your analytics efforts, or you could start with tools that are already available to you. You could open up Excel, which is very much overlooked for data analytics these days. Or…you could install open source tools (for free!) and begin hacking away.

When I was in your shoes in my first days playing around with data, I started with Excel. I quickly moved on to other tools because the things I needed to do seemed difficult to accomplish in Excel. I then installed R and began to learn ‘real’ data analytics (or so I thought).

I liked (and still do like) R, but it never felt like ‘home’ to me. After a few months poking around in R, I ran across Python and fell in love. Python felt like home to me.

With Python, I could quickly cobble together a script to do just about anything I needed to do. In the 5+ years I’ve been working with Python, I’ve yet to find anything I cannot do with it and freely available modules.

Need to do some time series analysis and/or forecasting? Python and statsmodels (along with others).

Need to do some natural language processing?  Python and NLTK (along with others).

Need to do some machine learning work? Python and sklearn (along with others).

You don’t HAVE to use Python for data analysis. R is perfectly capable of doing the same things Python can, and in some cases R has more capabilities because it’s been used as an analytics tool for much longer.

That said, I prefer Python and use it in everything I do. Data analytics & Python go together quite well.