Month: November 2016

Working with large CSV files in Python

I’m currently working on a project that has multiple very large CSV files (6 gigabytes+). Normally when working with CSV data, I read the data in using pandas and then start munging and analyzing the data. With files this large, reading the data into pandas directly can be difficult (or impossible) due to memory constraints, especially if you’re working on a prosumer computer. In this post, I describe a method that will help you when working with large CSV files in python.

While it would be pretty straightforward to load the data from these CSV files into a database, there might be times when you don’t have access to a database server and/or you don’t want to go through the hassle of setting up a server. If you are going to be working on a data set long-term, you absolutely should load that data into a database of some type (MySQL, PostgreSQL, etc.), but if you just need to do some quick checks / tests / analysis of the data, below is one way to get a look at the data in these large files with python, pandas and SQLite.

To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.
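Something like this covers it (only create_engine is needed from sqlalchemy here):

    import pandas as pd
    from sqlalchemy import create_engine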

Next, set up a variable that points to your csv file.  This isn’t necessary but it does help in re-usability.
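For example (the path here is a placeholder; point it at your own file):

    # placeholder path; substitute the location of your large CSV file
    file = '/path/to/csv/file.csv'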

With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.
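One way to do that, using the variables set up above:

    # preview only the first 5 rows rather than loading the whole file
    print(pd.read_csv(file, nrows=5))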

This command uses pandas’ “read_csv” function to read in only 5 rows (nrows=5) and then prints those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.

Before we can actually work with the data, we need to get it somewhere we can filter it and pull out subsets to work with. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we’ll set up a local SQLite database, read the csv file in chunks and then write those chunks to SQLite.

To do this, we’ll first need to create the SQLite database using the following command.
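The engine below writes to a local file named csv_database.db; the name is arbitrary:

    csv_database = create_engine('sqlite:///csv_database.db')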

Next, we need to iterate through the CSV file in chunks and store the data into SQLite.
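A sketch of that loop (the table name ‘data’ is a placeholder; use whatever makes sense for your data):

    chunksize = 100000
    i = 0  # counts the chunks processed
    j = 0  # offsets the index so each chunk continues where the last left off
    for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
        # remove spaces from the column names
        df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
        df.index += j
        i += 1
        df.to_sql('data', csv_database, if_exists='append')
        j = df.index[-1] + 1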

With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of iterators (i=0, j=0) and then running through a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk into the SQLite database (df.to_sql(…)).

This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas’ sql tools to pull data from the database without worrying about memory constraints.

To access the data now, you can run commands like the following:
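For example, to pull everything back into a dataframe (assuming the ‘data’ table from the sketch above):

    df = pd.read_sql_query('SELECT * FROM data', csv_database)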

Of course, using ‘select *…’ will load all data into memory, which is the problem we are trying to get away from, so you should throw some filters into your select statements to filter the data. For example:
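    # 'col1' and the threshold are hypothetical; substitute your own column and filter
    df = pd.read_sql_query('SELECT * FROM data WHERE col1 > 100', csv_database)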

Eric D. Brown, D.Sc. has a doctorate in Information Systems with a specialization in Data Sciences, Decision Support and Knowledge Management. He writes about utilizing python for data analytics at pythondata.com and the crossroads of technology and strategy at ericbrown.com

Installing Python on OSX (and the necessary modules)

If you need help installing python on OSX, read on.

For the last three years, I’ve used a mac for all my development. I love the fact that everything ‘just works’ on the platform. That said, when you get into scientific computing and data analytics, especially with python, you can run into some issues.

Just like linux, python is included with the operating system. Unlike linux, though, relying on that system copy can cause problems for you long-term due to upgrades and changes that Apple may make to the python ecosystem.

On OS X, I recommend that those of you starting out go with Anaconda or Enthought Canopy. As I said in “Installing python on Windows“, I prefer Canopy over Anaconda for scientific computing / data analytics, but either will work for you. Installing Canopy on the mac is very similar to installing it on Windows…so I’ll let this post be your guide for installing Canopy.

If you want to get into the nitty-gritty and install and configure python and the modules yourself, you can easily do so, but be prepared to spend some time on the command line.

Before we get started installing python on your Mac, we need to install homebrew, a package manager for OS X (it acts much like the ‘apt’ package manager on ubuntu / debian).

To install homebrew, open a terminal and paste the following:
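    # the one-liner from brew.sh as of late 2016; check that site for the current command
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"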

This command installs the homebrew ecosystem onto your machine and preps your machine to be ready to install various packages, including python.

Installing Python on OSX

Step 1: Let’s get python installed via homebrew.  In your terminal, type:
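    brew install python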

This will install a version of python onto your machine and set up your environment to use that version. This helps mitigate any issues you might have down the road if / when Apple makes changes to the system provided python.   Additionally, brew installs pip into the system to make it easy to get the necessary modules onto your machine.

From this point on, we are generally going to follow exactly the same steps that I outline in Installing Python on Linux except we don’t need to install any additional tools.

Step 2: Not required, but highly recommended – install a virtual environment.  I recommend virtualenv. Install it with this command:
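    pip install virtualenv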

When you are ready to get started on a new project, type the below command to install python into a new virtual environment (the ‘env’ is the name of the environment). You only have to do this once per project. Note: You should use a folder per project to keep your virtual environments separated.
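    virtualenv env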

Whenever you want to work on a specific project, change into that folder and type the following. This will set up your environment with all of your installed python modules:
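    source env/bin/activate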

For the purpose of this walk-through let’s create a new directory, set up a new virtual environment and then install the necessary modules.

  • Create a folder in your home directory called ‘projects’ by typing “mkdir projects” from the command line.
  • Type “cd projects” to change into that folder, then “mkdir install_example” to create another folder inside the projects folder.
  • Type “cd install_example” and then “virtualenv env” to create your virtual environment inside the project folder.
  • Type “source env/bin/activate” to begin using this environment.

Now that we have our environment ready to go, we need to install some of the modules that are most often used when doing data work inside python (at a minimum: pandas, numpy and matplotlib).

The above modules can be installed with one pip command.
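    # installs the core data-analysis stack; add others (scipy, jupyter, etc.) as you need them
    pip install pandas numpy matplotlib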

You’re ready to start working with python for data analysis on your mac. Just remember, for each virtualenv you create, you’ll need to reinstall these modules if you wish to use them.

Check back here often for more information on using the above modules to actually DO something.


Installing Python on Linux (and the necessary modules)

Need help installing python on linux?  I can hopefully help.  To get started installing python on linux, there are a couple of options for you. The first option – which is most likely the easiest with the least headaches – is to go download Anaconda or Enthought Canopy.  Either of these routes will get python installed and configured in such a way that will allow you to step right in and just use it.

As I said in “Installing python on Windows“, I prefer Canopy over Anaconda for scientific computing / data analytics but either will work for you.  Installing Canopy on linux is very similar to installing it on Windows…so I’ll let this post be your guide for installing Canopy.

The second option for python is to  install all the pieces yourself using the command line and/or the package manager provided by your linux distribution.  I’m a fan of the command line and will provide that overview here.

I’m going to assume that you are on a recent flavor of ubuntu for this (I’m using 16.04.1). If you are on another distribution, contact me and I can give you the instructions for those distros.

Before we get started, you should know that every linux distribution that I know of has python already installed, and most have python 2.7 installed.  I prefer 2.7 for data analytics so we’ll stick with that during this installation process.

Installing python on Linux (ubuntu) from the command line

Step 1 – open a terminal window and type “python”.  Tada…you’re done! (not really). As I said above, python 2.7 is installed on most (all?) linux distributions. There’s more to getting ready to use python for scientific applications / data science than just having python though.

Type “exit()” into the python interpreter if you haven’t already closed it.

Step 2 – install various development tools for python that may not be installed. These include the ‘build-essential’ tools for linux, python’s ‘pip’ tool (to make python module installations easier) and ‘python-dev’ (needed for python headers, etc). In your terminal, type the following (note the ‘-y’ tells the apt-get command to install the items without asking for confirmation):
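    sudo apt-get install -y build-essential python-pip python-dev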

Step 3: Install a ‘virtual environment’. This isn’t a requirement, but I strongly recommend it as it allows you to segregate the various types of installations and versions of your python modules. For example, let’s say you do some development on python using pandas version 0.19 on all your projects. In six months, pandas upgrades and deprecates something that causes your code to break. You downgrade to pandas 0.19 to keep your code working but then see that pandas 0.21 contains an absolute ‘must have’ for a new project. What do you do? Re-write all of your code to use 0.21 or stay with 0.19? With a virtual environment, you can do both.

I use and recommend ‘virtualenv’. There are other options out there (using docker, individual virtual machines, etc) but virtualenv is the simplest / quickest way to get things done.  With virtualenv installed, you can install specific versions of python modules for a project while using other versions of modules for other projects.

To install virtualenv type the following (note that we are using ‘pip’ now rather than apt-get):
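    # sudo is needed here because this installs into the system python
    sudo pip install virtualenv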

Now, whenever you start a new project, type the following to install python into a new virtual environment (the ‘env’ is the name of the environment). You only have to do this once per project. Note: You should use a folder per project to keep your virtual environments separated.
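    virtualenv env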

Whenever you want to work on a specific project, change into that folder and type the following. This will set up your environment with all of your installed python modules:
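    source env/bin/activate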

For the purpose of this ‘Installing python on linux’ walk-through, perform the following commands:

  • Create a folder in your home directory called ‘projects’ by typing “mkdir projects” from the command line.
  • Type “cd projects” to change into that folder, then “mkdir install_example” to create another folder inside the projects folder.
  • Type “cd install_example” and then “virtualenv env” to create your virtual environment inside the project folder.
  • Type “source env/bin/activate” to begin using this environment.
  • Your prompt should now be prefixed with the environment name, e.g. “(env)”.


Now that we have our environment ready to go, we need to install some of the modules that are most often used when doing data work inside python (at a minimum: pandas, numpy and matplotlib).

The above modules can be installed with one pip command.
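    # installs the core data-analysis stack; add others (scipy, jupyter, etc.) as you need them
    pip install pandas numpy matplotlib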

You’re ready to start working with python for data analysis. Just remember, for each virtualenv you create, you’ll need to reinstall these modules if you wish to use them.

Check back here often for more information on using the above modules to actually DO something.
