pandas Cheat Sheet (via yhat)

pandas cheat sheetThe folks over at yhat just released a cheat sheet for pandas.  You can download the cheat sheet in PDF for here.

There’s a couple important functions that I use all the time missing from their cheat sheet (actually….there are a lot of things missing, but its a great starter cheat sheet).

A few things that I use all the time with pandas dataframes that are worth collecting in one place are provided below.

Renaming columns in a pandas dataframe:

Iterating over a pandas dataframe:

Splitting pandas dataframe into chunks:

The function plus the function call will split a pandas dataframe (or list for that matter) into NUM_CHUNKS chunks. I use this often when working with the multiprocessing libary.

Accessing the value of a specific cell:

This will give you the value of the last row’s “COLUMN” cell.  This may not be the ‘best’ way to do it, but it gets the value

Getting rows matching a condition:

The below will get all rows in a pandas dataframe that match the criteria.  In addition to finding equality, you can do all the logical operators.

Getting rows matching multiple conditions:

This gets rows that match a criteria in COLUMN1 and those that match another criteria in COLUMN2

Getting the ‘next’ row of data in a pandas dataframe

I’m currently working with stock market trade data that is output from a backtesting engine (I’m working with backtrader currently) in a pandas dataframe.  The format of the ‘transcations’ data that is provided out of the backtesting engine is shown below.

amount price value
2016-01-07 00:00:00+00:00 79.017119 195.33 -15434.413883
2016-09-07 00:00:00+00:00 -79.017119 218.84 17292.106354
2016-09-20 00:00:00+00:00 82.217609 214.41 -17628.277649
2016-11-16 00:00:00+00:00 -82.217609 217.56 17887.263119

The data provided gives four crucial pieces of information:

  • Date – The date of a transaction.
  • Amount – the number of shares purchased (positive number) or sold (negative number)  during the transaction.
  • Price – the price received or paid at the time of the sale.
  • Value – the cash value of the transaction.

Backtrader’s transactions dataframe is comprised of two rows make for one one transaction (the first is the ‘buy’ the second is the ‘sell).  For example, in the data above, the first two rows (Jan 7 2016 and Sept 7th 2016) are the ‘buy’ data and ‘sell’ data for one transaction. What I need to do with this data is transform it (using that term loosely) into one row of data for each transaction to store into  database for use in another analysis.

I could leave it in its current form, but I prefer to store transactions in one row when dealing with market backtests.

There are a few ways to attack this particular problem.  You could iterate over the dataframe and manually pick each row. That would be pretty straightforward, but not necessarily the best way.

While looking around the web for some pointers, I stumbled across this answer that does exactly what I need to do.   I added the following code to my script and — voila — I have my transactions transformed from two rows per transaction to one row.

Note: In the above, I only want to build rows that start with a positive amount in the ‘amount’ column because amount in the first row of a transactions is always positive. I then append each transformed transaction into an array to be used for more analysis down at a later point in the script.

Working with large CSV files in Python

large csv files in pythonI’m currently working on a project that has multiple very large CSV files (6 gigabytes+). Normally when working with CSV data, I read the data in using pandas and then start munging and analyzing the data. With files this large, reading the data into pandas directly can be difficult (or impossible) due to memory constrictions, especially if you’re working on a prosumer computer. In this post, I describe a method that will help you when working with large CSV files in python.

While it would be pretty straightforward to load the data from these CSV files into a database, there might be times when you don’t have access to a database server and/or you don’t want to go through the hassle of setting up a server.  If you are going to be working on a data set long-term, you absolutely should load that data into a database of some type (mySQL, postgreSQL, etc) but if you just need to do some quick checks / tests / analysis of the data, below is one way to get a look at the data in these large files with python, pandas and sqllite.

To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.

Next, set up a variable that points to your csv file.  This isn’t necessary but it does help in re-usability.

With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.

This command uses pandas’ “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.

Before we can actually work with the data, we need to do something with it so we can begin to filter it to work with subsets of the data. This is usually what I would use pandas’ dataframe for but with large data files, we need to store the data somewhere else. In this case, we’ll set up a local sqllite database, read the csv file in chunks and then write those chunks to sqllite.

To do this, we’ll first need to create the sqllite database using the following command.

Next, we need to iterate through the CSV file in chunks and store the data into sqllite.

With this code, we are setting the chunksize at 100,000 to keep the size of the chunks managable, initializing a couple of iterators (i=0, j=0) and then running through a for loop.  The for loop reads a chunk of data from the CSV file, removes spaces from any of column names, then stores the chunk into the sqllite database (df.to_sql(…)).

This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas ‘sql’ tools to pull data from the database without worrying about memory constraints.

To access the data now, you can run commands like the following:

Of course, using ‘select *…’ will load all data into memory, which is the problem we are trying to get away from so you should throw from filters into your select statements to filter the data. For example: