I’m currently working on a project that has multiple very large CSV files (6+ gigabytes). Normally when working with CSV data, I read the data into pandas and then start munging and analyzing it. With files this large, reading the data into pandas directly can be difficult (or impossible) due to memory constraints, especially if you’re working on a prosumer computer. In this post, I describe a method that will help you when working with large CSV files in Python.
While it would be pretty straightforward to load the data from these CSV files into a database, there might be times when you don’t have access to a database server and/or you don’t want to go through the hassle of setting up a server. If you are going to be working on a data set long-term, you absolutely should load that data into a database of some type (MySQL, PostgreSQL, etc.), but if you just need to do some quick checks / tests / analysis of the data, below is one way to get a look at the data in these large files with Python, pandas and SQLite.
To get started, you’ll need to import pandas and sqlalchemy. The commands below will do that.
import pandas as pd
from sqlalchemy import create_engine
Next, set up a variable that points to your csv file. This isn’t necessary but it does help in re-usability.
file = '/path/to/csv/file'
With these three lines of code, we are ready to start analyzing our data. Let’s take a look at the ‘head’ of the csv file to see what the contents might look like.
print(pd.read_csv(file, nrows=5))
This command uses pandas’ “read_csv” command to read in only 5 rows (nrows=5) and then print those rows to the screen. This lets you understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work.
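Along the same lines, it can help to look at the column names and the dtypes pandas infers from that small sample before deciding how to load the rest. This is a small extra check on my part (reusing the file variable and import from above):

sample = pd.read_csv(file, nrows=5)   # read only the first 5 rows
print(sample.columns.tolist())        # column names
print(sample.dtypes)                  # the types pandas inferred from the sample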
Before we can actually work with the data, we need to store it somewhere that lets us filter it and work with subsets of the data. This is usually what I would use a pandas dataframe for, but with large data files we need to store the data somewhere else. In this case, we’ll set up a local SQLite database, read the CSV file in chunks and then write those chunks to SQLite.
To do this, we’ll first need to create the SQLite database using the following command.
csv_database = create_engine('sqlite:///csv_database.db')
Next, we need to iterate through the CSV file in chunks and store the data into SQLite.
chunksize = 100000
i = 0
j = 1

for df in pd.read_csv(file, chunksize=chunksize, iterator=True):
    # strip spaces from the column names so they are easier to query later
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    # shift the index so it keeps increasing across chunks
    df.index += j
    i += 1
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
With this code, we are setting the chunksize at 100,000 to keep the size of the chunks manageable, initializing a couple of counters (i=0, j=1) and then running through a for loop. The for loop reads a chunk of data from the CSV file, removes spaces from the column names, then stores the chunk in the SQLite database (df.to_sql(…)).
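Because the load can run for a long time on a multi-gigabyte file, one small variation (my own addition, not part of the original code) is to print a running total as each chunk is written; the logic is otherwise the same as above:

chunksize = 100000
j = 1
rows_written = 0

for i, df in enumerate(pd.read_csv(file, chunksize=chunksize, iterator=True)):
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j                      # keep the index increasing across chunks
    df.to_sql('table', csv_database, if_exists='append')
    j = df.index[-1] + 1
    rows_written += len(df)
    print('chunk {}: {} rows written so far'.format(i + 1, rows_written))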
This might take a while if your CSV file is sufficiently large, but the time spent waiting is worth it because you can now use pandas’ SQL tools to pull data from the database without worrying about memory constraints.
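As a quick sanity check once the load finishes (a small addition on my part, not strictly necessary), you can ask SQLite for the row count before pulling anything large back into pandas. Note that because the table is named table, a reserved word in SQLite, it is double-quoted in the query:

row_count = pd.read_sql_query('SELECT COUNT(*) AS n FROM "table"', csv_database)
print(row_count)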
To access the data now, you can run commands like the following:
df = pd.read_sql_query('SELECT * FROM table', csv_database)
Of course, using ‘select *…’ will load all data into memory, which is the problem we are trying to get away from, so you should add filters to your select statements to filter the data. For example:
df = pd.read_sql_query('SELECT COL1, COL2 FROM table WHERE COL1 = SOMEVALUE', csv_database)
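As a slightly fuller sketch of that idea (the column names and value here are placeholders, not columns from any real file, and the table name is double-quoted because table is a reserved word in SQLite), you can combine a column list, a WHERE clause and a LIMIT so only a small slice of the data ever reaches pandas:

# Filter in the database so only matching rows come back into memory.
query = """
    SELECT COL1, COL2
    FROM "table"
    WHERE COL1 = 'SOMEVALUE'
    LIMIT 1000
"""
subset = pd.read_sql_query(query, csv_database)
print(subset.shape)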
Hi, recently I’ve been trying to use a classification function over a large CSV file (consisting of 58000 instances (rows) & 54 columns). For this approach I need to make a matrix out of the first 54 columns and all the instances, which gives me an array, but the problem is that memory cannot handle this large array, so I searched and found your website. Now, at the end you used df = pd.read_sql_query('SELECT * FROM table', csv_database). First: this shows a syntax error. Second: I need to have all the columns from 1 to…
The code
df = pd.read_sql_query('SELECT * FROM table', csv_database)
is just a sample of what you could do with the data once it is loaded into the database. For what you are trying to do, you’ll need to customize that code for your database, table and data.
Thanks for your reply, but would you let me know how to do this?
How do I customize that code for my database?
Unfortunately, there are too many unknowns for me to help. I have no idea what your database looks like, what it is called or how you have it set up.
Good luck.
Thanks, it’s very helpful.
Alternatively, you can do the filtering natively in Pandas:
# columns we wish to keep/filter on
cols_to_keep = ['id', 'member_id', 'loan_amnt', 'funded_amnt']

# setup dataframe iterator; the 'usecols' parameter filters the columns in the csv
df_iter = pd.read_csv(r'/tmp/z_data/LoanStats_2016Q1.csv.zip', skiprows=1, compression='zip',
                      chunksize=20000, usecols=cols_to_keep)

dfs = []  # this list will store the filtered dataframes for concatenation

for df in df_iter:
    temp_df = (df.rename(columns={col: col.lower()
                                  for col in df.columns})
               # filter
               .pipe(lambda x: x[x.funded_amnt > 10000]))

    dfs += [temp_df.copy()]

# combine filtered dfs into large output df
concat_df = pd.concat(dfs)
Good stuff David. Thanks for sharing!
Thanks Eric, very helpful. How do you clear your cache after the SELECT operation?
That’s a really good question Robert. I don’t know off the top of my head but will try to take a look at it soon.
I did everything the way you said, but I can’t query the database. Can you help me with how to do it? I know I have some missing knowledge. I get the error: OperationalError: (sqlite3.OperationalError) near "table": syntax error [SQL: 'SELECT * FROM table']
I think I didn’t create a table inside the database, or I don’t know, but I imported from my CSV file the way you did.
Right. You need to have a table in the database. Then you need to select from that table and use the correct name of the table. So your SQL statement would be ‘select * from TABLENAME’ where TABLENAME would be your actual table name.
I copied this example exactly and had the same error. Put table in double quotes and it worked.
df = pd.read_sql_query('SELECT * FROM "table"', csv_database)
THANK YOU! I was stuck on this as well, adding double quotes around “table” solved for me too.
Thank you.
I had the same error and your solution worked.
Thanks.
This proved to be very useful.
I get an error on this line: df.index += j
TypeError: can only concatenate tuple (not “int”) to tuple
Thanks a lot! Really helpful.
Hi,
I have a CSV with nearly 100 rows and 100k columns. I would like to know how to chunk the dataset. In most cases I have only seen data chunked by rows, so how do we handle this for columns?
If you have 100 rows and 100K columns, I’d transpose it and work by row instead of by column
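If it helps, here is a minimal sketch of that transpose idea (the paths are hypothetical, and it assumes the 100 rows fit in memory once loaded, which they usually will at that shape):

import pandas as pd

# 100 rows x 100k columns is very wide but short, so the full read usually fits in memory;
# transposing turns it into 100k rows x 100 columns, which then chunks naturally by row.
wide = pd.read_csv('/path/to/wide_file.csv')   # hypothetical input path
tall = wide.T                                  # transpose: columns become rows
tall.to_csv('/path/to/tall_file.csv')          # hypothetical output; process row-wise from here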
Hi, I have a CSV dataset which has 311030 rows and 42 columns and I want to load it into a table widget in PyQt4. When I load this dataset into the table widget via csv.reader(), the application stops working and a pop-up window appears saying "Python stopped working". Kindly guide me on how to solve this problem. Thanks.
You are most likely running out of memory when loading the CSV file. You’ll need to load the csv data in chunks (and use paging on the table) most likely.
Excuse me sir, can you guide me more, because I have no experience with what you described in the above comment regarding chunks and paging on the table. The following is my code; can you please edit it according to your own views? Thanks.
path = QFileDialog.getOpenFileName(self, "Open File", os.getenv('Home'), '*.csv')
list_name = []
with open(path) as csvfile:
    reader = csv.reader(csvfile)
    for line in reader:
        list_name.append(line)
self.tableWidget.setRowCount(len(list_name))
self.tableWidget.setColumnCount(len(list_name[0]))
for i, row in enumerate(list_name):
    for j, col in enumerate(row):
        item = QTableWidgetItem(col)
        self.tableWidget.setItem(i, j, item)
Sorry, but I’m not able to assist with this. You should try StackOverflow.com for help.
OK sir thank you so much
Hi, this is a great article and very well explained!! When I am trying to load a 13 GB pipe-delimited CSV file with 500+ columns and 5 million rows, I am getting an out of memory error even though I set 'PRAGMA cache_size = 10000'. Please suggest if there is an option for me to try.
Traceback (most recent call last):
File "C:/Users/krishnd/PycharmProjects/DataMasking/maskAccountMasterSqlite_Tune.py", line 232, in main()
File "C:/Users/krishnd/PycharmProjects/DataMasking/maskAccountMasterSqlite_Tune.py", line 205, in main uploadtodb(conn)
File "C:/Users/krishnd/PycharmProjects/DataMasking/maskAccountMasterSqlite_Tune.py", line 31, in uploadtodb for df in pd.read_csv(file, sep='|', chunksize=chunksize, iterator=True, low_memory=False):
File "C:\Users\krishnd\PycharmProjects\DataMasking\venv\lib\site-packages\pandas\io\parsers.py", line 1115, in __next__ return self.get_chunk()
File "C:\Users\krishnd\PycharmProjects\DataMasking\venv\lib\site-packages\pandas\io\parsers.py", line 1173,…
Hi Dinesh – thanks for the comment and for stopping by. I’m not sure what’s going on here, other than you could be running out of physical memory / hard drive space / etc. Even with Dask, you can still hit limits like this. I would stop by the Dask Github and ask over there (https://github.com/dask/dask)