There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e.,g Comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently its still a popular ‘question’ for many people and is my most popular question on stackoverflow.
As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?
dataframe 1 (named df1):
Date Fruit Num Color 2013-11-24 Banana 22.1 Yellow 2013-11-24 Orange 8.6 Orange 2013-11-24 Apple 7.6 Green 2013-11-24 Celery 10.2 Green
dataframe 2 (named df2):
Date Fruit Num Color 2013-11-24 Banana 22.1 Yellow 2013-11-24 Orange 8.6 Orange 2013-11-24 Apple 7.6 Green 2013-11-24 Celery 10.2 Green 2013-11-25 Apple 22.1 Red 2013-11-25 Orange 8.6 Orange
The answer, it seems, is quite simple – but I couldn’t figure it out at the time. Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.
Here’s the code (as provided by user alko on stackoverlow):
df = pd.concat([df1, df2]) # concat dataframes df = df.reset_index(drop=True) # reset the index df_gpby = df.groupby(list(df.columns)) #group by idx = [x for x in df_gpby.groups.values() if len(x) == 1] #reindex
This simple approach leads to the correct answer:
Date Fruit Num Color 9 2013-11-25 Orange 8.6 Orange 8 2013-11-25 Apple 22.1 Red
There are most likely more ‘pythonic’ answers (one suggestion is here) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is fast enough for my needs.
Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.