Quick Tip: Comparing two pandas dataframes and getting the differences

There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e.,g Comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently its still a popular ‘question’ for many people and is my most popular question on stackoverflow.

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

dataframe 1 (named df1):

Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

dataframe 2 (named df2):

Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple  22.1 Red
2013-11-25 Orange  8.6 Orange

The answer, it seems, is quite simple – but I couldn’t figure it out at the time.  Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.

Here’s the code (as provided by user alko on stackoverlow):

df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex

This simple approach leads to the correct answer:

        Date   Fruit   Num   Color
9  2013-11-25  Orange   8.6  Orange
8  2013-11-25   Apple  22.1     Red

There are most likely more ‘pythonic’ answers (one suggestion is here) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is  fast enough for my needs.


Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.

0 0 vote
Article Rating
7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
trackback

[…] post Quick Tip: Comparing two pandas dataframes and getting the differences appeared first on Python […]

Jurryt Pietersma
1 year ago

Consider passing the dataframes to concat in a dictionary, results in a multi-index dataframe from which you can easily delete the duplicates, which results in a multi-index dataframe with the differences between the dataframes:

https://stackoverflow.com/questions/20225110/comparing-two-dataframes-and-getting-the-differences/42652112#42652112

Saz
Saz
1 year ago

Is it possible to find the similarity between the two or multiple dataframes (instead of finding difference), and express the similarity in percentage (%)?

Saz
Saz
1 year ago

Similarity here I mean, compare the dataframes and look at how many similar rows they have