There are times when working with pandas dataframes that you might need to get the data that is ‘different’ between two of them (i.e., comparing two pandas dataframes and finding the differences). This seems like a straightforward issue, but apparently it’s still a popular question for many people and is my most popular question on stackoverflow.
As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?
dataframe 1 (named df1):
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
dataframe 2 (named df2):
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
The answer, it seems, is quite simple – but I couldn’t figure it out at the time. Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes, group by all of the columns, and finally reindex to get the records that appear only once.
df = pd.concat([df1, df2])  # concat the dataframes
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # index of the unique rows
df.reindex(idx)  # reindex to pull out only the unique rows
This simple approach leads to the correct answer:
Date Fruit Num Color
9 2013-11-25 Orange 8.6 Orange
8 2013-11-25 Apple 22.1 Red
There are most likely more ‘pythonic’ answers (one suggestion is here) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is fast enough for my needs.
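For example, one alternative (a quick sketch, not necessarily the suggestion linked above) is to use pandas’ merge with the indicator option and keep only the rows that show up in df2:
# an outer merge on all common columns; indicator=True adds a '_merge'
# column that records where each row came from
merged = df1.merge(df2, how='outer', indicator=True)
# keep only the rows that appear in df2 but not in df1
diff = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')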
Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.
When it comes to forecasting data (time series or other types of series), people look to things like basic regression, ARIMA, ARMA, GARCH, or even Prophet, but don’t discount the use of Random Forests for forecasting.
Random Forests are generally considered a classification technique but regression is definitely something that Random Forests can handle.
For this post, I am going to use a dataset found here called Sales Prices of Houses in the City of Windsor (CSV here, description here). For the purposes of this post, I’ll only use the price and lotsize columns. Note: In a future post, I’m planning to revisit this data and perform multivariate regression with Random Forests.
To get started, let’s import all the necessary libraries. As always, you can grab a jupyter notebook to run through this analysis yourself here.
import pandas as pd
import matplotlib.pyplot as plt
# let's set the figure size and color scheme for plots
# (personal preference and not strictly needed)
plt.rcParams['figure.figsize']=(20,10)
plt.style.use('ggplot')
Again, we are only using two columns from the data set – price and lotsize. Let’s plot this data to take a look at it visually to see if it makes sense to use lotsize as a predictor of price.
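Before plotting, we need to load the data into a dataframe. Here’s a minimal sketch, assuming the CSV linked above has been saved locally as HousePrices.csv (the filename is an assumption; the price and lotsize column names come from the dataset description):
# load the Windsor housing data and keep only the two columns we care about
df = pd.read_csv('HousePrices.csv')
df = df[['price', 'lotsize']]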
df.plot(subplots=True)
Looking at the data, it looks like a decent guess to think lotsize might forecast price.
Now, let’s set up our dataset to get our training and testing data ready.
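Here’s a minimal sketch of that setup (the exact variable handling in the notebook may differ slightly):
# use lotsize as the single feature (X) and price as the target (y)
X = df['lotsize'].values
y = df['price'].values

# first 400 rows for training, remaining 146 rows for testing
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]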
In the above, we set X and y for the random forest regressor and then set our training and test data. For training data, we are going to take the first 400 data points to train the random forest and then test it on the last 146 data points.
Now, let’s run our random forest regression model. First, we need to import the Random Forest Regressor from sklearn:
from sklearn.ensemble import RandomForestRegressor
And now….let’s run our Random Forest Regression and see what we get.
# build our RF model
RF_Model = RandomForestRegressor(n_estimators=100,
                                 max_features=1, oob_score=True)
# let's get the labels and features in order to run our
# model fitting
labels = y_train
features = X_train[:, None]  # reshape to a 2-D array with a single feature column
# Fit the RF model with features and labels.
rgr = RF_Model.fit(features, labels)
# Now that we've run our model and fit it, let's create
# dataframes to look at the results
X_test_predict = pd.DataFrame(
    rgr.predict(X_test[:, None])).rename(
    columns={0: 'predicted_price'}).set_index(df.index[400:])
X_train_predict = pd.DataFrame(
    rgr.predict(X_train[:, None])).rename(
    columns={0: 'predicted_price'}).set_index(df.index[:400])
# combine the training and testing dataframes to visualize
# and compare.
RF_predict = pd.concat([X_train_predict, X_test_predict])
Let’s visualize the price and the predicted_price.
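To plot the two side by side, we first add the predictions back into our dataframe as a new column (a small step assumed here, using the combined RF_predict from above):
# attach the combined train/test predictions to the original dataframe
df['predicted_price'] = RF_predict['predicted_price']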
df[['price', 'predicted_price']].plot()
That’s really not a bad outcome for a wild guess that lotsize predicts price. Visually, it looks pretty good (although there are definitely errors).
Let’s look at the base level error. First, a quick plot of the ‘difference’ between the two.
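One way to do that, using the predicted_price column we added above:
# plot the error between the actual and predicted price
(df['price'] - df['predicted_price']).plot()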
There are some very large errors in there. Let’s look at some values like R-Squared and Mean Squared Error. First, let’s import the appropriate functions from sklearn.
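A sketch of those calculations, scoring against the predicted_price column built above (scoring the full series rather than just the test set is an assumption here):
from sklearn.metrics import r2_score, mean_squared_error

# compare the actual price to the random forest's predictions
print(r2_score(df['price'], df['predicted_price']))
print(mean_squared_error(df['price'], df['predicted_price']))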
R-Squared is 0.6976…or basically 0.7. That’s not great but not terribly bad either for a random guess. A value of 0.7 (or 70%) tells you that roughly 70% of the variation of the ‘signal’ is explained by the variable used as a predictor. That’s really not bad in the grand scheme of things.
I could go on with other calculations for error, but the point of this post isn’t to show ‘accuracy’ but to show the ‘process’ of using Random Forests for forecasting.
Look for more posts on using random forests for forecasting.
For years I’ve used the mysql-python library for connecting to mysql databases. It’s worked well for me over the years, but there are times when you need speed and/or better connection management than what you get with mysql-python. That’s where SQLAlchemy comes in.
Before diving into this, if you are doing things that aren’t dependent on speed (e.g., it doesn’t matter if it takes 1 second to connect to the database and grab your data and close the database) then you can easily ignore this tip. That said, if you have multiple connections, that connect time can add up.
For example, I recently had an issue where it was taking 4.5+ seconds to connect to a database, run analysis and spit out the results. That’s not terrible if it’s something for you only, but if it’s a production system and speed is a requirement, that might be too long (and it IS too long).
When I did some analysis using python’s timer() I found that more than 50% of that 4.5 seconds was spent establishing database connections, so I grabbed my trusty SQLAlchemy toolkit and went to work.
For those of you that don’t know, SQLAlchemy is a ‘python SQL toolkit and Object Relational Mapper’ (ORM) that is supposed to make things easier when working with SQL databases. For me, the ORM aspect tends to make things more difficult, so I tend to stick with plain SQL queries, but the SQL toolkit aspect of SQLAlchemy makes a lot of sense and adds some time savings when connecting to a SQL database.
Before we get into the SQLAlchemy aspects, let’s take a second to look at how to connect to a SQL database with the mysql-python connector (or at least take a look at how I do it).
First, let’s setup our import statements. For this, we will import MySQLdb, pandas and pandas.io.sql in order to read SQL data directly into a pandas dataframe.
import pandas as pd
import MySQLdb
import pandas.io.sql as psql
Next, let’s create a database connection, create a query, execute that query and close that database.
# setup the database connection. There's no need to setup cursors with pandas psql.
db=MySQLdb.connect(host=HOST, user=USER, passwd=PW, db=DBNAME)
# create the query
query = "select * from TABLENAME"
# execute the query and assign it to a pandas dataframe
df = psql.read_sql(query, con=db)
# close the database connection
db.close()
This is a fairly standard approach to reading data into a pandas dataframe from mysql using mysql-python. This approach is what I had been using before when I was getting the 4.5+ seconds discussed above. Note – there were multiple database calls and some analysis included in that 4.5+ seconds. A basic database call like the above ran in approximately 0.45 seconds in the code I was trying to improve, and establishing the database connection was the majority of that time.
To improve performance – especially if you will have multiple calls to multiple tables, you can use SQLAlchemy with pandas. You’ll need to pip install sqlalchemy if you don’t have it installed already. Now, let’s setup our imports:
import pandas as pd
import sqlalchemy as sql
Now you can set up your connection string to your database for SQLAlchemy. You’d put everything together like the following:
connect_string = 'mysql://USER:PW@DBHOST/DB'
where USER is your username, PW is your password, DBHOST is the database host and DB is the database you want to connect to.
To set up the persistent connection, you do the following:
sql_engine = sql.create_engine(connect_string)
Now, you have a connection to your database and you’re ready to go. No need to worry about cursors or opening/closing database connections. SQLAlchemy handles the connection management for you.
Now all you need to do is focus on your SQL queries and loading the results into a pandas dataframe.
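For example, using the same query as before with pandas’ read_sql_query:
# create the query
query = "select * from TABLENAME"
# execute the query against the SQLAlchemy engine and load the results into a dataframe
df = pd.read_sql_query(query, sql_engine)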
That’s all it takes. AND…it’s faster. In the example above, my database setup / connection / query / closing times dropped from 0.45 seconds to 0.15 seconds. Times will vary based on what data you are querying and where the database is of course but in this case, all things were the same except for mysql-python being replaced with SQLAlchemy and using the new(ish) read_sql_query function in pandas.
Using this approach, the 4.5+ seconds it took to grab data, analyze the data and return the data was reduced to about 1.5 seconds. Impressive gains for just switching out the connection/management method.