
Quick Tip: Consuming Google Search results to use for web scraping

While working on a project recently, I needed to grab some Google search results for specific search phrases and then scrape the content from the page results.

For example, when searching for a Sony 16-35mm f2.8 GM lens on Google, I wanted to grab some content (reviews, text, etc.) from the results. While this isn’t hard to build from scratch, I ran across a couple of libraries that are easy to use and make things so much easier.

The first is ‘Google Search’ (install via pip install google). This library lets you consume Google search results with just one line of code. An example is below: it imports the search function, runs a search for Sony 16-35mm f2.8 GM lens, and prints out the URLs of the results.
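Here’s a minimal sketch of that search. It assumes the search function from the googlesearch module (which is what pip install google provides) and its tld and stop parameters; the first_results helper, its search_fn argument and the stop value of 10 are illustrative names and choices, not part of the library.

```python
def first_results(query, stop=10, search_fn=None):
    """Return the first `stop` result URLs for `query` as a list."""
    if search_fn is None:
        # Imported lazily so the helper can be exercised without network access.
        # Assumes the `google` package; other packages that also install a
        # `googlesearch` module may expose a different signature.
        from googlesearch import search as search_fn
    # tld='com' sends the query to google.com; stop caps how many results are pulled.
    return list(search_fn(query, tld='com', stop=stop))
```

Calling first_results('Sony 16-35mm f2.8 GM lens') and printing each item of the returned list gives the result URLs.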

For the above, I’m using google.com for the search and have told it to stop after the first set of results.

The output:

That’s pretty easy.

Now, we can use those URLs to scrape the websites that are returned.

To scrape these sites, you could run a fairly complex scraping system or build your own…or, if you just need some basic content and aren’t going to be doing a LOT of scraping, you could use the ‘Newspaper’ library. Of course, there are plenty of other libraries, but Newspaper really simplifies things for those ‘quick and dirty’ projects. Note: This is best used in Python 3.

To get started, install newspaper with pip3 install newspaper3k (for python3).

Now, to scrape the urls returned from the google search, you can simply do the following:
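A minimal sketch of that step, assuming the Article class from the newspaper module with its download() and parse() methods. The fetch_article helper and its article_cls argument are hypothetical names used only for illustration.

```python
def fetch_article(url, article_cls=None):
    """Download and parse one page, returning the parsed article object."""
    if article_cls is None:
        # Imported lazily; newspaper3k installs the `newspaper` module.
        from newspaper import Article as article_cls
    article = article_cls(url)
    article.download()  # fetch the raw HTML for the URL
    article.parse()     # extract the body text and metadata
    return article
```

After the call, article.text holds the extracted body text of the page.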

This will grab the url, download it and parse it so you can access the content.  Here’s an example of grabbing the url https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx.

The output of print(article.text) is below (I’ve only included an excerpt for this example, but this will grab the entire text):

‘Those putting together the ultimate Sony E-mount lens kit are going to want this lens included. The Sony FE 16-35mm f/2.8 GM Lens covers a key focal length range in wide aperture with high quality. In this case, the term high quality applies both to the lens\’ physical attributes and to the image quality delivered by it.\n\nMany are first-attracted to the Alpha MILC (Mirrorless Interchangeable Lens Camera) system for Sony\’s high-performing full frame imaging sensors, but lenses are as important as cameras and Sony\’s lens lineup was initially viewed by many as deficient. Adapting Canon brand lenses for use on Sony cameras was prevalent. The introduction of Sony\’s flagship Grand Master line (the “GM” in the name) was very welcomed by Sony owners and this line is proving attractive to those considering a switch to the Sony camp. The 16-35mm f/2.8 GM is one more reason to stay entirely within the Sony brand.\n\nFocal Length Range\n\nWhen starting a kit, most will first select a general purpose lens (Sony system owners should seriously consider the Sony FE 24-70mm f/2.8 GM Lens) and one of the next-most-needed lenses is typically a wide-angle zoom. This 16-35mm range ideally covers that need.\n\nThe 107° angle of view provided by a 16mm focal length is ultra-wide and all of the narrower angles of view down to 63°, just modestly-wide, are included. To explore what this focal length range looks like, we head to RB Rickett\’s falls in Ricketts Glen State Park.\n\nOne of the most popular uses for this range is, as illustrated above, landscape photography.

Now, one of the really cool features of the newspaper library is that it has built-in natural language processing capabilities and can return keywords, summaries and other interesting tidbits. To get this to work, you must have the Natural Language Toolkit (NLTK) installed (install with pip install nltk) and have the punkt package installed from nltk. Here’s an example using the previous url (and assuming you’ve already done the above steps).
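A short sketch of the NLP step, assuming an article object that has already been downloaded and parsed, and newspaper’s nlp() method together with its keywords and summary attributes. The keywords_and_summary helper name is illustrative.

```python
def keywords_and_summary(article):
    """Run newspaper's NLP step on an already downloaded and parsed article."""
    # nlp() depends on NLTK's punkt tokenizer data being installed;
    # nltk.download('punkt') fetches it once per machine.
    article.nlp()
    return article.keywords, article.summary
```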

The result:

That’s quite nice (and easy!). Of course, if I were doing this as a serious NLP project, I’d write my own NLP functions, but for a quick look at the keywords of an article, this is a fast way to do it.

If you want to learn more about Natural Language Processing using NLTK, the definitive book is Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.

Photo by Émile Perron on Unsplash

Quick Tip: Comparing two pandas dataframes and getting the differences

There are times when working with different pandas dataframes that you might need to get the data that is ‘different’ between the two dataframes (i.e., comparing two pandas dataframes and getting the differences). This seems like a straightforward issue, but apparently it’s still a popular ‘question’ for many people and is my most popular question on Stack Overflow.

As an example, let’s look at two pandas dataframes. Both have date indexes and the same structure. How can we compare these two dataframes and find which rows are in dataframe 2 that aren’t in dataframe 1?

dataframe 1 (named df1):

dataframe 2 (named df2):

The answer, it seems, is quite simple – but I couldn’t figure it out at the time.  Thanks to the generosity of stackoverflow users, the answer (or at least an answer that works) is simply to concat the dataframes then perform a group-by via columns and finally re-index to get the unique records based on the index.

Here’s the code (as provided by user alko on Stack Overflow):
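The concat, group-by and re-index steps described above can be sketched as follows. The two small frames are made-up stand-ins rather than the data from this post, Date is kept as an ordinary column (an assumption) so that it takes part in the comparison, and the variable names are illustrative.

```python
import pandas as pd

# Illustrative data only: df2 repeats every row of df1 and adds two new rows.
df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2013-11-24', '2013-11-24', '2013-11-24']),
    'Fruit': ['Banana', 'Orange', 'Apple'],
    'Num': [22.1, 8.6, 7.6],
    'Color': ['Yellow', 'Orange', 'Green'],
})
new_rows = pd.DataFrame({
    'Date': pd.to_datetime(['2013-11-25', '2013-11-25']),
    'Fruit': ['Apple', 'Orange'],
    'Num': [22.1, 8.6],
    'Color': ['Red', 'Orange'],
})
df2 = pd.concat([df1, new_rows], ignore_index=True)

# 1. Stack both frames and give every row a unique integer label.
combined = pd.concat([df1, df2]).reset_index(drop=True)
# 2. Group on every column, so identical rows land in the same group.
groups = combined.groupby(list(combined.columns)).groups
# 3. A group with a single label is a row that appears in only one frame.
unique_labels = sorted(labels[0] for labels in groups.values() if len(labels) == 1)
# 4. Re-index to pull those rows back out.
diff = combined.reindex(unique_labels)
print(diff)
```

With this toy data, diff holds the two rows that exist only in df2. The same recipe would also report rows that exist only in df1, and it relies on neither frame containing duplicate rows of its own.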

This simple approach leads to the correct answer:

There are most likely more ‘pythonic’ answers (one suggestion is here) and I’d recommend you dig into those other approaches, but the above works, is easy to read and is fast enough for my needs.

Want more information about pandas for data analysis? Check out the book Python for Data Analysis by the creator of pandas, Wes McKinney.

Forecasting with Random Forests

When it comes to forecasting data (time series or other types of series), people look to things like basic regression, ARIMA, ARMA, GARCH, or even Prophet. Those are all good options, but don’t discount the use of Random Forests for forecasting.

Random Forests are generally considered a classification technique but regression is definitely something that Random Forests can handle.

For this post, I am going to use a dataset found here called Sales Prices of Houses in the City of Windsor (CSV here, description here). For the purposes of this post, I’ll only use the price and lotsize columns. Note: In a future post, I’m planning to revisit this data and perform multivariate regression with Random Forests.

To get started, let’s import all the necessary libraries. As always, you can grab a jupyter notebook to run through this analysis yourself here.

Now, let’s load the data:
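A minimal sketch of the import and load steps, assuming the CSV has price and lotsize columns as described above. The load_price_and_lotsize helper and its csv_source argument are hypothetical; pass whatever path or URL you saved the Windsor housing CSV under.

```python
import pandas as pd

def load_price_and_lotsize(csv_source):
    """Read the housing CSV and keep only the two columns used in this post."""
    df = pd.read_csv(csv_source)  # accepts a path, URL or open file object
    return df[['price', 'lotsize']]
```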

Again, we are only using two columns from the data set – price and lotsize. Let’s plot this data to take a look at it visually to see if it makes sense to use lotsize as a predictor of price.

Housing Data Visualization

Looking at the data, it looks like a decent guess to think lotsize might forecast price.

Now, let’s set up our dataset to get our training and testing data ready.
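One way to sketch that setup, assuming a dataframe with price and lotsize columns. The split_first_n helper is an illustrative name, and the default of 400 training rows mirrors the split used in this post.

```python
def split_first_n(df, n_train=400):
    """Use the first n_train rows for training and the remainder for testing."""
    X = df[['lotsize']]  # double brackets keep X two-dimensional, as scikit-learn expects
    y = df['price']
    X_train, X_test = X.iloc[:n_train], X.iloc[n_train:]
    y_train, y_test = y.iloc[:n_train], y.iloc[n_train:]
    return X_train, X_test, y_train, y_test
```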

In the above, we set X and y for the random forest regressor and then set our training and test data. For training data, we are going to take the first 400 data points to train the random forest and then test it on the last 146 data points.

Now, let’s run our random forest regression model.  First, we need to import the Random Forest Regressor from sklearn:

And now….let’s run our Random Forest Regression and see what we get.
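A hedged sketch of the import, fit and predict steps using scikit-learn’s RandomForestRegressor. The fit_and_predict helper, the n_estimators value and the fixed random_state are assumptions made for illustration, not necessarily the settings behind the numbers in this post.

```python
from sklearn.ensemble import RandomForestRegressor

def fit_and_predict(X_train, y_train, X_test, n_estimators=100, random_state=0):
    """Fit a random forest regressor on the training data and predict the test rows."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    # predict() returns one predicted price per row of X_test
    return model, model.predict(X_test)
```

Called with the training and test data, it returns the fitted model plus an array of predictions that can be stored as predicted_price.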

Let’s visualize the price and the predicted_price.

price vs predicted price

That’s really not a bad outcome for a wild guess that lotsize predicts price. Visually, it looks pretty good (although there are definitely errors).

Let’s look at the base level error. First, a quick plot of the ‘difference’ between the two.
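The ‘difference’ here is simply actual price minus predicted price for each test row. A small sketch, where prediction_errors is an illustrative helper name:

```python
import numpy as np

def prediction_errors(actual, predicted):
    """Element-wise actual minus predicted; positive means the model under-predicted."""
    return np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
```

The returned array can be handed straight to a plotting call to see where the large misses are.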

Price vs Predicted Difference

There are some very large errors in there. Let’s look at some values like R-Squared and Mean Squared Error. First, let’s import the appropriate functions from sklearn.

Now, let’s look at R-Squared:
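A sketch of both steps, assuming r2_score and mean_squared_error from sklearn.metrics. The fit_report and r2_by_hand helpers are illustrative names; the second one spells out what R-Squared measures, namely one minus the ratio of the squared prediction errors to the total squared variation around the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def fit_report(actual, predicted):
    """Return R-Squared and Mean Squared Error for a set of predictions."""
    return {
        'r2': r2_score(actual, predicted),
        'mse': mean_squared_error(actual, predicted),
    }

def r2_by_hand(actual, predicted):
    """R-Squared computed directly: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # error left after predicting
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot
```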

R-Squared is 0.6976…or basically 0.7. That’s not great, but not terribly bad either for a one-variable guess. A value of 0.7 (or 70%) tells you that roughly 70% of the variation in price is explained by the model’s predictions from lotsize. That’s really not bad in the grand scheme of things.

I could go on with other calculations for error, but the point of this post isn’t to show ‘accuracy’ but to show the ‘process’ of how to use Random Forest for forecasting.

Look for more posts on using random forests for forecasting.

If you want a very good deep-dive into using Random Forest and other statistical methods for prediction, take a look at The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Amazon Affiliate link)