Category: data analytics

Local Interpretable Model-agnostic Explanations – LIME in Python

When working with classification and/or regression techniques, it’s always good to be able to ‘explain’ what your model is doing. Using Local Interpretable Model-agnostic Explanations (LIME), you can quickly provide visual explanations of your model(s).

It’s quite easy to throw numbers or content into an algorithm and get a result that looks good. We can test for accuracy and feel confident that the classifier and/or model is ‘good’…but can we describe what the model is actually doing to other users? A good data scientist spends some of their time making sure they have reasonable explanations for what the model is doing and why the results are what they are.

There’s always been a focus on ‘trust’ in any type of modeling methodology, but with machine learning and deep learning, many people feel that the black-box approach taken with these methods isn’t as trustworthy as other methods. This topic was addressed in a paper titled “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, which proposes the concept of Local Interpretable Model-agnostic Explanations (LIME). According to the paper, LIME is ‘an algorithm that can explain the predictions of any classifier or regressor in a faithful way, by approximating it locally with an interpretable model.’
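To make ‘approximating it locally with an interpretable model’ a little more concrete, here is a minimal sketch of the idea using only numpy. It is not the LIME library’s implementation (the real library also discretizes features, selects features, and uses a different regression setup). The function name explain_locally, the Gaussian perturbation, the kernel, and the toy black_box model are all assumptions made for illustration.

```python
import numpy as np

def explain_locally(predict_fn, instance, num_samples=2000, scale=0.5,
                    kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance."""
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)

    # 1. Perturb the instance with Gaussian noise to sample its neighbourhood.
    samples = instance + rng.normal(0.0, scale, size=(num_samples, instance.size))

    # 2. Ask the black-box model for predictions on the perturbed points.
    preds = predict_fn(samples)

    # 3. Weight each sample by its closeness to the instance (exponential kernel).
    dists = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))

    # 4. Weighted least squares: scale rows by sqrt(weight), then solve.
    design = np.column_stack([np.ones(num_samples), samples - instance])
    sw = np.sqrt(weights)[:, None]
    coefs, *_ = np.linalg.lstsq(design * sw, preds * sw.ravel(), rcond=None)

    intercept, slopes = coefs[0], coefs[1:]
    return intercept, slopes


# Hypothetical black-box model: non-linear in feature 0, linear in feature 1.
def black_box(X):
    return X[:, 0] ** 2 + 3.0 * X[:, 1]

intercept, slopes = explain_locally(black_box, [2.0, 1.0])
print("local prediction:", round(float(intercept), 2))
print("local feature effects:", np.round(slopes, 2))
```

Around the point (2, 1), the slopes of the surrogate should come out close to 4 and 3, which is the local gradient of the toy model. Those per-feature slopes are the kind of quantity a LIME explanation reports.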

I’ve used the LIME approach a few times in recent projects and really like the idea. It breaks down the modeling / classification techniques and output into a form that can be easily described to non-technical people.  That said, LIME isn’t a replacement for doing your job as a data scientist, but it is another tool to add to your toolbox.

To implement LIME in Python, I use this LIME library written / released by one of the authors of the above paper.

I thought it might be good to provide a quick run-through of how to use this library. For this post, I’m going to mimic the “Using lime for regression” notebook the authors provide, but I’m going to provide a little more explanation.

The full notebook is available in my repo here.

Getting started with Local Interpretable Model-agnostic Explanations (LIME)

Before you get started, you’ll need to install Lime.

pip install lime

Next, let’s import our required libraries.

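# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn release.
from sklearn.datasets import load_boston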
import sklearn.ensemble
import numpy as np
from sklearn.model_selection import train_test_split
import lime
import lime.lime_tabular

Let’s load the sklearn dataset called ‘boston’. This dataset contains house prices and is often used for machine learning regression examples.

boston = load_boston()

Before we do much else, let’s take a look at the description of the dataset to get familiar with it.  You can do this by running the following command:

print(boston['DESCR'])

The output is:

Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:  
    :Number of Instances: 506 
    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**
   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International
         Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Now that we have our data loaded, we want to build a regression model to predict Boston housing prices. We’ll use a random forest for this to follow the example by the authors.

First, we’ll set up the random forest model and then create our training and test data using the train_test_split function from sklearn. Then, we’ll fit the model to the training data.

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=1000)
train, test, labels_train, labels_test = train_test_split(boston.data, boston.target, train_size=0.80)
rf.fit(train, labels_train)

Now that we have a Random Forest Regressor trained, we can check some of the accuracy measures.

print('Random Forest MSError', np.mean((rf.predict(test) - labels_test) ** 2))

The MSError is 10.45. Now, let’s look at the MSError when predicting the mean.

print('MSError when predicting the mean', np.mean((labels_train.mean() - labels_test) ** 2))

From this, we get 80.09.

Without really knowing the dataset, it’s hard to say whether these errors are good or bad. Since we are really most interested in looking at the LIME approach, we’ll move along and assume these are decent errors.
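One rough way to put the two numbers above in context is to convert them to RMSE (which is in the same units as the target, $1000's) and to look at how much of the baseline’s squared error the model removes. This is only a sketch using the figures reported in this post. The split is random, so your own numbers will differ.

```python
import math

model_mse = 10.45      # Random Forest MSE reported above
baseline_mse = 80.09   # MSE when always predicting the training mean

# RMSE is in the same units as the target (MEDV, in $1000's).
model_rmse = math.sqrt(model_mse)
baseline_rmse = math.sqrt(baseline_mse)

# Fraction of the baseline's squared error that the model removes.
improvement = 1.0 - model_mse / baseline_mse

print("model RMSE:    %.2f" % model_rmse)
print("baseline RMSE: %.2f" % baseline_rmse)
print("error reduction vs. mean baseline: %.1f%%" % (100 * improvement))
```

With these figures, the model is off by roughly 3.2 (in $1000's) on a typical prediction and removes about 87% of the squared error of the mean-only baseline.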

To implement LIME, we need to get the categorical features from our data and then build an ‘explainer’. This is done with the following commands:

categorical_features = np.argwhere(
    np.array([len(set(boston.data[:,x]))
    for x in range(boston.data.shape[1])]) <= 10).flatten()

and the explainer:

explainer = lime.lime_tabular.LimeTabularExplainer(train, 
                                                   feature_names=boston.feature_names, 
                                                   class_names=['price'], 
                                                   categorical_features=categorical_features, 
                                                   verbose=True, mode='regression')
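The categorical-feature heuristic used above simply treats any column with 10 or fewer distinct values as categorical. Here is a small check of that rule on made-up data (the toy array is an assumption for illustration, not the Boston data):

```python
import numpy as np

# Toy data (made up): column 0 is continuous, column 1 is a 0/1 flag,
# column 2 is an index with a handful of distinct levels.
rng = np.random.default_rng(0)
toy = np.column_stack([
    rng.normal(size=50),             # 50 distinct floats
    rng.integers(0, 2, size=50),     # at most 2 distinct values
    rng.integers(1, 6, size=50),     # at most 5 distinct values
])

# Count distinct values per column, then keep the indices of "small" columns.
unique_counts = np.array([len(np.unique(toy[:, col])) for col in range(toy.shape[1])])
categorical_features = np.argwhere(unique_counts <= 10).flatten()

print(unique_counts)
print(categorical_features)
```

Only the indices of the two low-cardinality columns come back. On the Boston data, the same rule picks up columns like CHAS (a 0/1 dummy variable). Keep in mind that the threshold of 10 is a heuristic, so it is worth checking which columns it flags on your own data.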

Now, we can grab one of our test values and check out our prediction(s). Here, we’ll grab the test value at index 100, check the prediction, and see what the explainer has to say about it.

i = 100
exp = explainer.explain_instance(test[i], rf.predict, num_features=5)
exp.show_in_notebook(show_table=True)

LIME Explainer for regression

So…what does this tell us?

It tells us that the prediction for the test value at index 100 is 21.16, with the “RAD=24” value making the largest positive contribution and the other features shown contributing negatively to the prediction.

For regression, this isn’t quite as interesting (although it is useful). The LIME approach shows much more benefit (at least to me) when performing classification.

As an example, if you are trying to classify plants as edible or poisonous, LIME’s explanation is much more useful. Here’s an example from the authors.

LIME explanation of edible vs poisonous

Take a look at LIME when you have some time. It’s a good library to add to your toolkit, especially if you are doing a lot of classification work. It makes it much easier to ‘explain’ what the model is doing.

Forecasting Time Series data with Prophet – Part 4

This is the fourth in a series of posts about forecasting time series data with Prophet. The other parts can be found here:

In those previous posts, I looked at forecasting monthly sales data 24 months into the future using some example sales data that you can find here.

In this post, I want to look at the output of Prophet to see how we can apply some metrics to measure ‘accuracy’. When we start looking at ‘accuracy’ of forecasts, we can really do a whole lot of harm by using the wrong metrics and the wrong data to measure accuracy. That said, it’s good practice to always try to compare your predicted values with your actual values to see how well or poorly your model(s) are performing.

For the purposes of this post, I’m going to expand on the data in the previous posts. For this post we are using fbprophet version 0.2.1.  Also – we’ll need scikit-learn and scipy installed for looking at some metrics.

Note: While I’m using Prophet to generate the models, these metrics and tests for accuracy can be used with just about any modeling approach.

Since the majority of the work has been covered in Part 3, I’m going to skip down to the metrics section…you can see the entire code and follow along with the jupyter notebook here.

In the notebook, we’ve loaded the data. The visualization of the data looks like this:

sales monthly data

Our prophet model forecast looks like:

sales monthly data forecast

Again…you can see all the steps in the jupyter notebook if you want to follow along step by step.

Now that we have a prophet forecast for this data, let’s combine the forecast with our original data so we can compare the two data sets.

metric_df = forecast.set_index('ds')[['yhat']].join(df.set_index('ds').y).reset_index()

The above line of code takes the forecast values (‘yhat’) from the forecast dataframe, sets the index to ‘ds’ on both dataframes (to allow us to combine them) and then joins the forecasts with the original data. Lastly, we reset the index to get back to the non-date index that we’ve been working with (this isn’t necessary…just a step I took).

The new dataframe looks like this:

combined dataframe

You can see from the above that the last part of the dataframe has “NaN” for ‘y’…that’s fine because we are only concerned with checking the forecast values against the actual values, so we can drop these “NaN” values.

metric_df.dropna(inplace=True)

Now, we have a dataframe with just the original data (in the ‘y’ column) and forecasted data (in the yhat column) to compare.
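If you don’t have the notebook handy, the join-then-drop steps can be tried on a tiny made-up example. The numbers and dates below are invented. Only the column names (‘ds’, ‘y’, ‘yhat’) follow the dataframes used in this post.

```python
import pandas as pd

# Made-up numbers standing in for the real sales history and Prophet output.
df = pd.DataFrame({
    "ds": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-03-01"]),
    "y": [100.0, 110.0, 120.0],
})
forecast = pd.DataFrame({
    "ds": pd.to_datetime(["2017-01-01", "2017-02-01", "2017-03-01",
                          "2017-04-01", "2017-05-01"]),
    "yhat": [98.0, 113.0, 119.0, 131.0, 140.0],
})

# Left join on the date index: every forecast row is kept.
metric_df = forecast.set_index("ds")[["yhat"]].join(df.set_index("ds").y).reset_index()
print(metric_df)             # future rows have NaN in 'y'

# Drop the future rows, which have no actual value to compare against.
metric_df.dropna(inplace=True)
print(metric_df)             # only rows with both 'y' and 'yhat' remain
```

Because the join starts from the forecast dataframe, the future dates survive the join with “NaN” in ‘y’, and dropna() then removes exactly those rows.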

Now, we are going to take a look at a few metrics.

Metrics for measuring modeling accuracy

If you ask 100 different statisticians, you’ll probably get at least 50 different answers on ‘the best’ metrics to use for measuring accuracy of models. For most cases, using R-Squared, Mean Squared Error, or Mean Absolute Error (or a combo of them all) will get you a good enough measure of the accuracy of your model.

For me, I like to use R-Squared and Mean Absolute Error (MAE).  With these two measures, I feel like I can get a really good feel for how well (or poorly) my model is doing.
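For reference, here is what these three metrics compute, written out by hand in plain Python on a few made-up values. The function names are just illustrative, and in practice a library implementation is the better choice.

```python
def mse(actual, predicted):
    # Mean of the squared errors.
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    # Mean of the absolute errors.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    # 1 - (residual sum of squares / total sum of squares around the mean).
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration.
actual = [100.0, 110.0, 120.0, 130.0]
predicted = [98.0, 113.0, 119.0, 134.0]

print("MSE:", mse(actual, predicted))
print("MAE:", mae(actual, predicted))
print("R-Squared:", r_squared(actual, predicted))
```

MSE squares each error before averaging, so a few large misses dominate it. MAE stays in the original units of the data, which is part of why it is easier to read.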

Python’s scikit-learn has some good / easy methods for calculating these values. To use them, you’ll need to import them (and have scikit-learn and scipy installed). If you don’t have scikit-learn and scipy installed, you can install them with the following command:

pip install scikit-learn scipy

Now, you can import the metrics with the following command:

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

To calculate R-Squared, we simply do the following:

r2_score(metric_df.y, metric_df.yhat)

For this data, we get an R-Squared value of 0.99.   Now…this is an amazing value…it can be interpreted to mean that 99% of the variance in this data is explained by the model. Pretty darn good (but also very very naive in thinking). When I see an R-Squared value like this, I immediately think that the model has been overfit.   If you want to dig into a good read on what R-Squared means and how to interpret it, check out this post.

Now, let’s take a look at MSE.

mean_squared_error(metric_df.y, metric_df.yhat)

The MSE turns out to be 11,129,529.44. That’s a huge value…an MSE of 11 million tells me this model isn’t all that great, which isn’t surprising given the low number of data points used to build the model. That said, a high MSE isn’t necessarily a bad thing (MSE is expressed in squared units, so it can’t be compared directly with the sales values), but it gives you a good feel for the accuracy you can expect to see.

Lastly, let’s take a look at MAE.

mean_absolute_error(metric_df.y, metric_df.yhat)

For this model / data, the MAE turns out to be 2,601.15, which really isn’t all that bad. What that tells me is that for each data point, my average magnitude of error is roughly $2,600, which isn’t all that bad when we are looking at sales values in the $300K to $500K range.  BTW – if you want to take a look at an interesting comparison of MAE and RMSE (Root Mean Squared Error), check out this post.
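To put the MSE and MAE reported above on the same footing, it can help to take the square root of the MSE (giving RMSE, in dollars rather than squared dollars) and to express the MAE as a share of typical sales. This is just arithmetic on the figures from this post:

```python
import math

mse = 11129529.44   # MSE reported above (squared dollars)
mae = 2601.15       # MAE reported above (dollars)

# RMSE brings the squared error back to the units of the data.
rmse = math.sqrt(mse)
print("RMSE: %.0f" % rmse)

# MAE relative to the low and high end of the sales range mentioned above.
for sales in (300000, 500000):
    print("MAE as share of $%d: %.2f%%" % (sales, 100 * mae / sales))
```

The RMSE works out to roughly $3,336, which is the same order of magnitude as the MAE, and the MAE is under 1% of sales at either end of the range.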

Hopefully this has been helpful.  It wasn’t the intention of this post to explain the intricacies of these metrics, but hopefully you’ve seen a bit about how to use metrics to measure your models. I may go into more detail on modeling / forecasting accuracies in the future at some point. Let me know if you have any questions on this stuff…I’d be happy to expand if needed.

Note: In the jupyter notebook, I show the use of a new metrics library I found called ML-Metrics. Check it out…it’s another way to run some of the metrics.



Collecting / Storing Tweets with Python and MongoDB

A good amount of the work that I do involves using social media content for analyzing networks, sentiment, influencers and other various types of analysis.

In order to do this type of analysis, you first need to have some data to analyze. You can scrape websites like Twitter or Facebook using simple web scrapers, but I’ve always found it easier to use the APIs that these companies / websites provide to pull down data.

The Twitter Streaming API is ideal for grabbing data in real-time and storing it for analysis. Twitter also has a search API that lets you pull down a certain number of historical tweets (I think I read it was the last 1,000 tweets…but it’s been a while since I’ve looked at the Search API). I’m a fan of the Streaming API because it lets me grab a much larger set of data than the Search API, but it requires you to build a script that ‘listens’ to the API for your required keywords and then stores those tweets somewhere for later analysis.

There are tons of ways to connect up to the Streaming API. There are also quite a few Twitter API wrappers for Python (and most of them work very well).   I tend to use Tweepy more than others due to its ease of use and simple structure. Additionally, if I’m working on a small / short-term project, I tend to reach for MongoDB to store the tweets using the PyMongo module. For larger / longer-term projects I usually connect the streaming API script to MySQL instead of MongoDB simply because MySQL fits into my ecosystem of backup scripts, etc better than MongoDB does.  MongoDB is perfectly suited for this type of work for larger projects…I just tend to swing toward MySQL for those projects.

For this post, I wanted to share my script for collecting Tweets from the Twitter API and storing them into MongoDB.

Note: This script is a mashup of many other scripts I’ve found on the web over the years. I don’t recall where I found the pieces/parts of this script but I don’t want to discount the help I had from other people / sites in building this script.

Collecting / Storing Tweets with Python and MongoDB

Let’s set up our imports:

from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient

Next, set up your mongoDB path:

MONGO_HOST= 'mongodb://localhost/twitterdb'  # assuming you have mongoDB installed locally
                                             # and a database called 'twitterdb'

Next, set up the words that you want to ‘listen’ for on Twitter. You can use words or phrases separated by commas.

WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']

Here, I’m listening for words related to machine learning, data science, etc.

Next, let’s set up our Twitter API Access information.  You can set these up here.

CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"

Time to build the listener class.

class StreamListener(tweepy.StreamListener):    
    # This is a class provided by tweepy to access the Twitter Streaming API.
    # Note: StreamListener exists in tweepy 3.x; it was merged into tweepy.Stream in 4.0.
    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")
 
    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False
 
    def on_data(self, data):
        #This is the meat of the script...it connects to your mongoDB and stores the tweet
        try:
            client = MongoClient(MONGO_HOST)
            
            # Use twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb
    
            # Decode the JSON from Twitter
            datajson = json.loads(data)
            
            #grab the 'created_at' data from the Tweet to use for display
            created_at = datajson['created_at']
            #print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))
            
            #insert the data into the mongoDB into a collection called twitter_search
            #if twitter_search doesn't exist, it will be created.
            # insert_one replaces insert(), which was removed in PyMongo 4
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)
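One thing to be aware of in the on_data method above: the stream can also deliver messages that aren’t tweets (delete or limit notices, for example), and those have no ‘created_at’ field, so they end up in the except branch. Here is a small sketch of a helper that separates the decoding step so those messages can be skipped explicitly. The name tweet_to_document and the sample payloads are hypothetical and heavily trimmed.

```python
import json

def tweet_to_document(raw):
    """Return the decoded tweet as a dict, or None for non-tweet messages."""
    message = json.loads(raw)
    # Stream messages such as delete or limit notices carry no 'created_at'.
    if "created_at" not in message:
        return None
    return message

# Hypothetical, heavily trimmed payloads for illustration only.
tweet = '{"created_at": "Mon Jan 01 00:00:00 +0000 2018", "text": "hello #ml"}'
notice = '{"limit": {"track": 10}}'

print(tweet_to_document(tweet)["created_at"])
print(tweet_to_document(notice))
```

Inside on_data, you could call a helper like this first and only insert into mongoDB when it returns a document. That also makes the decoding step easy to test without a database or a Twitter connection.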

Now that we have the listener class, let’s set everything up to start listening.

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
#Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)

Now you are ready to go. The full script is below. You can store this script as “streaming_API.py” and run it as “python streaming_API.py”. Assuming you set up mongoDB and your Twitter API keys correctly, you should start collecting Tweets.

The Full Script:

from __future__ import print_function
import tweepy
import json
from pymongo import MongoClient
MONGO_HOST= 'mongodb://localhost/twitterdb'  # assuming you have mongoDB installed locally
                                             # and a database called 'twitterdb'
WORDS = ['#bigdata', '#AI', '#datascience', '#machinelearning', '#ml', '#iot']
CONSUMER_KEY = "KEY"
CONSUMER_SECRET = "SECRET"
ACCESS_TOKEN = "TOKEN"
ACCESS_TOKEN_SECRET = "TOKEN_SECRET"

class StreamListener(tweepy.StreamListener):    
    # This is a class provided by tweepy to access the Twitter Streaming API.
    # Note: StreamListener exists in tweepy 3.x; it was merged into tweepy.Stream in 4.0.
    def on_connect(self):
        # Called initially to connect to the Streaming API
        print("You are now connected to the streaming API.")
 
    def on_error(self, status_code):
        # On error - if an error occurs, display the error / status code
        print('An Error has occurred: ' + repr(status_code))
        return False
 
    def on_data(self, data):
        #This is the meat of the script...it connects to your mongoDB and stores the tweet
        try:
            client = MongoClient(MONGO_HOST)
            
            # Use twitterdb database. If it doesn't exist, it will be created.
            db = client.twitterdb
    
            # Decode the JSON from Twitter
            datajson = json.loads(data)
            
            #grab the 'created_at' data from the Tweet to use for display
            created_at = datajson['created_at']
            #print out a message to the screen that we have collected a tweet
            print("Tweet collected at " + str(created_at))
            
            #insert the data into the mongoDB into a collection called twitter_search
            #if twitter_search doesn't exist, it will be created.
            # insert_one replaces insert(), which was removed in PyMongo 4
            db.twitter_search.insert_one(datajson)
        except Exception as e:
            print(e)
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
#Set up the listener. The 'wait_on_rate_limit=True' is needed to help with Twitter API rate limiting.
listener = StreamListener(api=tweepy.API(wait_on_rate_limit=True)) 
streamer = tweepy.Stream(auth=auth, listener=listener)
print("Tracking: " + str(WORDS))
streamer.filter(track=WORDS)