Tag: machine learning


Python Data Weekly Roundup – Dec 18 2019

In this week’s Python Data Weekly Roundup:

The Last Matplotlib Tweaking Guide You’ll Ever Need

This is a very good ‘how to’ for beginners learning to tweak the Matplotlib visualization library. The article explains how to tweak Matplotlib charts, including changing the size, removing borders, and changing the colors and widths of chart lines. Each tweak comes with the Python code needed to apply it.
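To give a flavor of the kinds of tweaks the article covers, here is a minimal sketch of my own (not code from the linked article). The data points and the output file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Change the figure size (width, height in inches)
fig, ax = plt.subplots(figsize=(8, 4))

# Change the line color and width (hypothetical data points)
line, = ax.plot([1, 2, 3, 4], [1, 4, 9, 16], color="tab:red", linewidth=2.5)

# Remove the top and right borders (Matplotlib calls them "spines")
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

# Hypothetical output file name
fig.savefig("tweaked_chart.png")
```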

Arithmetic, Geometric, and Harmonic Means for Machine Learning

Did you know there are different types of averages (aka means)? After reading this article, you’ll understand the difference between the arithmetic, geometric and harmonic means, why you should use one over the other and how to calculate them using Python code.
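As a quick illustration of the three means, here is a small sketch using Python's built-in statistics module (geometric_mean needs Python 3.8 or later). The sample values are made up, and this is not code from the linked article.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
values = [2.0, 4.0, 8.0]

arithmetic = statistics.mean(values)            # (2 + 4 + 8) / 3
geometric = statistics.geometric_mean(values)   # (2 * 4 * 8) ** (1 / 3)
harmonic = statistics.harmonic_mean(values)     # 3 / (1/2 + 1/4 + 1/8)

print("arithmetic=%f geometric=%f harmonic=%f" % (arithmetic, geometric, harmonic))
```

For positive values that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean is the largest.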

What is My Data Worth?

Should you be paid for all the personal data that you’ve made available online? If so, what is that data worth? In this fantastic article, Ruoxi Jia describes how to value personal data and how to apply the Shapley Value in data valuation and in general machine learning usages (e.g., interpreting black-box model predictions). The graphic below shows two images from the article:

(a) The Shapley value produced by our proposed exact approach and the baseline Monte-Carlo approximation algorithm for the KNN classifier constructed with 1000 randomly selected training points from MNIST. (b) Runtime comparison of the two approaches as the training size increases.

Shapley Value Graph
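If the Shapley Value is new to you, the idea is to credit each contributor with their average marginal contribution over every possible ordering of contributors. Here is a tiny sketch of that exact calculation. The three contributors and the toy_value function are hypothetical, and this brute-force version is only practical for a handful of players (it is not the KNN approach from the article).

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            totals[p] += value(frozenset(coalition)) - before
    return {p: totals[p] / len(orderings) for p in players}

# Hypothetical toy game: A and B hold redundant data worth 60 together or alone,
# while C holds independent data worth 40.
def toy_value(coalition):
    score = 0.0
    if "A" in coalition or "B" in coalition:
        score += 60.0
    if "C" in coalition:
        score += 40.0
    return score

phi = shapley_values(["A", "B", "C"], toy_value)
print(phi)
```

In this toy game A and B split the 60 they jointly provide, C keeps the full 40, and the values sum to the worth of the whole group.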

FastSpeech: New text-to-speech model improves on speed, accuracy, and controllability

In this article, Microsoft Senior Researcher Xu Tan describes a new text-to-speech model called FastSpeech. The model is claimed to be fast, robust, controllable and high quality (all valuable and necessary features). A deep dive into the model can be found here.

How to Develop Super Learner Ensembles in Python

Another great article from Jason Brownlee describing how to combine multiple models into an ensemble model for use in predictive modeling. Jason provides Python code that you can use to build your own Super Learner with scikit-learn. Additionally, and more importantly, Jason does a fantastic job of highlighting the theory behind Super Learners, with many links to articles and journals on the topic.
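For a rough feel of the idea, here is a sketch of my own that uses scikit-learn's built-in StackingClassifier, a closely related stacking approach that I've swapped in for brevity (it is not Jason's code). The synthetic dataset and the choice of base models are assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the sketch is self-contained
X, y = make_classification(n_samples=500, n_features=10, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12)

# Base models whose out-of-fold predictions feed the meta-model
base_models = [
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=12)),
]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),  # meta-model that combines the base models
    cv=5,  # out-of-fold predictions are built with 5-fold cross validation
)
stack.fit(X_train, y_train)
stacked_accuracy = stack.score(X_test, y_test)
print("Stacked accuracy: %f" % stacked_accuracy)
```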

Strengthening the AI community

An overview of the DeepMind scholarship program as well as a description of why it makes sense to help others move into the field of AI.

Text Generation with Python

Natural Language Processing is well known as a way to analyze text. I’ve written a bit about using NLP here on the site (see here and here). In this article, Julien Heiduk describes how he was able to use the GPT-2 model to generate text with Python. In fact, the article is almost completely generated text via the GPT-2 model…and it does a good job of creating readable and understandable content.

Best Degree for Data Science (in One Picture)

Is there a ‘best’ degree for data science? Personally, I don’t think there is…but I can see there being better degrees for people who are just starting out. For example, all things being equal on the personal front, a degree in statistics is going to be much better for you than a degree in horticulture…but that’s not to say the statistics degree makes you a better data scientist…it just gives you the tools to get into the field quicker than someone with the horticulture degree. That said, I do like what Stephanie Glen says in this article when she writes: “getting a degree should be looked at as a stepping block, not a train ride to a destination. No single degree is likely to get you in the door.”

That’s it for this week’s Python Data Weekly Roundup. Subscribe to our newsletter to receive this weekly roundup in your email.

Comparing Machine Learning Methods

When working with data and modeling, it’s sometimes hard to determine what model you should use for a particular modeling project. A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop to see how various models work against your data. In this post, I’ll be comparing machine learning methods using a few different sklearn algorithms. As always, you can find a Jupyter notebook for this article on my GitHub here and find other articles on this topic here.

I’ve used Jason Brownlee’s article from 2016 as the basis for this article…I wanted to expand a bit on what he did as well as use a different dataset. In this article, we’ll be using the Indian Liver Disease dataset (found here).

From the dataset page:

This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The “Dataset” column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.

Let’s get started by setting up our imports that we’ll use.

import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Next, we’ll read in the data from the CSV file located in the local directory.

#read in the data
data = pd.read_csv('indian_liver_patient.csv')

If you do a head() of the dataframe, you’ll get a good feeling for the dataset.

Indian Liver Disease Data

We’ll use all columns except Gender for this tutorial. We could use gender by converting the gender to a numeric value (e.g., 0 for Male, 1 for Female) but for the purposes of this post, we’ll just skip this column.

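# Drop the Gender column on a copy so the original dataframe stays intact.
# dropna() is an added safeguard (an assumption here): scikit-learn models reject missing values.
data_to_use = data.drop(columns=['Gender']).dropna()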
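If you did want to keep the Gender column, one way to convert it is with a simple mapping. This is just a sketch on a made-up three-row sample (the rows are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical three-row sample that mimics the Gender column
sample = pd.DataFrame({"Age": [65, 62, 58],
                       "Gender": ["Female", "Male", "Male"]})

# Map each label to a number, following the 0 = Male, 1 = Female example
sample["Gender"] = sample["Gender"].map({"Male": 0, "Female": 1})
print(sample)
```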

The ‘Dataset’ column is the value we are trying to predict (whether the patient has liver disease or not), so we’ll use that as our “Y” and the other columns as our “X” array.

values = data_to_use.values

Y = values[:,9]
X = values[:,0:9]

Before we run our machine learning models, we need to set a random seed so that the results are reproducible. This can be any number that you’d like it to be. Some people like to use a random number generator, but for the purposes of this post, I’ll just set it to 12 (it could just as easily be 1 or 3 or 1023 or any other number).

random_seed = 12
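To see why a fixed seed matters, here is a small sketch using Python's built-in random module (my own illustration, separate from the sklearn code in this post). Two generators created with the same seed produce exactly the same draws.

```python
import random

random_seed = 12

# Two generators created with the same seed produce the same sequence
first_run = random.Random(random_seed)
second_run = random.Random(random_seed)

first_draws = [first_run.randint(0, 100) for _ in range(5)]
second_draws = [second_run.randint(0, 100) for _ in range(5)]

print(first_draws == second_draws)  # True: same seed, same draws
```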

Now we need to set up our models that we’ll be testing out. We’ll set up a list of the models and give them each a name. Additionally, I’m going to set up the blank arrays/lists for the outcomes and the names of the models to use for comparison.

outcome = []
model_names = []
models = [('LogReg', LogisticRegression()), 
          ('SVM', SVC()), 
          ('DecTree', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('LinDisc', LinearDiscriminantAnalysis()),
          ('GaussianNB', GaussianNB())]

We are going to use a k-fold validation to evaluate each algorithm and will run through each model with a for loop, running the analysis and then storing the outcomes into the lists we created above. We’ll use a 10-fold cross validation.
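If k-fold validation is new to you, here is a pure-Python sketch of how the data gets carved into folds. The helper name k_fold_indices is my own, and this only approximates what an unshuffled split does; it is not scikit-learn's code.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

# Hypothetical tiny example: 10 samples split into 5 folds
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in the test set exactly once, and each model is trained on the remaining folds.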

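for model_name, model in models:
    # shuffle=True lets random_state take effect (recent scikit-learn raises an error without it); scores may differ slightly from those listed below
    k_fold_validation = model_selection.KFold(n_splits=10, shuffle=True, random_state=random_seed)
    results = model_selection.cross_val_score(model, X, Y, cv=k_fold_validation, scoring='accuracy')
    outcome.append(results)  # keep the per-fold accuracies for the box plot
    model_names.append(model_name)
    output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
    print(output_message)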

The output from this loop is:

LogReg| Mean=0.718633 STD=0.058744
SVM| Mean=0.715124 STD=0.058962
DecTree| Mean=0.637568 STD=0.108805
KNN| Mean=0.651301 STD=0.079872
LinDisc| Mean=0.716878 STD=0.050734
GaussianNB| Mean=0.554719 STD=0.081961
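To make the comparison a bit easier to read, here is a small sketch that ranks the models by mean accuracy. The numbers are copied from the output above, and the results dictionary is just my own way of holding them.

```python
# Mean accuracy and standard deviation, copied from the output above
results = {
    "LogReg": (0.718633, 0.058744),
    "SVM": (0.715124, 0.058962),
    "DecTree": (0.637568, 0.108805),
    "KNN": (0.651301, 0.079872),
    "LinDisc": (0.716878, 0.050734),
    "GaussianNB": (0.554719, 0.081961),
}

# Sort model names from highest to lowest mean accuracy
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)

for name in ranking:
    mean, std = results[name]
    print("%-10s mean=%.3f std=%.3f" % (name, mean, std))
```

Note that the top three means sit within about 0.004 of each other, which is far smaller than any of their standard deviations (roughly 0.05 to 0.06).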

From the above, it looks like the Logistic Regression, Support Vector Machine and Linear Discriminant Analysis methods are providing the best results (based on the ‘mean’ values). Taking Jason’s lead, we can use a box plot to look at the spread of accuracy across the cross validation folds for each model, which shows how each one performs relative to the others and to their means.

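fig = plt.figure()
fig.suptitle('Machine Learning Model Comparison')
ax = fig.add_subplot(111)
plt.boxplot(outcome)  # one box per model, built from its per-fold accuracies
ax.set_xticklabels(model_names)
plt.show()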

Machine Learning Comparison

From the box plot, it is easy to see that the three machine learning methods mentioned above (Logistic Regression, Support Vector Machine and Linear Discriminant Analysis) are providing better accuracies. From this outcome, we can start working with these three models to see how we might optimize the modeling process and whether one model works a bit better than the others.
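One possible next step (a sketch of my own, not something covered in this post) is a small grid search over a model's settings. The synthetic data and the grid of C values below are assumptions made so the example runs on its own; with the liver dataset you would pass X and Y instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=12)

# Hypothetical grid: try a few regularization strengths for logistic regression
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=10,               # same number of folds used in the comparison loop
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```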

Book Review – Machine Learning With Random Forests And Decision Trees by Scott Hartshorn

I just finished reading Machine Learning With Random Forests And Decision Trees: A Mostly Intuitive Guide, But Also Some Python (amazon affiliate link).

The short review

This is a great introductory book for anyone looking to learn more about Random Forests and Decision Trees. You won’t be an expert after reading this book, but you’ll understand the basic theory and how to implement random forests in Python.

The long(ish) review

This is a short book – only 76 pages. But…those 76 pages are full of good, introductory information on Random Forests and Decision Trees.  Even though I’ve been using random forests and other machine learning approaches in python for years, I can easily see value for people that are just starting out with machine learning and/or random forests. That said, there were a few things in the book that I had either forgotten or didn’t know (Entropy Criteria for example).

While the entire book is excellent, the section on Feature Importance is the best in the book.  This section provides a very good description of the ‘why’ and the ‘how’ of feature importance (and therefore, feature selection) for use in random forests and decision trees.  There are some very good points made in this section regarding how to get started with feature selection and cross validation.

Additionally, the book provides a decent overview of the idea of ‘out-of-sample’ (or ‘Out-of-bag’) data.  I’m a huge believer in keeping some data out of your initial training data set to use for validation after you’ve built your models.
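To make the ‘out-of-bag’ idea concrete, here is a small sketch of my own (not from the book) using only Python's built-in random module. Each tree in a random forest is trained on a bootstrap sample, and the rows that never get drawn are out-of-bag for that tree. The sample size and seed are arbitrary choices.

```python
import random

rng = random.Random(12)  # fixed seed so the sketch is repeatable
n_samples = 1000
indices = list(range(n_samples))

# A bootstrap sample draws n_samples rows with replacement
bootstrap = [rng.choice(indices) for _ in range(n_samples)]

# Rows that were never drawn are "out-of-bag" for this tree
out_of_bag = set(indices) - set(bootstrap)
oob_fraction = len(out_of_bag) / n_samples

print("Out-of-bag fraction: %.3f" % oob_fraction)
```

For large samples the out-of-bag fraction tends toward 1/e, or roughly 37%, so each tree gets a sizable validation set without any extra data being set aside.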

If you’re looking for a good introductory book on random forests and decision trees, pick this one up (amazon affiliate link)…it’s only $2.99 for the Kindle version. Like I mentioned earlier, this book won’t make you an expert but it will provide a solid grounding to get started on the topic of random forests, decision trees and machine learning.

One negative comment I have on this book is that there is very little python in the book. The book isn’t marketed as strictly a python book, but I would have expected a bit more python in the book to help drive home some of the theory with runnable code. That said, this is a very small negative to the book overall.