# Tag: data science

## Comparing Machine Learning Methods

When working with data and modeling, its sometimes hard to determine what model you should use for a particular modeling project.  A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop to see how various models work against your data. In this post, I’ll be comparing machine learning methods using a few different sklearn algorithms.  As always, you can find a jupyter notebook for this article on my github here and find other articles on this topic here.

I’ve used Jason Brownlee’s article from 2016 as the basis for this article…I wanted to expand a bit on what he did as well as use a different dataset. In this article, we’ll be using the Indian Liver Disease dataset (found here).

From the dataset page:

This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The “Dataset” column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.

Let’s get started by setting up our imports that we’ll use.

```import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

```

Next, we’ll read in the data from the CSV file located in the local directory.

```#read in the data

If you do a `head()` of the dataframe, you’ll get a good feeling for the dataset.

We’ll use all columns except Gender for this tutorial. We could use gender by converting the gender to a numeric value (e.g., 0 for Male, 1 for Female) but for the purposes of this post, we’ll just skip this column.

```data_to_use = data
del data_to_use['Gender']
data_to_use.dropna(inplace=True)```

The ‘Dataset’ column is the value we are trying to predict…whether the user has liver disease or not so we’ll that as our “Y” and the other columns for our “X” array.

```values = data_to_use.values
Y = values[:,9]
X = values[:,0:9]```

Before we run our machine learning models, we need to set a random number to use to seed them. This can be any random number that you’d like it to be. Some people like to use a random number generator but for the purposes of this, I’ll just set it to 12 (it could just as easily be 1 or 3 or 1023 or any other number).

`random_seed = 12`

Now we need to set up our models that we’ll be testing out. We’ll set up a list of the models and give them each a name. Additionally, I’m going to set up the blank arrays/lists for the outcomes and the names of the models to use for comparison.

```outcome = []
model_names = []
models = [('LogReg', LogisticRegression()),
('SVM', SVC()),
('DecTree', DecisionTreeClassifier()),
('KNN', KNeighborsClassifier()),
('LinDisc', LinearDiscriminantAnalysis()),
('GaussianNB', GaussianNB())]```

We are going to use a k-fold validation to evaluate each algorithm and will run through each model with a for loop, running the analysis and then storing the outcomes into the lists we created above. We’ll use a 10-fold cross validation.

```for model_name, model in models:
k_fold_validation = model_selection.KFold(n_splits=10, random_state=random_seed)
results = model_selection.cross_val_score(model, X, Y, cv=k_fold_validation, scoring='accuracy')
outcome.append(results)
model_names.append(model_name)
output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
print(output_message)```

The output from this loop is:

```LogReg| Mean=0.718633 STD=0.058744
SVM| Mean=0.715124 STD=0.058962
DecTree| Mean=0.637568 STD=0.108805
KNN| Mean=0.651301 STD=0.079872
LinDisc| Mean=0.716878 STD=0.050734
GaussianNB| Mean=0.554719 STD=0.081961```

From the above, it looks like the Logistic Regression, Support Vector Machine and Linear Discrimination Analysis methods are providing the best results (based on the ‘mean’ values). Taking Jason’s lead, we can take a look at a box plot to see what the accuracy is for each cross validation fold, we can see just how good each does relative to each other and their means.

```fig = plt.figure()
fig.suptitle('Machine Learning Model Comparison')
plt.boxplot(outcome)
ax.set_xticklabels(model_names)
plt.show()```

From the box plot, when it is easy to see the three mentioned machine learning methods (Logistic Regression, Support Vector Machine and Linear Discrimination Analysis) are providing better accuracies. From this outcome, we can then take this data and start working with these three models to see how we might be able to optimize the modeling process to see if one model works a bit better than others.

## Data Analytics & Python

So you want (or need) to analyze some data. You’ve got some data in an excel spreadsheet or database somewhere and you’ve been asked to take that data and do something useful with it. Maybe its time for data analytics & Python?

Maybe you’ve been asked to build some models for predictive analytics. Maybe you’ve been asked to better understand your customer base based on their previous purchases and activity.  Perhaps you’ve been asked to build a new business model to generate new revenue.

Where do you start?

You could go out and spend a great deal of money on systems to help you in your analytics efforts, or you could start with tools that are available to you already.  You could open up excel, which is very much overlooked by people these days for data analytics. Or…you could install open source tools (for free!) and begin hacking away.

When I was in your shoes in my first days playing around with data, I started with excel. I quickly moved on to other tools because the things I needed to do seemed difficult to accomplish in excel. I then installed R and began to learn ‘real’ data analytics (or so I thought).

I liked (and still do like) R, but it never felt like ‘home’ to me.  After a few months poking around in R, I ran across python and fell in love. Python felt like home to me.

With python, I could quickly cobble together a script to do just about anything I needed to do. In the 5+ years I’ve been working with python now, I’ve not found anything that I cannot do with python and freely available modules.

Need to do some time series analysis and/or forecasting? Python and statsmodels (along with others).

Need to do some natural language processing?  Python and NLTK (along with others).

Need to do some machine learning work? Python and sklearn (along with others).

You don’t HAVE to use python for data analysis. R is perfectly capabale of doing the same things python is – and in some cases, R has more capabilities than python does because its been used an analytics tool for much longer than python has.

That said, I prefer python and use python in everything I do. Data analytics & python go together quite well.