If you haven’t taken a look at time-series databases, you should. For a lot of what we do today in data science (e.g., stream processing), a time-series database might make sense. This article isn’t long, but it’s a quick introduction to the topic.
This is a very good article describing some of the challenges that data scientists face today. It’s not all rosy out there…there are a lot of challenges and issues. This is definitely worth a read…I wonder how many of you are seeing these in your current jobs?
BERT (Bidirectional Encoder Representations from Transformers) is Google’s deep learning algorithm for NLP (natural language processing). It helps computers and machines understand language as we humans do. Put simply, BERT may help Google better understand the meaning of words in search queries.
If you’ve ever had to work on customer churn, you know it can be tough. This article describes how to use categorical features to understand customer churn. It’s really good and worth the effort to read / follow.
When working with data and modeling, it’s sometimes hard to determine what model you should use for a particular modeling project. A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop to see how various models work against your data. In this post, I compare machine learning methods using a few different sklearn algorithms.
This article provides a good overview of the Data Natives 2019 – Europe meeting and the main trends being discussed for 2020 and beyond. For example, topics such as “AI and its use in Healthcare” and “AI and Ethics” looked like good talks.
An excellent review of “Ray”, a distributed computing system for Python.
Ray is an open-source system for scaling Python applications from single machines to large clusters. Its design is driven by the unique needs of next-generation ML/AI systems, which face several unique challenges, including diverse computational patterns, management of distributed, evolving state, and the desire to address all those needs with minimal programming effort.
As always, Jason Brownlee does a great job explaining how to begin building an intuition for identifying imbalanced and skewed distributions, and how to handle / manage those distributions. One of the most difficult things to do in data science / machine learning is to understand and manage data with different distributions. You can’t always apply a model to a dataset, because the distribution of the data may make that model invalid.
Scatter Plot of Binary Classification Dataset With A 1 to 10 Class Distribution – from here.
It should be no surprise that academia and industry approach data science and machine learning differently. This article describes some of the differences, including accuracy, training vs production, engineering focus (e.g., end-to-end pipeline development) and more.
A very good paper describing the challenges of technical debt with machine learning systems. The abstract:
Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.
A recent post I wrote describing how to perform market basket analysis using Python and pandas. I provide a walk-through of using MLxtend’s apriori function as well as a ‘roll your own’ approach to market basket analysis.
When working with data and modeling, it’s sometimes hard to determine what model you should use for a particular modeling project. A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop to see how various models work against your data. In this post, I’ll be comparing machine learning methods using a few different sklearn algorithms. As always, you can find a Jupyter notebook for this article on my GitHub here and find other articles on this topic here.
I’ve used Jason Brownlee’s article from 2016 as the basis for this article…I wanted to expand a bit on what he did as well as use a different dataset. In this article, we’ll be using the Indian Liver Disease dataset (found here).
From the dataset page:
This data set contains 416 liver patient records and 167 non liver patient records collected from North East of Andhra Pradesh, India. The “Dataset” column is a class label used to divide groups into liver patient (liver disease) or not (no disease). This data set contains 441 male patient records and 142 female patient records.
Let’s get started by setting up our imports that we’ll use.
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (20,10)
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
Next, we’ll read in the data from the CSV file located in the local directory.
#read in the data
data = pd.read_csv('indian_liver_patient.csv')
If you do a head() of the dataframe, you’ll get a good feeling for the dataset.
We’ll use all columns except Gender for this tutorial. We could use gender by converting the gender to a numeric value (e.g., 0 for Male, 1 for Female) but for the purposes of this post, we’ll just skip this column.
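# drop the Gender column, plus any rows with missing values (most sklearn models reject NaN)
data_to_use = data.drop(columns=['Gender']).dropna()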
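If you did want to keep Gender, the numeric conversion mentioned above could look something like this. It's only a sketch: the tiny `sample` frame is made up for illustration and stands in for the real dataset, and the 0/1 coding simply follows the example given in the text.

```python
import pandas as pd

# tiny made-up frame standing in for the real dataset (illustrative values only)
sample = pd.DataFrame({
    'Age': [65, 62, 58],
    'Gender': ['Female', 'Male', 'Male'],
})

# map each label to a numeric code: 0 for Male, 1 for Female
sample['Gender'] = sample['Gender'].map({'Male': 0, 'Female': 1})

print(sample)
```

Any label missing from the mapping dictionary becomes NaN, so it's worth checking the column for unexpected values before relying on it.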
The ‘Dataset’ column is the value we are trying to predict (whether the patient has liver disease or not), so we’ll use that as our “Y” and the other columns for our “X” array.
values = data_to_use.values
Y = values[:,9]
X = values[:,0:9]
Before we run our machine learning models, we need to set a seed for the random number generator so that the results are repeatable. This can be any number that you’d like it to be. Some people like to use a random number generator to pick it, but for the purposes of this post, I’ll just set it to 12 (it could just as easily be 1 or 3 or 1023 or any other number).
random_seed = 12
Now we need to set up our models that we’ll be testing out. We’ll set up a list of the models and give them each a name. Additionally, I’m going to set up the blank arrays/lists for the outcomes and the names of the models to use for comparison.
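The setup code for this step isn't shown here, so below is a minimal sketch of what it might look like. It assumes default hyperparameters for the six classifiers imported earlier, apart from a raised `max_iter` on the logistic regression, which is my own addition to cut down on convergence warnings. The short labels and the list names `outcome` and `model_names` are illustrative choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# (name, estimator) pairs; the short names are labels chosen for this sketch
models = [
    ('LogReg', LogisticRegression(max_iter=1000)),  # max_iter is an assumption
    ('SVM', SVC()),
    ('DecTree', DecisionTreeClassifier()),
    ('KNN', KNeighborsClassifier()),
    ('LinDisc', LinearDiscriminantAnalysis()),
    ('GaussianNB', GaussianNB()),
]

# empty lists that the comparison loop fills in
outcome = []      # per-fold accuracy scores, one array per model
model_names = []  # model labels, in the same order as outcome
```

Keeping the name next to each estimator makes it easy to label the printed results and the plot later on.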
We are going to use a k-fold validation to evaluate each algorithm and will run through each model with a for loop, running the analysis and then storing the outcomes into the lists we created above. We’ll use a 10-fold cross validation.
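for model_name, model in models:
    # shuffle=True is needed for random_state to take effect (recent sklearn raises an error without it)
    k_fold_validation = model_selection.KFold(n_splits=10, shuffle=True, random_state=random_seed)
    results = model_selection.cross_val_score(model, X, Y, cv=k_fold_validation, scoring='accuracy')
    outcome.append(results)  # per-fold scores, stored in the outcomes list set up earlier
    model_names.append(model_name)  # matching label, stored in the names list set up earlier
    output_message = "%s| Mean=%f STD=%f" % (model_name, results.mean(), results.std())
    print(output_message)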
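If k-fold cross validation is new to you, this small sketch shows how KFold carves up a dataset. It uses a made-up 10-row array and 5 folds (rather than the 10 folds used in this post) just to keep the output short; `toy_X`, `fold_sizes` and `seen_in_test` are names invented for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(20).reshape(10, 2)  # 10 made-up rows, 2 columns

kf = KFold(n_splits=5, shuffle=True, random_state=12)

fold_sizes = []    # (train size, test size) for each fold
seen_in_test = []  # every row index that lands in a test fold

for train_idx, test_idx in kf.split(toy_X):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_in_test.extend(test_idx.tolist())

print(fold_sizes)
print(sorted(seen_in_test))
```

Each fold trains on 8 rows and tests on the remaining 2, and every row ends up in exactly one test fold. That's why the mean across folds is a reasonable estimate of how a model behaves on data it hasn't seen.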
From the above, it looks like the Logistic Regression, Support Vector Machine and Linear Discriminant Analysis methods are providing the best results (based on the ‘mean’ values). Taking Jason’s lead, we can take a look at a box plot to see what the accuracy is for each cross validation fold. That lets us see just how well each model does relative to the others and to their means.
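The plotting code isn't included here, so this is a rough sketch of one way to draw that box plot with matplotlib. It assumes the per-fold scores and model labels were collected into two lists during the comparison loop. The function name `plot_outcomes`, the title and the axis label are my own choices for illustration.

```python
import matplotlib.pyplot as plt


def plot_outcomes(outcome, model_names):
    """Draw one box per model from its per-fold accuracy scores."""
    fig, ax = plt.subplots()
    fig.suptitle('Machine Learning Model Comparison')  # illustrative title
    ax.boxplot(outcome)              # one box per entry in outcome
    ax.set_xticklabels(model_names)  # label the boxes in the same order
    ax.set_ylabel('Accuracy')
    return fig, ax


# usage, once the comparison loop has filled the two lists:
# fig, ax = plot_outcomes(outcome, model_names)
# plt.show()
```

Returning the figure and axes rather than calling `plt.show()` inside the function keeps it usable in both a notebook and a script.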
From the box plot, it is easy to see that the three machine learning methods mentioned above (Logistic Regression, Support Vector Machine and Linear Discriminant Analysis) are providing better accuracies. From this outcome, we can start working with these three models to see how we might optimize the modeling process and whether one model works a bit better than the others.