Python Data Weekly Roundup – Dec 27 2019

In this week’s Python Data Weekly Roundup:

Picks On AI Trends from Data Natives 2019

This article provides a good overview of the Data Natives 2019 – Europe meeting and the main trends being discussed for 2020 and beyond. For example, the talks on “AI and its use in Healthcare” and “AI and Ethics” looked worthwhile.

Ray for the Curious

An excellent review of “Ray”, a distributed computing system for Python. Ray is described as:

“an open-source system for scaling Python applications from single machines to large clusters. Its design is driven by the unique needs of next-generation ML/AI systems, which face several unique challenges, including diverse computational patterns, management of distributed, evolving state, and the desire to address all those needs with minimal programming effort.”

Develop an Intuition for Severely Skewed Class Distributions

As always, Jason Brownlee does a great job explaining how to begin building an intuition for identifying imbalanced and skewed distributions, and how to handle them. One of the most difficult things to do in data science and machine learning is to understand and manage data with different distributions. You can’t always apply a model to a dataset, because the distribution of that data can make the model invalid.

Scatter Plot of Binary Classification Dataset With A 1 to 10 Class Distribution – from here.
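
To make the point about skewed distributions concrete, here is a minimal sketch of my own (it is not taken from Jason's article). It builds a hypothetical label list with a 1 to 10 class ratio and shows that a "model" which always predicts the majority class still scores a high accuracy while finding none of the minority examples. The function names and the 1000/100 split are assumptions made for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction of all examples."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy of a baseline that always predicts the most common class."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


def minority_recall_of_baseline(labels):
    """Recall on the rarest class for the always-predict-majority baseline."""
    counts = Counter(labels)
    majority_label, _ = counts.most_common(1)[0]
    minority_label, minority_count = counts.most_common()[-1]
    # The baseline predicts majority_label for every example, so it only
    # "finds" minority examples if the two labels happen to be the same class.
    hits = minority_count if minority_label == majority_label else 0
    return hits / minority_count


# Hypothetical dataset: 1000 majority examples (0) and 100 minority examples (1).
labels = [0] * 1000 + [1] * 100

distribution = class_distribution(labels)
baseline_accuracy = majority_baseline_accuracy(labels)
baseline_recall = minority_recall_of_baseline(labels)

print(f"class shares: {distribution}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"minority-class recall of that baseline: {baseline_recall:.3f}")
```

The takeaway is that plain accuracy can look respectable on a skewed dataset even when the model is useless for the class you probably care about, which is why it is worth checking the class distribution before picking a model or a metric.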

Seven differences between academia and industry for building machine learning and deep learning models

It should be no surprise that academia and industry approach data science and machine learning differently. This article describes some of the differences, including accuracy, training vs. production, and engineering focus (e.g., end-to-end pipeline development).

Hidden Technical Debt in Machine Learning Systems (PDF)

A very good paper describing the challenges of technical debt in machine learning systems. The abstract:

Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.

Market Basket Analysis with Python and Pandas

A recent post I wrote describing how to perform market basket analysis using Python and pandas. I provide a walk-through of using MLxtend’s apriori function as well as a ‘roll your own’ approach to market basket analysis.
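
For a flavor of what a ‘roll your own’ approach can look like, here is a rough sketch using only the standard library. It is not the code from the post and it does not use MLxtend or pandas. It simply counts item pairs and computes the usual support, confidence, and lift measures. The `pair_rules` function, the `min_support` threshold, and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.2):
    """Compute support, confidence, and lift for rules between item pairs.

    support(A, B)     = share of baskets containing both A and B
    confidence(A → B) = baskets with A and B / baskets with A
    lift(A → B)       = confidence(A → B) / share of baskets containing B
    """
    n = len(transactions)
    # Use sets so an item repeated within one basket is counted once.
    baskets = [set(t) for t in transactions]
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # skip pairs that appear together too rarely
        # Each frequent pair yields two directed rules: a → b and b → a.
        for antecedent, consequent in ((a, b), (b, a)):
            confidence = count / item_counts[antecedent]
            lift = confidence / (item_counts[consequent] / n)
            rules.append(
                {
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": support,
                    "confidence": confidence,
                    "lift": lift,
                }
            )
    return rules


# Hypothetical baskets, made up for illustration.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

rules = pair_rules(transactions, min_support=0.4)
for rule in rules:
    print(rule)
```

A lift above 1 suggests the two items show up together more often than you would expect if they were bought independently. This sketch only looks at pairs. Larger itemsets are where an Apriori implementation such as MLxtend’s earns its keep.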