Category: Weekly Roundup

data roundup

Python Data Weekly Roundup – Jan 10 2020

In this week’s Python Data Weekly Roundup:

A Comprehensive Learning Path to Understand and Master NLP in 2020

If you’re looking to learn more about Natural Language Processing (NLP) in 2020, this article lays out a solid learning path, with links to articles, courses, videos and more to get you started down the road to becoming proficient with the tools and methods of NLP.

The Best of Both Worlds: Forecasting US Equity Market Returns using a Hybrid Machine Learning – Time Series Approach


Predicting long-term equity market returns is of great importance for investors to strategically allocate their assets. We apply machine learning methods to forecast 10-year-ahead U.S. stock returns and compare the results to traditional Shiller regression-based forecasts more commonly used in the asset-management industry. Machine-learning forecasts have similar forecast errors to a traditional return forecast model based on lagged CAPE ratios. However, machine-learning forecasts have higher forecast errors than the regression-based, two-step approach of Davis et al [2018] that forecasts the CAPE ratio based on macroeconomic variables and then imputes stock returns. When we combine our two-step approach with machine learning to forecast CAPE ratios (a hybrid ML-VAR approach), U.S. stock return forecasts are statistically and economically more accurate than all other approaches. We discuss why and conclude with some best practices for both data scientists and economists in making real-world investment return forecasts.

Source: Improving U.S. stock return forecasts: A “fair-value” CAPE approach

Building machine learning workflows with AWS Data Exchange and Amazon SageMaker

This article describes how to use Amazon SageMaker and AWS Data Exchange to build a machine learning model and machine learning workflows. What I found interesting is the ability to use AWS Data Exchange to find a large number of different types of data.

Tutorial: Python Regex (Regular Expressions) for Data Scientists

I hate regex. Of course I love the functionality and capabilities of using regex, but I loathe my inability to come up with my own regex ‘formulas’. I *always* have to go out on the web to search for how to do what I’m trying to do.  This article doesn’t solve that problem for me, but it does provide a refresher in regex patterns and a reminder why regex is important.
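As a quick refresher of my own (this is not taken from the linked tutorial), here is a minimal sketch of the three calls I reach for most in Python’s built-in `re` module: `findall`, capture groups, and `sub`. The sample text and the patterns are made up for illustration.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Order 1042 shipped 2020-01-10; order 1043 shipped 2020-01-03."

# Three capture groups: year, month, day.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# With groups in the pattern, findall returns one tuple per match.
dates = date_pattern.findall(text)

# A single group returns a plain list of strings.
order_ids = re.findall(r"[Oo]rder (\d+)", text)

# sub replaces every match with the given string.
masked = date_pattern.sub("<DATE>", text)

print(dates)
print(order_ids)
print(masked)
```

Compiling the pattern once with `re.compile` keeps it reusable across `findall` and `sub`, which is handy when the same pattern is applied to many rows of data.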

That’s it for this week’s Python Data Weekly Roundup. Subscribe to our newsletter to receive this weekly roundup in your email.



Python Data Weekly Roundup – Jan 3 2020

In this week’s Python Data Weekly Roundup:

It’s time for Time-series Databases

If you haven’t taken a look at time-series databases, you should. For a lot of what we do today in data science, a time-series database might make sense (e.g., stream processing). While this article isn’t long, it’s a quick introduction to the topic.

5 Key Reasons Why Data Scientists Are Quitting their Jobs

This is a very good article describing some of the challenges that data scientists face today. It’s not all rosy out there; there are a lot of challenges and issues. This is definitely worth a read. I wonder how many of you are seeing these in your current jobs?

What is BERT and how does it Work?

What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is Google’s deep learning algorithm for NLP (natural language processing). It helps computers and machines understand the language as we humans do. Put simply, BERT may help Google better understand the meaning of words in search queries.

Understanding Customer Attrition Using Categorical Features in Python

If you’ve ever had to work on customer churn, you know it can be tough. This article describes how to use categorical features to understand customer churn. It’s really good and worth the effort to read and follow.

Comparing Machine Learning Methods

When working with data and modeling, it’s sometimes hard to determine which model you should use for a particular modeling project. A quick way to find an algorithm that might work better than others is to run through an algorithm comparison loop to see how various models perform against your data. In this post, I compare machine learning methods using a few different sklearn algorithms.
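To give a feel for what a comparison loop looks like, here is a minimal sketch. It is not the code from the post: the synthetic dataset from `make_classification`, the three models, and the 5-fold setup are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your own data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# Candidate models to compare; swap in whatever fits your problem.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
}

# Use the same folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=7)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())

# Print the models from best to worst mean accuracy.
for name, (mean, std) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: {mean:.3f} (+/- {std:.3f})")
```

Treat the ranking as a first-pass screen only. The default hyperparameters used here rarely show a model at its best, so a close second place may still win after tuning.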


10 AI trends to watch in 2020

What’s happening in artificial intelligence in the year ahead? Look for modeling at the edge, new attention to data governance, and continued talent wars, among key AI trends.

That’s it for this week’s Python Data Weekly Roundup. Subscribe to our newsletter to receive this weekly roundup in your email.



Python Data Weekly Roundup – Dec 27 2019

In this week’s Python Data Weekly Roundup:

Picks On AI Trends from Data Natives 2019

This article provides a good overview of the Data Natives 2019 – Europe meeting and the main trends being discussed for 2020 and beyond.  For example, topics such as “AI and its use in Healthcare” and “AI and Ethics” looked like good talks.

Ray for the Curious

An excellent review of “Ray”, a distributed computing system for Python. Ray is:

an open-source system for scaling Python applications from single machines to large clusters. Its design is driven by the unique needs of next-generation ML/AI systems, which face several unique challenges, including diverse computational patterns, management of distributed, evolving state, and the desire to address all those needs with minimal programming effort.

Develop an Intuition for Severely Skewed Class Distributions

As always, Jason Brownlee does a great job explaining how to begin building an intuition for identifying imbalanced and skewed class distributions, and how to handle those distributions. One of the most difficult things to do in data science and machine learning is to understand and manage data with different distributions. You can’t always apply a model to a dataset, because the distribution of that data may make the model invalid.


Scatter Plot of Binary Classification Dataset With A 1 to 10 Class Distribution – from here.
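To make the 1 to 10 case concrete, here is a small sketch of my own (not from the article) showing why plain accuracy can mislead on skewed classes. The labels are hypothetical, and the “model” simply predicts the majority class every time.

```python
from collections import Counter

# Hypothetical labels with a 1:10 minority-to-majority ratio.
labels = [1] * 100 + [0] * 1000

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A "model" that always predicts the majority class.
predictions = [majority_class] * len(labels)

# Fraction of all examples predicted correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Fraction of minority-class examples that were found.
minority_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / counts[1]

print(f"class counts: {dict(counts)}")
print(f"accuracy: {accuracy:.3f}, minority recall: {minority_recall:.3f}")
```

The do-nothing model scores roughly 91% accuracy while finding none of the minority class, which is why per-class metrics are worth checking on skewed data.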

Seven differences between academia and industry for building machine learning and deep learning models

It should be no surprise that academia and industry approach data science and machine learning differently. This article describes some of the differences, including accuracy, training vs. production, engineering focus (e.g., end-to-end pipeline development) and more.

Hidden Technical Debt in Machine Learning Systems  — PDF

A very good paper describing the challenges of technical debt with machine learning systems. The abstract:

Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.

Market Basket Analysis with Python and Pandas

A recent post I wrote describing how to perform market basket analysis using Python and pandas. I provide a walk-through of using MLxtend’s apriori function as well as a ‘roll your own’ approach to market basket analysis.
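For a taste of the ‘roll your own’ idea, here is a minimal sketch using only the standard library. It is not the code from the post and it does not use MLxtend or pandas. It only counts item pairs rather than running the full apriori algorithm, and the transactions are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, made up for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
n = len(transactions)

# How many baskets contain each item, and each (sorted) pair of items.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / n
    confidence = pair_counts[pair] / item_counts[antecedent]
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift


support, confidence, lift = rule_metrics("diapers", "beer")
print(f"diapers -> beer: support={support:.2f}, "
      f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two items show up together more often than they would if purchases were independent. With only five baskets, this shows the arithmetic and nothing more.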