Notes for Introduction to Machine Learning in Production (MLOps Course 1) on Coursera / DeepLearning.AI
Week 1: Steps of an ML project
The ML project lifecycle
MLOps (Machine Learning Operations) comprises a set of tools and principles to support progress through the ML project lifecycle.
Scoping
- Decide to work on speech recognition for voice search
- Decide on key metrics:
- Accuracy, latency, throughput
- Estimate resources and timeline
Define data
- Is the data labeled consistently?
- How much silence before/after each clip?
- How to perform volume normalization?
Modeling
Deployment
Data drift and concept drift
Data drift: the input data has changed. The distribution of the variables is meaningfully different. As a result, the trained model is not relevant for this new data.
Concept drift occurs when the patterns the model learned no longer hold.
In contrast to data drift, the input distributions (such as user demographics or word frequencies) might even remain the same; instead, the relationships between the model inputs and outputs change.
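One practical way to detect data drift is to compare the distribution of an input feature in recent production data against the training data with a two-sample statistical test. Below is a minimal sketch using scipy's Kolmogorov-Smirnov test; the feature values and the 0.05 significance threshold are illustrative assumptions, not from the course.

```python
# Minimal data-drift check for one numeric feature:
# compare training-time values against recent production values.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_values, prod_values, alpha=0.05):
    """Flag drift if the two samples likely come from different distributions."""
    result = ks_2samp(train_values, prod_values)
    return result.pvalue < alpha, result.statistic, result.pvalue

# Illustrative data: production inputs have shifted upward.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)  # training-time feature values
prod = rng.normal(loc=0.5, scale=1.0, size=1000)   # recent production values

drifted, stat, p = detect_drift(train, prod)
print(f"drift={drifted}, KS statistic={stat:.3f}, p-value={p:.3g}")
```

Note that a test like this only flags a change in the inputs; concept drift (a changed x → y relationship) usually has to be caught by monitoring output metrics or ground-truth feedback.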
Software engineering issues
- Real-time or Batch
- Cloud vs. Edge/Browser
- Compute resources (CPU/GPU/memory)
- Latency, throughput (QPS)
- Logging
- Security and privacy
Common deployment cases
- New product/capability
- Automate/assist with manual task
- Replace previous ML system
Key ideas:
- Gradual ramp up with monitoring
- Rollback
Shadow mode
- ML system shadows the human and runs in parallel.
- ML system’s output is not used for any decisions during this phase.
Canary deployment
- Roll out to small fraction (say 5%) of traffic initially.
- Monitor system and ramp up traffic gradually.
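At its simplest, a canary rollout routes a small, adjustable fraction of requests to the new model while the rest go to the proven one. A minimal sketch; the model objects and the 5% fraction are placeholders:

```python
import random

CANARY_FRACTION = 0.05  # start small; ramp up while monitoring

def route_request(request, old_model, new_model):
    """Send ~5% of traffic to the new model, the rest to the stable one."""
    if random.random() < CANARY_FRACTION:
        return new_model.predict(request), "canary"
    return old_model.predict(request), "stable"
```

In practice the split often happens at the load balancer, and hashing on a user ID (rather than sampling per request) keeps each user's experience consistent during the ramp-up.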
Blue-green deployment
- Run the old (blue) and new (green) versions side by side; a router switches traffic from blue to green.
- Easy way to enable rollback: switch the router back to blue.
Degrees of automation
- Spectrum: human only → shadow mode → AI assistance → partial automation → full automation
Monitoring dashboard
- Brainstorm the things that could go wrong.
- Brainstorm a few statistics/metrics that will detect the problem.
- It is ok to use many metrics initially and gradually remove the ones you find not useful.
Examples of metrics to track
- Software metrics: memory, compute, latency, throughput, server load
- Input metrics: avg input length, avg input volume, num missing values, avg image brightness
- Output metrics: # times return "" (null), # times user redoes search
Just as ML modeling is iterative, so is deployment
Monitoring dashboard
- Set thresholds for alarms
- Adapt metrics and thresholds over time
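A simple way to implement such alarms is a periodic job that compares current metric values against configured limits. A minimal sketch; the metric names and thresholds are illustrative assumptions:

```python
# Hypothetical thresholds; adapt them over time as suggested above.
THRESHOLDS = {
    "server_load": 0.90,            # fraction of capacity
    "fraction_null_outputs": 0.02,  # e.g., empty transcriptions
    "fraction_missing_values": 0.05,
}

def check_alarms(current_metrics, thresholds=THRESHOLDS):
    """Return the names of all metrics that exceed their thresholds."""
    return [name for name, limit in thresholds.items()
            if current_metrics.get(name, 0.0) > limit]

print(check_alarms({"server_load": 0.95, "fraction_null_outputs": 0.01}))
# -> ['server_load']
```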
Model maintenance
- Manual retraining
- Automatic retraining
Metrics to monitor
- Monitor
- Software metrics
- Input metrics
- Output metrics
- How quickly do they change?
- User data generally has slower drift.
- Enterprise data (B2B applications) can shift fast.
Code Link: GitHub Repo
Reading Material Week 1:
- Machine Learning in Production: Why You Should Care About Data and Concept Drift
- Monitoring Machine Learning Models in Production
- A Chat with Andrew on MLOps: From Model-centric to Data-centric AI
Week 2: Select and Train a Model
Selecting and Training a Model
Model development is an iterative process
Challenges in model development
- Doing well on training set (usually measured by average training error)
- Doing well on dev/test sets.
- Doing well on business metrics/project goals.
Performance on disproportionately important examples
Web search example
- Informational and transactional queries
- Navigational queries
Performance on key slices of the dataset
- Example: ML for loan approval: make sure not to discriminate by ethnicity, gender, location, language, or other protected attributes.
- Example: Product recommendations from retailers: Be careful to treat fairly all major user, retailer, and product categories.
Rare classes
- Skewed data distribution
- Accuracy in rare classes
Unfortunate conversation in many companies
- I did well on the test set
- But this doesn’t work for my application
Establishing a baseline level of performance
Unstructured and structured data
- Unstructured data: Image, Audio, Text (HLP is important)
- Structured data: a data frame
Ways to establish a baseline
- Human level performance
- Literature search for state-of-the-art/open source
- Quick-and-dirty implementation
- Performance of older system
A baseline helps to indicate what might be possible. In some cases (such as HLP), it also gives a sense of what the irreducible error/Bayes error is.
Getting started on modeling
- Literature search to see what’s possible (courses, blogs, open-source projects).
- Find open-source implementations if available.
- A reasonable algorithm with good data will often outperform a great algorithm with not-so-good data.
Deployment constraints when picking a model
Should you take into account deployment constraints when picking a model?
- Yes, if baseline is already established and goal is to build and deploy.
- No (or not necessarily), if purpose is to establish a baseline and determine what is possible and might be worth pursuing.
Sanity-check for code and algorithm
- Try to overfit a small training dataset before training on a large one.
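A concrete version of this check: train on a handful of examples and confirm the model reaches (near-)perfect training accuracy; if it cannot, suspect a bug in the code or the labels. A minimal sketch with scikit-learn and placeholder data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A tiny slice of (placeholder) training data: 10 examples, 5 features.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(10, 5))
y_small = rng.integers(0, 2, size=10)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_small, y_small)

train_acc = model.score(X_small, y_small)
assert train_acc > 0.99, f"Cannot overfit 10 examples (acc={train_acc}); check code/labels"
print(f"Sanity check passed: train accuracy = {train_acc:.2f}")
```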
Error analysis and performance auditing
Error analysis example
Speech recognition example
Iterative process of error analysis
Useful metrics for each tag
- What fraction of errors has that tag?
- Of all data with that tag, what fraction is misclassified?
- What fraction of all the data has that tag?
- How much room for improvement is there on data with that tag?
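Once each example is annotated with a tag and with whether the model got it right, these fractions are one groupby away. A minimal pandas sketch with illustrative tags (real error analysis may attach several tags per example):

```python
import pandas as pd

# One row per example: its tag and whether the model was correct.
df = pd.DataFrame({
    "tag":     ["car_noise", "car_noise", "people_noise", "clean", "clean", "clean"],
    "correct": [False,        True,        False,          True,    True,    False],
})

total_errors = (~df["correct"]).sum()
for tag, group in df.groupby("tag"):
    errors_with_tag = (~group["correct"]).sum()
    print(f"{tag}:")
    print(f"  fraction of all errors with this tag:  {errors_with_tag / total_errors:.2f}")
    print(f"  fraction of tagged data misclassified: {errors_with_tag / len(group):.2f}")
    print(f"  fraction of all data with this tag:    {len(group) / len(df):.2f}")
```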
Prioritizing what to work on
Decide on most important categories to work on based on:
- How much room for improvement there is
- How frequently that category appears
- How easy it is to improve accuracy in that category
- How important it is to improve in that category
Adding/improving data for specific categories
For categories you want to prioritize:
- Collect more data
- Use data augmentation to get more data
- Improve label accuracy/data quality
Skewed datasets
Confusion matrix: Precision and Recall
What happens with print("0")? A program that always predicts 0 can reach, say, 99% accuracy on a dataset that is 99% negative, so accuracy alone is misleading on skewed data.
Combining precision and recall — F1 score
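For reference, the standard definitions (with TP, FP, FN the true-positive, false-positive, and false-negative counts); F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}
```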
Multi-class metrics
Performance auditing
Auditing framework
Check for accuracy, fairness/bias, and other problems.
- Brainstorm the ways the system might go wrong.
- Performance on subsets of data (e.g., ethnicity, gender).
- How common are certain errors (e.g., FP, FN).
- Performance on rare classes.
- Establish metrics to assess performance against these issues on appropriate slices of data.
- Get business/product owner buy-in.
Speech recognition example
- Brainstorm the ways the system might go wrong.
- Accuracy on different genders and ethnicities.
- Accuracy on different devices.
- Prevalence of rude mis-transcriptions.
- Establish metrics to assess performance against these issues on appropriate slices of data.
- Mean accuracy for different genders and major accents.
- Mean accuracy on different devices.
- Check for prevalence of offensive words in the output.
Data iteration
Data-centric AI development
- Model-centric view: take the data you have, and develop a model that does as well as possible on it.
- Hold the data fixed and iteratively improve the code/model.
- Data-centric view: the quality of the data is paramount. Use tools to improve the data quality; this will allow multiple models to do well.
- Hold the code fixed and iteratively improve the data.
A useful picture of data augmentation
Data augmentation
Goal:
- Create realistic examples that (i) the algorithm does poorly on, but (ii) humans (or other baseline) do well on
Checklist:
- Does it sound realistic?
- Is the x → y mapping clear? (e.g., can humans recognize the speech?)
- Is the algorithm currently doing poorly on it?
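For speech, one augmentation that fits this checklist is mixing background noise into clean clips at a moderate level. A minimal numpy sketch; the noise source and the 0.1 scaling factor are illustrative choices:

```python
import numpy as np

def augment_with_noise(clean_audio, noise, noise_level=0.1):
    """Mix background noise into a clean clip to create a realistic new example."""
    noise = np.resize(noise, clean_audio.shape)  # trim/tile noise to clip length
    return clean_audio + noise_level * noise

# Illustrative 1-second clips at 16 kHz (stand-ins for real recordings).
rng = np.random.default_rng(0)
clean = rng.normal(size=16000)
cafe_noise = rng.normal(size=16000)
augmented = augment_with_noise(clean, cafe_noise, noise_level=0.1)
```

Per the checklist, listen to the result: it should still sound realistic and remain transcribable by humans before it goes into the training set.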
Data iteration loop
Can adding data hurt performance?
For unstructured data problems, if:
- The model is large (low bias).
- The mapping x → y is clear (e.g., given only the input x, humans can make accurate predictions).
Then, adding data rarely hurts accuracy.
Adding features to structured data
Restaurant recommendation example:
- Vegans are frequently recommended restaurants with only meat options.
- Possible features to add?
- Is person vegan (based on past orders)?
- Does restaurant have vegan options (based on menu)?
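Features like these are usually derived by joining existing tables. A hypothetical pandas sketch (column names and the 0.5 threshold are assumptions for illustration):

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2],
    "past_vegan_orders": [9, 0],
    "total_orders": [10, 12],
})
restaurants = pd.DataFrame({
    "restaurant_id": [100, 200],
    "menu_items": [["tofu bowl", "salad"], ["steak", "burger"]],
})

VEGAN_KEYWORDS = {"tofu", "salad", "vegan"}

# Derived features suggested by error analysis.
users["likely_vegan"] = users["past_vegan_orders"] / users["total_orders"] > 0.5
restaurants["has_vegan_options"] = restaurants["menu_items"].apply(
    lambda items: any(word in item for item in items for word in VEGAN_KEYWORDS)
)
```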
Other food delivery examples
- Only tea/coffee and only pizza
- What are the added features that can help make a decision?
- Product recommendation:
Collaborative filtering → content-based filtering (to handle the cold-start problem)
Data iteration
- Error analysis can be harder if there is no good baseline (such as HLP) to compare to.
- Error analysis, user feedback and benchmarking to competitors can all provide inspiration for features to add.
Experiment tracking
- What to track?
- Algorithm/code versioning
- Dataset used
- Hyperparameters
- Results
- Tracking tools
- Text files
- Spreadsheet
- Experiment tracking system
- Desirable features
- Information needed to replicate results
- Experiment results, ideally with summary metrics/analysis
- Perhaps also: resource monitoring, visualization, model error analysis
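Even a homegrown tracker that appends one JSON record per run covers the basics; dedicated systems (e.g., MLflow, Weights & Biases) add the rest. A minimal sketch with illustrative field names; it assumes the code lives in a git repo:

```python
import json
import subprocess
import time

def log_experiment(hyperparams, dataset_id, results, path="experiments.jsonl"):
    """Append one record per run with enough information to replicate it."""
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
    record = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "git_commit": commit,       # code/algorithm version
        "dataset": dataset_id,      # which data was used
        "hyperparams": hyperparams,
        "results": results,         # summary metrics
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_experiment({"lr": 1e-3, "epochs": 10}, "speech_v2", {"dev_wer": 0.18})
```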
From big data to good data
Try to ensure consistently high-quality data in all phases of the ML project lifecycle.
Good data:
- Covers important cases (good coverage of inputs x)
- Is defined consistently (definition of labels y is unambiguous)
- Has timely feedback from production data (distribution covers data drift and concept drift)
- Is sized appropriately
Code Link: Google Colab
Reading Material Week 2:
Week 3: Data Definition and Baseline
Define Data and Establish Baseline
Data definition questions
- What is the input x?
- lighting? contrast? resolution?
- What features need to be included?
- What is the target label y?
- How can we ensure labelers give consistent labels?
Major types of data problems
Unstructured vs. structured data
- Unstructured data
- May or may not have a huge collection of unlabeled examples x.
- Humans can label more data.
- Data augmentation more likely to be helpful.
- Structured data
- May be more difficult to obtain more data.
- Human labeling may not be possible (with some exceptions)
Small data vs. big data
- Small data
- Clean labels are critical
- Can manually look through dataset and fix labels
- Can get all the labelers to talk to each other
- Big data
- Emphasis on data processes
Why is label consistency important?
Big data problems can have small data challenges too
Problems with a large dataset but where there’s a long tail or rare events in the input will have small data challenges too.
- Web search
- Self-driving cars
- Product recommendation systems
Improving label consistency
- Have multiple labelers label same example.
- When there is disagreement, have an MLE, a subject matter expert (SME), and/or labelers discuss the definition of y to reach agreement.
- If labelers believe that x doesn’t contain enough information, consider changing x.
- Iterate until it is hard to significantly increase agreement
Have a class/label to capture uncertainty
Small data vs. big data (unstructured data)
- Small data
- Usually small number of labelers
- Can ask labelers to discuss specific labels
- Big data
- Get to consistent definition with a small group.
- Then send labeling instructions to labelers.
- Can consider having multiple labelers label every example and using voting or consensus labels to increase accuracy.
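Majority voting over several labelers' annotations is straightforward to implement; the sketch below also reports the agreement rate, which is itself a useful consistency signal (the example labels are illustrative):

```python
from collections import Counter

def consensus_label(labels):
    """Return the majority label and the fraction of labelers who agreed."""
    (label, votes), = Counter(labels).most_common(1)
    return label, votes / len(labels)

label, agreement = consensus_label(["defect", "defect", "ok"])
print(label, agreement)  # defect 0.666...
```

Using an odd number of labelers avoids most ties.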
Why measure HLP?
Estimate Bayes error / irreducible error to help with error analysis and prioritization.
Other uses of HLP
- In academia, establish and beat a respectable benchmark to support publication.
- Business or product owner asks for 99% accuracy. HLP helps establish a more reasonable target.
- “Prove” the ML system is superior to humans doing the job and thus the business or product owner should adopt it. (Use with caution)
The problem with beating HLP as ‘proof’ of ML ‘superiority’
Raising HLP
When the ground truth label is externally defined, HLP gives an estimate for Bayes error / irreducible error.
But often the ground truth is just another human label.
- When the label y comes from a human label, HLP ≪ 100% may indicate ambiguous labeling instructions.
- Improving label consistency will raise HLP
- This makes it harder for ML to beat HLP. But the more consistent labels will raise ML performance, which is ultimately likely to benefit the actual application performance.
HLP on structured data
Structured data problems are less likely to involve human labelers, thus HLP is less frequently used.
Some exceptions:
- User ID merging: same person?
- Based on network traffic, is the computer hacked?
- Is the transaction fraudulent?
- Spam account? Bot?
- From GPS, what is the mode of transportation: on foot, bike, car, bus?
Label and Organize Data
How long should you spend obtaining data?
- Get into this iteration loop as quickly as possible.
- Instead of asking “how long would it take to obtain m examples?”, ask “how much data can we obtain in k days?”
- Exception: if you have worked on the problem before and from experience you know you need m examples.
Inventory data
Brainstorm list of data sources
Other factors: data quality, privacy, regulatory constraints
Labeling data
- Options: in-house vs. outsourced vs. crowdsourced
- Having MLEs label data is expensive, but doing this for just a few days is usually fine.
- Who is qualified to label?
- Speech recognition - any reasonably fluent speaker
- Factory inspection, medical image diagnosis - SME (subject matter expert)
- Recommender systems - maybe impossible to label well
- Don’t increase data by more than 10x at a time
Data pipeline example
POC and Production phases
- POC (proof-of-concept):
- Goal is to decide if the application is workable and worth deploying.
- Focus on getting the prototype to work
- It’s ok if data pre-processing is manual. But take extensive notes/comments
- Production phase:
- After project utility is established, use more sophisticated tools to make sure the data pipeline is replicable.
- E.g., TensorFlow Transform, Apache Beam, Airflow, …
Data pipeline example
Meta-data
- Examples:
- Manufacturing visual inspection: time, factory, line #, camera settings, phone model, inspector ID,…
- Speech recognition: device type, labeler ID, VAD (voice activity detection) model ID, …
- Useful for:
- Error analysis. Spotting unexpected effects.
- Keeping track of data provenance.
Balanced train/dev/test splits in small data problems
Visual inspection example: 100 examples, 30 positive (defective)
- Train/dev/test: 60%/20%/20%
- Random split: positive examples: 21/2/7 (35%/10%/35%) → dev set is not representative
- Want: 18/6/6 (30%/30%/30%) → balanced split
- No need to worry about this with large datasets - a random split will be representative
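With scikit-learn, the stratify argument of train_test_split yields exactly this kind of balanced split. A minimal sketch for the 100-example, 30-positive case above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)      # placeholder features
y = np.array([1] * 30 + [0] * 70)      # 30 positive (defective) examples

# Carve out the 60% train split, then halve the remainder into dev/test,
# keeping the positive rate at 30% in every split.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(y_train.sum(), y_dev.sum(), y_test.sum())  # 18 6 6
```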
Scoping
What is scoping?
Scoping process
Separating problem identification from solution identification
Feasibility: Is this project technically feasible?
- Use external benchmark (literature, other company, competitor)
Why use HLP to benchmark?
People are very good at unstructured data tasks.
Criteria: can a human, given the same data, perform the task?
Do we have features that are predictive?
- Given past purchases, predict future purchases ✅
- Given weather, predict shopping mall foot traffic ✅
- Given DNA info, predict heart disease ❓
- Given social media chatter, predict demand for a clothing style ❓
- Given history of stock’s price, predict future price of that stock ❌
History of project
Diligence on value
Ethical considerations
- Is this project creating net positive societal value?
- Is this project reasonably fair and free from bias?
- Have any ethical concerns been openly aired and debated?
Milestones & Resourcing
Key specifications:
- ML metrics (accuracy, precision/recall, etc.)
- Software metrics (latency, throughput, etc. given compute resources)
- Business metrics (revenue, etc.)
- Resources needed (data, personnel, help from other teams)
- Timeline
If unsure, consider benchmarking to other projects, or building a POC (Proof of Concept) first.
Colab: https://www.coursera.org/learn/introduction-to-machine-learning-in-production/ungradedLab/hnDmK/data-labeling/lab
Reading Material Week 3:
https://arxiv.org/pdf/1706.06969.pdf
Overall resources:
Katsiapis, K., Karmarkar, A., Altay, A., Zaks, A., Polyzotis, N., … Li, Z. (2020). Towards ML Engineering: A brief history of TensorFlow Extended (TFX). http://arxiv.org/abs/2010.02013
Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2020). Challenges in deploying machine learning: A survey of case studies. http://arxiv.org/abs/2011.09926