Automatic Model Tuning (AMT)
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285731
- ❗ Hyperparameter tuning is far from trivial!
- 💡 ML Engineers often have to experiment (trial & error)
- What are the best hyperparameters? The best values for them? Depth of the NN? …
- How to avoid overfitting, getting stuck in local minima…
- Every hyperparameter to tune adds new combinations with all the others → the problem blows up quickly (scales exponentially); e.g. 4 hyperparameters with 10 candidate values each already means 10,000 combinations
- 🔧 AMT = Automation tool, greatly reduces hyperparameter tuning overhead
- Process
- You define the hyperparameters, the ranges you want to try & the metric you are optimizing for
- SageMaker spins up a "Hyperparameter Tuning Job" that trains as many combinations as you'll allow, spinning up training instances as needed (see the sketch after this list)
- The set of hyperparameters producing the best result can then be deployed as a model
- ‼️ AMT learns as it goes, doesn't have to try every possible combination!
- Can also do early stopping automatically
- Saves a lot of time & money!!!
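Not from the course, but as a rough illustration: a tuning job set up with the SageMaker Python SDK might look like the sketch below. `my_estimator`, `train_input`, `validation_input`, the metric name and the ranges are all hypothetical placeholders.

```python
# A minimal sketch, assuming an already-configured SageMaker Estimator and
# input channels (my_estimator, train_input, validation_input, the metric
# name and the ranges are hypothetical placeholders).
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=my_estimator,                  # any configured SageMaker Estimator
    objective_metric_name="validation:auc",  # the metric AMT optimizes for
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,          # total training jobs the tuning job may run
    max_parallel_jobs=2,  # training instances spun up concurrently
)

# Launches the Hyperparameter Tuning Job; each training job gets its own instance(s)
tuner.fit({"train": train_input, "validation": validation_input})

# The best combination found can then be deployed directly as a model
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```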
- Best Practices
- Don't optimize too many hyperparameters at once
- Focus on the important ones
- Keep each range as narrow as possible
- Don't explore crazy values if you have guidelines already
- Use logarithmic scales when appropriate (sketched after this list)
- Helps when a hyperparameter's range spans orders of magnitude, e.g. 0.0001 to 0.1
- Don't run too many training jobs concurrently
- Limits how well the process can learn as it goes
- Make sure training jobs that run across multiple instances report the correct objective metric at the end
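A minimal sketch of a log-scaled range in the SageMaker Python SDK; the parameter name is hypothetical, the bounds are just the example values above:

```python
# Hypothetical example: sampling a learning rate on a log scale so that
# 0.0001 and 0.1 get comparable coverage.
from sagemaker.tuner import ContinuousParameter

learning_rate_range = ContinuousParameter(
    0.0001,
    0.1,
    scaling_type="Logarithmic",  # sample uniformly in log space, not in raw values
)
```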
Hyperparameter Tuning Configurations in AMT
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285747
- Early Stopping – If set to `Auto`, stops individual training jobs early when they are not significantly improving the objective metric (configuration sketched after this list)
- 👍 Reduces compute time, avoids overfitting
- ⚠️ ONLY for algorithms that emit the objective metric after each epoch (most NNs do this)
- Warm Start – Uses one or more previous tuning jobs as a starting point, i.e. "remembers" which hyperparameter combinations to search next
- Allows stop/restart of tuning jobs (stopping can be beneficial when bottlenecked by resources)
- Two types (both shown in the sketch after this list):
- `IDENTICAL_DATA_AND_ALGORITHM` – same dataset and algorithm
- `TRANSFER_LEARNING` – we can use a new dataset, but still continue where we left off
- Resource Limits – Default limits for number of parallel tuning jobs, number of hyperparameters, number of training jobs per tuning job, etc.
- Limits are quite high by default
- Increasing a limit requires a quota-increase request to AWS Support (AWS set these defaults to prevent accidental overspending by people who don't really know what they're doing)
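A hedged sketch of the early-stopping and warm-start knobs via the SageMaker Python SDK; `my_estimator`, `my_ranges` and the parent tuning-job name are hypothetical placeholders:

```python
# Sketch of the configuration options above (placeholder names throughout).
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,  # or TRANSFER_LEARNING
    parents={"my-previous-tuning-job"},  # previous tuning job(s) to continue from
)

tuner = HyperparameterTuner(
    estimator=my_estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=my_ranges,
    early_stopping_type="Auto",    # let AMT stop underperforming training jobs
    warm_start_config=warm_start,  # pick up where the parent job left off
)
```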
Hyperparameter Tuning Approaches
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285747
- Grid search – Try every possible combination (brute force!)
- Limited to categorical parameters
- ❌ Scales very poorly (every parameter you add multiplies the number of combinations) → can get out of hand very quickly
- Random search – Chooses a random combination of hyperparameter values on each job
- 💡 Do you feel lucky, punk? :) Hope to hit the optimal configuration within not too many tries
- 👍 Important advantage: no dependence on prior runs → good support for parallel jobs
- Bayesian optimization – Treats tuning as a regression problem
- 👍 Learns from each run to converge on optimal values
- ❌ Runs must be sequential in order to learn from the previous ones → parallel jobs not well supported
- 💡 The learning sometimes reaches the objective sooner than many parallel jobs would (see the strategy sketch below)
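All three approaches map to a single `strategy` argument in the SageMaker Python SDK. A minimal sketch, with the same placeholder names as above:

```python
# The search strategy is one argument on the tuner; note that "Grid"
# only accepts categorical parameter ranges.
from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=my_estimator,  # placeholder estimator, as above
    objective_metric_name="validation:auc",
    hyperparameter_ranges=my_ranges,
    strategy="Bayesian",     # or "Random", "Grid"
)
```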