Automatic Model Tuning (AMT)
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285731
- ❗ Hyperparameter tuning is far from trivial!
- 💡 ML Engineers often have to experiment (trial & error)
- Which hyperparameters matter? What are the best values for them? How deep should a NN be? …
- How to avoid overfitting, getting stuck in local minima…?
- Every additional hyperparameter multiplies the number of combinations to try with the others → the problem blows up quickly (scales exponentially) as more hyperparameters are tuned
- 🔧 AMT = Automation tool, greatly reduces hyperparameter tuning overhead
- Process
- You define the hyperparameters to tune, the ranges of values to try, and the metric you are optimizing for (see the sketch at the end of this section)
- SageMaker spins up a “Hyperparameter Tuning Job” that trains as many combinations as you allow (spinning up training instances as needed)
- The set of hyperparameters producing the best results can then be deployed as a model
- ‼️ AMT learns as it goes, doesn't have to try every possible combination!
- Can also do early stopping automatically
- Saves a lot of time & money!!!
- Best Practices
- Don’t optimize too many hyperparameters at once
- Focus on the important ones
- Limit your ranges to as small a range as possible
- Don't explore crazy values if you have guidelines already
- Use logarithmic scales when appropriate
- Helps when a hyperparameter’s range spans orders of magnitude, e.g. 0.0001 to 0.1
- Don’t run too many training jobs concurrently
- Limits how well the process can learn as it goes
- Make sure training jobs that run on multiple instances report the correct (aggregated) objective metric at the end
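A minimal sketch of this process using the SageMaker Python SDK, assuming the built-in XGBoost algorithm; the image URI, IAM role, and S3 paths are placeholders. It defines ranges for two hyperparameters, puts the learning rate on a logarithmic scale, and keeps parallelism low so the tuner can learn as it goes:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Placeholder estimator; any built-in or custom training image works the same way
estimator = Estimator(
    image_uri="<xgboost-image-uri>",        # placeholder
    role="<execution-role-arn>",            # placeholder IAM role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output/",   # placeholder bucket
)

# Hyperparameters to tune, the ranges to try, and the metric to optimize
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    hyperparameter_ranges={
        # Logarithmic scale suits ranges spanning orders of magnitude
        "eta": ContinuousParameter(0.001, 0.3, scaling_type="Logarithmic"),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,            # total training jobs the tuning job may launch
    max_parallel_jobs=2,    # keep low so the (Bayesian) search can learn as it goes
)

tuner.fit({"train": "s3://my-bucket/train/",            # placeholder S3 paths
           "validation": "s3://my-bucket/validation/"})

# Deploy the model from the best-performing hyperparameter combination
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

`tuner.best_training_job()` returns the name of the winning job if you only want to inspect the best combination rather than deploy it.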
Hyperparameter Tuning Configurations in AMT
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285747
- Early Stopping → If set to “Auto”, stops individual training jobs early when they are unlikely to improve the objective metric significantly (see the sketch after this list)
- 👍 Reduces compute time, avoids overfitting
- ❗ ONLY for algorithms that emit objective metric after each epoch (most NNs do this)
- Warm Start → Uses one or more previous tuning jobs as a starting point, i.e. “remembers” which hyperparameter combinations to search next
- Allows stop/restart of tuning jobs (stopping can be beneficial when bottlenecked by resources)
- Two types:
  - IDENTICAL_DATA_AND_ALGORITHM → same dataset and algorithm
  - TRANSFER_LEARNING → we can use a new dataset, but still continue where we left off
- Resource Limits → Default limits for number of parallel tuning jobs, number of hyperparameters, number of training jobs per tuning job, etc.
- Limits are quite high by default
- Raising a limit requires a quota-increase support request (AWS set these limits to prevent accidental overspending by people who don’t really know what they’re doing)
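A sketch of how these options map onto the SageMaker Python SDK, reusing the estimator from the sketch above; the previous tuning job name and S3 paths are placeholders:

```python
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    WarmStartConfig,
    WarmStartTypes,
)

# Warm start from a finished tuning job ("previous-tuning-job-name" is a placeholder);
# use WarmStartTypes.TRANSFER_LEARNING instead when the dataset has changed
warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,
    parents={"previous-tuning-job-name"},
)

tuner = HyperparameterTuner(
    estimator=estimator,                  # estimator defined as in the earlier sketch
    objective_metric_name="validation:auc",
    hyperparameter_ranges={"eta": ContinuousParameter(0.001, 0.3)},
    max_jobs=20,
    max_parallel_jobs=2,
    early_stopping_type="Auto",           # default is "Off"; "Auto" lets SageMaker stop weak jobs early
    warm_start_config=warm_start,
)

tuner.fit({"train": "s3://my-bucket/train/",            # placeholder S3 paths
           "validation": "s3://my-bucket/validation/"})
```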
Hyperparameter Tuning Approaches
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285747
- Grid search → Try every possible combination (Brute force!)
- Limited to categorical parameters
- ❗ Scales very poorly, can get out of hand very quickly
- Random search → Chooses a random combination of hyperparameter values on each job
- 💡 Do you feel lucky, punk? :) The hope is to hit a near-optimal configuration within relatively few tries
- 👍 Important advantage: no dependence on prior runs → good support for parallel jobs
- Bayesian optimization → Treats tuning as a regression problem
- 👍 Learns from each run to converge on optimal values
- ❗ Runs must be sequential in order to learn from previous ones → parallel jobs not well supported
- 💡 That learning often reaches the objective with fewer total training jobs than many parallel (random) jobs would need (strategy selection is shown in the sketch below)
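The search approach is selected via the strategy parameter of the tuning job. A sketch assuming a recent SageMaker Python SDK version and the estimator from the first sketch; note that grid search only accepts categorical ranges:

```python
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Grid search: categorical ranges only; the grid size (3 x 3 = 9 here) determines
# how many training jobs run, so max_jobs is left unset
grid_tuner = HyperparameterTuner(
    estimator=estimator,                  # estimator defined as in the first sketch
    objective_metric_name="validation:auc",
    hyperparameter_ranges={
        "eta": CategoricalParameter([0.01, 0.1, 0.3]),
        "max_depth": CategoricalParameter([3, 6, 9]),
    },
    strategy="Grid",
)

# Random search has no run-to-run dependence, so it parallelizes freely;
# "Bayesian" (the default) learns from previous runs and prefers fewer parallel jobs
random_tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={"eta": ContinuousParameter(0.001, 0.3)},
    strategy="Random",
    max_jobs=30,
    max_parallel_jobs=10,
)
```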