Automatic Model Tuning (AMT)
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285731
- ❗ Hyperparameter tuning is far from trivial!
- 💡 ML Engineers often have to experiment (trial & error)
- What are the best hyperparameters? The best values for them? Depth of the NN? …
- How to avoid overfitting, getting stuck in local minima…
- Every hyperparameter to tune adds new combinations with all the others → the problem blows up quickly (scales exponentially); e.g. 4 hyperparameters with 10 candidate values each already means 10,000 combinations
- 🔧 AMT = Automation tool, greatly reduces hyperparameter tuning overhead
- Process
- You define the hyperparameters, the ranges you want to try & the metric you are optimizing for
- SageMaker spins up a "Hyperparameter Tuning Job" that trains as many combinations as you'll allow, spinning up training instances as needed (see the sketch after this list)
- The set of hyperparameters producing the best result can then be deployed as a model
- ‼️ AMT learns as it goes, doesn't have to try every possible combination!
- Can also do early stopping automatically
- Saves a lot of time & money!!!
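Not from the course, but as a rough illustration: a tuning job set up with the SageMaker Python SDK might look like the sketch below. `my_estimator`, `train_input`, `validation_input`, the metric name and the ranges are all hypothetical placeholders.

```python
# A minimal sketch, assuming an already-configured SageMaker Estimator and
# input channels (my_estimator, train_input, validation_input, the metric
# name and the ranges are hypothetical placeholders).
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=my_estimator,                  # any configured SageMaker Estimator
    objective_metric_name="validation:auc",  # the metric AMT optimizes for
    objective_type="Maximize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=20,          # total training jobs the tuning job may run
    max_parallel_jobs=2,  # training instances spun up concurrently
)

# Launches the Hyperparameter Tuning Job; each training job gets its own instance(s)
tuner.fit({"train": train_input, "validation": validation_input})

# The best combination found can then be deployed directly as a model
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```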
- Best Practices
- Don't optimize too many hyperparameters at once
- Focus on the important ones
- Keep each range as narrow as possible
- Don't explore crazy values if you have guidelines already
- Use logarithmic scales when appropriate (sketched after this list)
- Helps when a hyperparameter's range spans orders of magnitude, e.g. 0.0001 to 0.1
- Don't run too many training jobs concurrently
- Limits how well the process can learn as it goes
- Make sure training jobs that run across multiple instances report the correct objective metric at the end
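A minimal sketch of a log-scaled range in the SageMaker Python SDK; the parameter name is hypothetical, the bounds are just the example values above:

```python
# Hypothetical example: sampling a learning rate on a log scale so that
# 0.0001 and 0.1 get comparable coverage.
from sagemaker.tuner import ContinuousParameter

learning_rate_range = ContinuousParameter(
    0.0001,
    0.1,
    scaling_type="Logarithmic",  # sample uniformly in log space, not in raw values
)
```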
Hyperparameter Tuning Configurations in AMT
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285747
- Early Stopping – If set to `Auto`, stops individual training jobs early when they are not significantly improving the objective metric (configuration sketched after this list)
- 👍 Reduces compute time, avoids overfitting
- ⚠️ ONLY for algorithms that emit the objective metric after each epoch (most NNs do this)
- Warm Start – Uses one or more previous tuning jobs as a starting point, i.e. "remembers" which hyperparameter combinations to search next
- Allows stop/restart of tuning jobs (stopping can be beneficial when bottlenecked by resources)
- Two types (both shown in the sketch after this list):
- `IDENTICAL_DATA_AND_ALGORITHM` – same dataset and algorithm
- `TRANSFER_LEARNING` – we can use a new dataset, but still continue where we left off
- Resource Limits – Default limits for number of parallel tuning jobs, number of hyperparameters, number of training jobs per tuning job, etc.
- Limits are quite high by default
- Increasing a limit requires a quota-increase request to AWS Support (AWS set these defaults to prevent accidental overspending by people who don't really know what they're doing)
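A hedged sketch of the early-stopping and warm-start knobs via the SageMaker Python SDK; `my_estimator`, `my_ranges` and the parent tuning-job name are hypothetical placeholders:

```python
# Sketch of the configuration options above (placeholder names throughout).
from sagemaker.tuner import HyperparameterTuner, WarmStartConfig, WarmStartTypes

warm_start = WarmStartConfig(
    warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM,  # or TRANSFER_LEARNING
    parents={"my-previous-tuning-job"},  # previous tuning job(s) to continue from
)

tuner = HyperparameterTuner(
    estimator=my_estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=my_ranges,
    early_stopping_type="Auto",    # let AMT stop underperforming training jobs
    warm_start_config=warm_start,  # pick up where the parent job left off
)
```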
Hyperparameter Tuning Approaches
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45285747
- Grid search – Try every possible combination (brute force!)
- Limited to categorical parameters
- ❌ Scales very poorly (every parameter you add multiplies the number of combinations) → can get out of hand very quickly
- Random search – Chooses a random combination of hyperparameter values on each job
- 💡 Do you feel lucky, punk? :) Hope to hit the optimal configuration within not too many tries
- 👍 Important advantage: no dependence on prior runs → good support for parallel jobs
- Bayesian optimization – Treats tuning as a regression problem
- 👍 Learns from each run to converge on optimal values
- ❌ Runs must be sequential in order to learn from the previous ones → parallel jobs not well supported
- 💡 The learning sometimes reaches the objective sooner than many parallel jobs would (see the strategy sketch below)
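All three approaches map to a single `strategy` argument in the SageMaker Python SDK. A minimal sketch, with the same placeholder names as above:

```python
# The search strategy is one argument on the tuner; note that "Grid"
# only accepts categorical parameter ranges.
from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=my_estimator,  # placeholder estimator, as above
    objective_metric_name="validation:auc",
    hyperparameter_ranges=my_ranges,
    strategy="Bayesian",     # or "Random", "Grid"
)
```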