Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45286781
Structure in a training container
train, validation, test, … → input data channels
code → contains training code
failure → file containing errors; goes under output (separate from the successful output, but under the same directory)

Structure in a deployment (inference) container
model → Contains the model files loaded by the inference code. The model files that were updated during training are now included here (they're NOT updated during inference).

WORKDIR
nginx.conf → Configuration file for the nginx front end
predictor.py → Program that implements the Flask web server (makes inference predictions)
serve/ → Program that is executed when the container is started for hosting
train/ → Program invoked when the container is run for training
wsgi.py → Small wrapper used to invoke your Flask application for serving results

FROM tensorflow/tensorflow:2.0.0a0
RUN pip3 install sagemaker-training
# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py
# Defines train.py as script entrypoint
ENV SAGEMAKER_PROGRAM train.py
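
As a rough illustration of what the train.py copied above might contain, here is a minimal sketch that ties it to the training-container layout from the start of these notes: it looks for the train channel and, if anything goes wrong, writes the traceback to a failure file under output. The function names (write_failure, run_training, main), the base_dir parameter, and the /opt/ml default are assumptions made for this example, not something taken from the course.

```python
import traceback
from pathlib import Path


def write_failure(message: str, base_dir: str = "/opt/ml") -> Path:
    """Write an error description to <base_dir>/output/failure."""
    output_dir = Path(base_dir) / "output"
    output_dir.mkdir(parents=True, exist_ok=True)
    failure_file = output_dir / "failure"
    failure_file.write_text(message)
    return failure_file


def run_training(base_dir: str = "/opt/ml") -> None:
    """Hypothetical placeholder for the real training logic."""
    # Assumed layout: each channel is a directory under input/data.
    train_dir = Path(base_dir) / "input" / "data" / "train"
    if not train_dir.is_dir():
        raise FileNotFoundError(f"missing training channel: {train_dir}")
    # ... load data from train_dir, fit the model, save it under <base_dir>/model ...


def main(base_dir: str = "/opt/ml") -> int:
    """Run training; on any error, record it in the failure file."""
    try:
        run_training(base_dir)
        return 0
    except Exception:
        write_failure(traceback.format_exc(), base_dir)
        return 1  # a non-zero exit code signals that training failed
```

In a real script the last line would typically be something like sys.exit(main()), so the container exits with a non-zero code when training fails. Passing base_dir explicitly makes it possible to try the script locally against a temporary directory instead of /opt/ml.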
sagemaker-training = library for making containers compatible with SageMaker. It replaces sagemaker-containers, which is now deprecated!
SAGEMAKER_PROGRAM → defines the script entrypoint (runs the script in /opt/ml/code)
SAGEMAKER_TRAINING_MODULE & SAGEMAKER_SERVICE_MODULE
SM_MODEL_DIR → Local directory (/opt/ml/model) where the trained model is saved; SageMaker uploads its contents to S3 when training ends
SM_CHANNELS / SM_CHANNEL_* → train, test, validation… channels
SM_HPS / SM_HP_* → hyperparameters
SM_USER_ARGS
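
To make the SM_* variables above more concrete, here is a minimal sketch of how a training script might read them. It assumes, based on my understanding of the sagemaker-training toolkit, that SM_CHANNELS, SM_HPS, and SM_USER_ARGS hold JSON strings and that each SM_CHANNEL_* variable holds the local path of one channel. The helper name read_sagemaker_env and the example values are hypothetical.

```python
import json
import os
from typing import Any, Dict, List, Mapping


def read_sagemaker_env(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Collect the SM_* settings a training script usually needs."""
    # SM_CHANNELS is assumed to be a JSON list of channel names.
    channels: List[str] = json.loads(env.get("SM_CHANNELS", "[]"))
    return {
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        # One SM_CHANNEL_<NAME> variable per channel, holding its local path.
        "channels": {
            name: env.get(f"SM_CHANNEL_{name.upper()}") for name in channels
        },
        # SM_HPS is assumed to be a JSON object of all hyperparameters.
        "hyperparameters": json.loads(env.get("SM_HPS", "{}")),
        # SM_USER_ARGS is assumed to be a JSON list of command-line arguments.
        "user_args": json.loads(env.get("SM_USER_ARGS", "[]")),
    }


# Example values only; inside a real training job SageMaker sets these.
example_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNELS": '["train", "validation"]',
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_VALIDATION": "/opt/ml/input/data/validation",
    "SM_HPS": '{"epochs": 10}',
    "SM_USER_ARGS": '["--epochs", "10"]',
}
config = read_sagemaker_env(example_env)
```

Taking the environment as a parameter, rather than reading os.environ directly everywhere, keeps the script easy to run outside SageMaker with made-up values like the ones in example_env.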