💡 For an intro to Docker and containers, check: Docker Fundamentals
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45286781
Structure in a training container
/opt/ml/
├── input/
│   ├── config/
│   │   └── hyperparameters.json
│   └── data/
│       └── <channel_name>/
│           └── <input data>
├── model/
├── code/
│   └── <script files>
└── output/
    └── failure
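
A minimal sketch of how a training script could use this layout: read the hyperparameters, list the files in each channel, write a model under model/, and record any error in the failure file under output/. The `run_training` name, the `epochs` hyperparameter, and the `model.json` file are hypothetical placeholders, and `base_dir` is a parameter only so the sketch can be tried outside a container.

```python
import json
import traceback
from pathlib import Path


def run_training(base_dir="/opt/ml"):
    """Read inputs from the training layout and write a placeholder model."""
    base = Path(base_dir)
    try:
        # Hyperparameters arrive as a JSON object; values are passed as strings.
        hp_file = base / "input" / "config" / "hyperparameters.json"
        hyperparameters = json.loads(hp_file.read_text())

        # Each channel (train, validation, ...) is a sub-directory of input/data.
        data_dir = base / "input" / "data"
        channels = {
            channel.name: sorted(f.name for f in channel.iterdir())
            for channel in data_dir.iterdir()
            if channel.is_dir()
        }

        # Placeholder "training": a real script would fit a model here.
        # "epochs" is a hypothetical hyperparameter used for illustration.
        model = {"epochs": int(hyperparameters["epochs"]), "channels": channels}

        # Anything written under model/ is what gets packaged as the model.
        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "model.json").write_text(json.dumps(model))
        return model
    except Exception:
        # On error, describe the failure in output/failure and re-raise.
        output_dir = base / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(traceback.format_exc())
        raise
```

Inside a container the function would be called with no arguments, so that it reads from /opt/ml directly.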
<channel_name> → train, validation, test…
code → contains training code
failure (containing errors) goes under output (separate from the successful output, but under the same directory)

Structure in a deployment (inference) container
/opt/ml/
└── model/
    └── <model files>
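
A minimal sketch of how serving code might use this directory: load the model files once, then reuse them read-only for every prediction. The `ModelService` class, the `model.json` file, and its `weights` field are hypothetical; a real container would load whatever format the training step actually wrote.

```python
import json
from pathlib import Path


class ModelService:
    """Loads model files once and serves predictions without modifying them."""

    def __init__(self, model_dir="/opt/ml/model"):
        self.model_dir = Path(model_dir)
        self._model = None  # filled lazily on the first prediction

    def get_model(self):
        # Read the (hypothetical) model.json only once and cache the result.
        if self._model is None:
            model_file = self.model_dir / "model.json"
            self._model = json.loads(model_file.read_text())
        return self._model

    def predict(self, features):
        # Toy linear model: dot product of the stored weights and the input.
        weights = self.get_model()["weights"]
        return sum(w * x for w, x in zip(weights, features))
```

Caching the loaded model means each request pays only for the prediction itself, not for reading files from disk again.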
model → Contains inference code
model files that were updated during training are now included here (they're NOT updated during inference)

WORKDIR
nginx.conf → Configuration file for nginx frontend
predictor.py → Program that implements the Flask web server (makes inference predictions)
serve → Program that is executed when the container is started for hosting
train → Program invoked when the container is run for training
wsgi.py → Small wrapper used to invoke your Flask application for serving results.

FROM tensorflow/tensorflow:2.0.0a0
RUN pip3 install sagemaker-training
# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py
# Defines train.py as script entrypoint
ENV SAGEMAKER_PROGRAM=train.py
sagemaker-training = library for making containers compatible with SageMaker
Replaces sagemaker-containers, which is now deprecated!
SAGEMAKER_PROGRAM → defines the script entrypoint (the script is run from /opt/ml/code)
SAGEMAKER_TRAINING_MODULE & SAGEMAKER_SERVICE_MODULE
SM_MODEL_DIR → Local directory (/opt/ml/model) where the trained model is saved; SageMaker uploads its contents to S3 when training finishes
SM_CHANNELS / SM_CHANNEL_* → train, test, validation… channels
SM_HPS / SM_HP_* → hyperparameters
SM_USER_ARGS
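
A minimal sketch of how train.py could read these variables, assuming the JSON encoding used by sagemaker-training (SM_HPS as a JSON object, SM_CHANNELS and SM_USER_ARGS as JSON lists, SM_CHANNEL_<NAME> as a path). The `read_sagemaker_env` helper is a hypothetical name, and the `environ` parameter exists only so the function can be tried with a plain dict outside a container.

```python
import json
import os


def read_sagemaker_env(environ=os.environ):
    """Collect the SM_* settings a training script typically needs."""
    # SM_HPS holds all hyperparameters as one JSON object.
    hyperparameters = json.loads(environ.get("SM_HPS", "{}"))

    # SM_CHANNELS lists channel names; SM_CHANNEL_<NAME> gives each data path.
    channel_names = json.loads(environ.get("SM_CHANNELS", "[]"))
    channels = {
        name: environ.get(f"SM_CHANNEL_{name.upper()}")
        for name in channel_names
    }

    return {
        # Local directory the script should write the trained model to.
        "model_dir": environ.get("SM_MODEL_DIR", "/opt/ml/model"),
        "hyperparameters": hyperparameters,
        "channels": channels,
        # Command-line style arguments passed to the script, as a list.
        "user_args": json.loads(environ.get("SM_USER_ARGS", "[]")),
    }
```

A script would typically call `read_sagemaker_env()` once at startup and pass the resulting paths and hyperparameters to its training function.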