SageMaker: Machine Learning Operations (MLOps)

SageMaker + Docker

💡 For an intro to Docker and containers, check: Docker Fundamentals

Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45286781

🔧 ALL SageMaker models are hosted in Docker containers (registered in ECR)
- Pre-built Deep Learning models
- Pre-built scikit-learn and Spark ML
- Deep Learning Containers (DLCs) = pre-built TensorFlow, PyTorch, MXNet, Chainer
  - ❗ Distributed training via Horovod framework or Parameter Servers
- Extensions/fine-tunings of pre-built image
- Your own training and/or inference code (hosted in a Docker image)
Containers are isolated and contain all dependencies & resources needed to run → You can use ANY script or algorithm in SageMaker, regardless of runtime or language!
- Refresher: Dockerfile → Docker image → Docker container (runs from image)
  - ECR: repository for registering Docker images
Architecture Diagram

Directory structure inside SageMaker Docker containers

Structure in a training container
```
/opt/ml/
    input/
        config/
            hyperparameters.json
        data/
            <channel_name>
                <input data>
    model/
    code/
        <script files>
    output/
        failure
```
- Channel names can be e.g. train, validation, test…
- code → contains training code
- ‼️ Inaccuracy in the diagram above! → failure (containing error details) goes under output: it is kept separate from the successful output, but lives in the same directory
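
A minimal sketch of how a training script might use this layout. The `train` channel, the `epochs` hyperparameter, and the `model.json` artifact are illustrative names, not from the course. The `base_dir` parameter is a hypothetical addition so the function can be tried outside a container, and the sketch assumes `failure` is written as a plain file under `output`.

```python
import json
import traceback
from pathlib import Path


def run_training(base_dir="/opt/ml"):
    """Read inputs from the SageMaker layout and write a toy model artifact."""
    base = Path(base_dir)
    try:
        # Hyperparameter values arrive as strings, so cast them explicitly.
        hp_file = base / "input" / "config" / "hyperparameters.json"
        hyperparameters = json.loads(hp_file.read_text())
        epochs = int(hyperparameters.get("epochs", "1"))

        # Each channel (train, validation, ...) is a directory of input files.
        train_dir = base / "input" / "data" / "train"
        n_files = sum(1 for p in train_dir.iterdir() if p.is_file())

        # Files written to model/ are what gets packaged as the trained model.
        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        summary = {"epochs": epochs, "training_files": n_files}
        (model_dir / "model.json").write_text(json.dumps(summary))
        return summary
    except Exception:
        # On error, record the traceback in output/failure and re-raise.
        out_dir = base / "output"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "failure").write_text(traceback.format_exc())
        raise
```

Inside a real container the script would call `run_training()` with the default path; a real model would replace the toy `summary` dictionary.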
Structure in a deployment (inference) container
```
/opt/ml/
    model/
        <model files>
```
- model → Contains the model artifacts that the inference code loads
  - These are the model files produced during training; inference only reads them (they're NOT updated during inference)

Docker Image Structure

WORKDIR
- nginx.conf → Configuration file for nginx frontend
  - ❗ We run a web server at deployment/inference time
- predictor.py → Program that implements Flask web server (makes inference predictions)
- serve/ → Program that is executed when container is started for hosting
  - A file inside launches the Gunicorn server, which runs multiple instances of a Flask application defined in predictor.py
- train/ → Program invoked when container is run for training
  - Structure already covered in training container structure
- wsgi.py → Small wrapper used to invoke your Flask application for serving results.
💡 You can have separate images for training and deployment, or you can combine them into one image. Structure above is for the combination of both.
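
To make the serving pieces more concrete, here is a rough sketch of the request handling that predictor.py is responsible for. It assumes the standard SageMaker hosting contract (GET /ping for health checks, POST /invocations for predictions). To stay dependency-free it uses a plain WSGI callable instead of Flask, so it is a stand-in and not the course's implementation. The `model.json` artifact, the `weight` field, and the `inputs`/`predictions` payload keys are hypothetical.

```python
import json
from pathlib import Path

MODEL_PATH = Path("/opt/ml/model/model.json")  # hypothetical artifact name


def load_model(path=MODEL_PATH):
    """Return the stored model parameters, or None if the artifact is missing."""
    return json.loads(path.read_text()) if path.exists() else None


def app(environ, start_response, model_loader=load_model):
    """WSGI entry point answering the two routes SageMaker calls."""
    method = environ["REQUEST_METHOD"]
    route = environ.get("PATH_INFO", "")
    # Reloaded on every request for simplicity; a real server would cache it.
    model = model_loader()

    if method == "GET" and route == "/ping":
        # Health check: report healthy only if the model can be loaded.
        status = "200 OK" if model is not None else "404 Not Found"
        body = ""
    elif method == "POST" and route == "/invocations":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        # Toy "prediction": scale each input by a stored weight.
        weight = (model or {}).get("weight", 1.0)
        predictions = [x * weight for x in payload.get("inputs", [])]
        status, body = "200 OK", json.dumps({"predictions": predictions})
    else:
        status, body = "404 Not Found", ""

    data = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]
```

Gunicorn (or any WSGI server) calls `app` with only `environ` and `start_response`; the extra `model_loader` argument exists purely so the handler can be exercised without a real model file.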

Assembling it all in a Dockerfile

```
FROM tensorflow/tensorflow:2.0.0a0
RUN pip3 install sagemaker-training

# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entrypoint
ENV SAGEMAKER_PROGRAM=train.py
```

sagemaker-training = library for making containers compatible with SageMaker
- ‼️ Replaces sagemaker-containers, which is now deprecated!
Environment variables:
- SAGEMAKER_PROGRAM → defines the script entrypoint (the script is run from /opt/ml/code)
  - ❗ Required! (all other env vars are optional)
- SAGEMAKER_TRAINING_MODULE & SAGEMAKER_SERVICE_MODULE
  - Point to the framework modules (TensorFlow, MXNet…) that the container loads for training and for serving, respectively
- SM_MODEL_DIR → Local directory (/opt/ml/model) where the script saves the trained model; SageMaker uploads its contents to S3 when training ends
- SM_CHANNELS / SM_CHANNEL_* → train, test, validation… channels
- SM_HPS / SM_HP_* → hyperparameters
  - Can be used by SageMaker Automatic Model Tuning (AMT)
- SM_USER_ARGS
- …and many more
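
As a small illustration of how a training script could consume these variables, the sketch below collects a few of them into a dictionary. It assumes `SM_HPS` and `SM_CHANNELS` hold JSON-encoded values, as set by sagemaker-training; the function name `read_sagemaker_env` and the fallback defaults are hypothetical.

```python
import json
import os


def read_sagemaker_env(env=os.environ):
    """Collect the SM_* settings a training script typically needs."""
    # SM_CHANNELS is a JSON list of channel names, e.g. ["train", "validation"].
    channels = json.loads(env.get("SM_CHANNELS", "[]"))
    return {
        # Local directory where the script should write the trained model.
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        # SM_HPS is a JSON object holding all hyperparameters.
        "hyperparameters": json.loads(env.get("SM_HPS", "{}")),
        # Each channel also gets its own SM_CHANNEL_<NAME> variable with its path.
        "channel_dirs": {
            name: env.get(f"SM_CHANNEL_{name.upper()}") for name in channels
        },
    }
```

Passing a plain dictionary instead of `os.environ` makes it easy to try the function outside a container.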
Example code snippet for using the Docker image (e.g. in a Notebook)
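
One way such a snippet could look, assuming the SageMaker Python SDK v2 `Estimator` class and an image that has already been pushed to ECR. The account ID, region, repository name, role ARN, bucket, and instance type are placeholders, and the helper names `ecr_image_uri` and `launch_training` are hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Build the URI of an image stored in Amazon ECR."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def launch_training(image_uri, role_arn, train_s3_uri):
    """Start a training job from a custom image (needs AWS credentials)."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri=image_uri,
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.large",    # placeholder instance type
        hyperparameters={"epochs": 5},  # reaches the container as hyperparameters
    )
    # The dict key becomes the channel name, i.e. /opt/ml/input/data/train.
    estimator.fit({"train": train_s3_uri})
    return estimator


# Example with placeholder values (not real resources):
# uri = ecr_image_uri("123456789012", "us-east-1", "my-training-image")
# launch_training(uri, "arn:aws:iam::123456789012:role/MySageMakerRole",
#                 "s3://my-bucket/train/")
```

Calling `launch_training` starts a billable training job, so the example call is left commented out.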