Using Nvidia MIG on Azure AKS #2

This post continues from the previous one.

Previous post: Using Nvidia MIG on Azure AKS #1

This time, we use MIG from Python inside a pod on AKS.

Prepare the following files.

1. kerastest_migcompat.py: Python code that uses the GPU

import os
# Set GPU environment before importing TensorFlow
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

# TensorFlow and keras
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.client import device_lib
print(tf.__version__)

# GPU available?
print(device_lib.list_local_devices())

# load the dataset 
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

train_images = train_images / 255.0
test_images = test_images / 255.0

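# define and train the model
# Note: Fashion-MNIST has only 10 classes, so 10 output units would be the
# conventional choice. The 4096-unit softmax still trains because labels 0-9
# are valid class indices; the oversized layers presumably just give the GPU
# more work to do.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(8192, activation=tf.nn.relu),
    keras.layers.Dense(4096, activation=tf.nn.softmax)
])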

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5)

 
2. Dockerfile: builds an image with the Nvidia tools, CUDA, and so on installed

# Use the NVIDIA CUDA 11.8 runtime base image (Python 3.11 is installed below)
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV TF_CPP_MIN_LOG_LEVEL=2

# Install Python 3.11 and system dependencies
RUN apt-get update && apt-get install -y \
    software-properties-common \
    wget \
    curl \
    git \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-distutils \
    python3-pip \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Set Python 3.11 as default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Upgrade pip
RUN python3.11 -m pip install --upgrade pip setuptools wheel

# Create working directory
WORKDIR /app

# Copy requirements file first (for better Docker layer caching)
COPY requirements.txt /app/

# Install Python packages from requirements.txt
RUN python3.11 -m pip install --no-cache-dir -r requirements.txt

# Copy the Python script and entrypoint
COPY kerastest_migcompat.py /app/
COPY entrypoint.sh /app/

# Make entrypoint script executable
RUN chmod +x /app/entrypoint.sh

# Create a non-root user
RUN useradd -m -u 1000 mluser && \
    chown -R mluser:mluser /app

USER mluser

# Use entrypoint to keep container running
ENTRYPOINT ["/app/entrypoint.sh"]

 
3. entrypoint.sh: startup script that prints diagnostics and keeps the container alive

#!/bin/bash

echo "Container started successfully!"
echo "Python version:"
python --version

echo ""
echo "TensorFlow GPU check:"
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('GPU devices:', tf.config.list_physical_devices('GPU'))"

echo ""
echo "NVIDIA GPU information:"
nvidia-smi -L

echo ""
echo "Container is ready for interactive use."
echo ""
echo "Keeping container alive..."

# Keep the container running
tail -f /dev/null
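
The nvidia-smi -L call in the entrypoint lists the GPU and any MIG devices visible to the container. As a rough sketch, that listing could also be parsed from Python. The line format assumed here ("MIG <profile> Device <n>: (UUID: ...)") is what nvidia-smi typically prints; the sample text with placeholder UUIDs and the helper name parse_mig_devices are illustrative assumptions, not output captured from this cluster.

```python
import re

# Illustrative sample only: the UUIDs are placeholders, not real output.
SAMPLE = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 1g.10gb     Device  0: (UUID: MIG-11111111-1111-1111-1111-111111111111)
"""

# Assumed format of a MIG line in `nvidia-smi -L` output.
MIG_LINE = re.compile(
    r"^\s*MIG\s+(?P<profile>\S+)\s+Device\s+(?P<index>\d+):\s+"
    r"\(UUID:\s*(?P<uuid>[^)]+)\)"
)


def parse_mig_devices(text):
    """Return one dict per MIG device line found in the given text."""
    devices = []
    for line in text.splitlines():
        match = MIG_LINE.match(line)
        if match:
            devices.append({
                "profile": match["profile"],
                "index": int(match["index"]),
                "uuid": match["uuid"],
            })
    return devices


if __name__ == "__main__":
    for device in parse_mig_devices(SAMPLE):
        print(device)
```

Inside the pod, the text could come from subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout instead of the sample string.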

 

4. requirements.txt: Python packages installed into the image

tensorflow[and-cuda]
keras
numpy
matplotlib
scipy

 

5. keras-mig-deployment.yaml: YAML for deploying the Pod

apiVersion: v1
kind: Pod
metadata:
  name: keras-mig-test
  labels:
    app: keras-mig-test
spec:
  restartPolicy: Never
  containers:
  - name: keras-container
    image: orenoacr.azurecr.io/keras-mig-test:latest #Replace with your registry
    imagePullPolicy: Always
    resources:
      limits:
        "nvidia.com/mig-1g.10gb": 1
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 2Gi
  nodeSelector:
    node.kubernetes.io/instance-type: standard_nc24ads_a100_v4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
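
The key line in the manifest above is the resource limit: the Pod asks for one nvidia.com/mig-1g.10gb device rather than a whole GPU. To try a different MIG profile, the manifest could be generated from Python, as in the sketch below. The function name build_mig_pod and its parameters are hypothetical, the volume and nodeSelector sections are left out for brevity, and which profiles can actually be requested depends on how the node pool was set up.

```python
import json


def build_mig_pod(name, image, mig_profile="1g.10gb", count=1):
    """Build a Pod manifest (as a dict) that requests MIG devices.

    Hypothetical helper: only the fields needed to show the MIG resource
    request are included; /dev/shm volume and nodeSelector are omitted.
    """
    resource = f"nvidia.com/mig-{mig_profile}"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "keras-container",
                "image": image,
                "imagePullPolicy": "Always",
                # The MIG slice is requested like any other extended resource.
                "resources": {"limits": {resource: count}},
            }],
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }


if __name__ == "__main__":
    pod = build_mig_pod(
        "keras-mig-test", "orenoacr.azurecr.io/keras-mig-test:latest"
    )
    # kubectl apply -f also accepts JSON, so this output can be saved as-is.
    print(json.dumps(pod, indent=2))
```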

 

Now build and deploy.
First, log in to ACR.

az acr login --name orenoacr

Next, build the Docker image, push it to the registry, and deploy the Pod.

# Build Docker image
docker build -t keras-mig-test:latest .

# Tag for your registry
docker tag keras-mig-test:latest orenoacr.azurecr.io/keras-mig-test:latest

# Push to registry
docker push orenoacr.azurecr.io/keras-mig-test:latest

# Deploy to Kubernetes
kubectl apply -f keras-mig-deployment.yaml
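
If these steps get repeated often, they could be wrapped in a small Python script. The sketch below only assembles the same commands shown above and prints them in dry-run mode by default. The function names deploy_commands and run_all are hypothetical, and the registry name is the one used in this post.

```python
import shlex
import subprocess


def deploy_commands(registry, image="keras-mig-test", tag="latest",
                    manifest="keras-mig-deployment.yaml"):
    """Return the login, build, tag, push and apply commands as argument lists."""
    local = f"{image}:{tag}"
    remote = f"{registry}.azurecr.io/{local}"
    return [
        ["az", "acr", "login", "--name", registry],
        ["docker", "build", "-t", local, "."],
        ["docker", "tag", local, remote],
        ["docker", "push", remote],
        ["kubectl", "apply", "-f", manifest],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("+", shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(deploy_commands("orenoacr"))
```

Calling run_all(deploy_commands("orenoacr"), dry_run=False) would actually execute the steps, stopping at the first failure.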

Check the Pod logs.

kubectl logs keras-mig-test

Then run the Python code inside the Pod.

kubectl exec -it keras-mig-test -- python /app/kerastest_migcompat.py

TensorFlow detected the MIG instance and the training ran on the GPU.

The output looks like this.

Created device /device:GPU:0 with 8098 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8579157921897532563
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 8491368448
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7748880231676912385
physical_device_desc: "device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0"
xla_global_id: 416903419
]
/usr/local/lib/python3.11/dist-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
I0000 00:00:1749629166.172803     568 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8098 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0
Epoch 1/5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1749629167.419059     672 service.cc:152] XLA service 0x7fee38009730 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1749629167.419087     672 service.cc:160]   StreamExecutor device (0): NVIDIA A100 80GB PCIe MIG 1g.10gb, Compute Capability 8.0
I0000 00:00:1749629167.495002     672 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1749629170.363931     672 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 20s 9ms/step - accuracy: 0.7815 - loss: 0.6723
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8616 - loss: 0.3737
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8800 - loss: 0.3215
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8873 - loss: 0.3013
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8953 - loss: 0.2805
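
A few of the numbers in this log can be cross-checked with simple arithmetic. The sketch below assumes the Keras default batch size of 32 (the script does not pass batch_size to model.fit) and the standard Fashion-MNIST training set of 60,000 images. The other values are copied from the log above.

```python
import math

# Values copied from the log above.
memory_limit_bytes = 8491368448   # memory_limit reported for /device:GPU:0
ms_per_step = 9                   # "9ms/step" in the progress lines

# Assumptions: Fashion-MNIST training set size and the Keras default batch size.
train_samples = 60000
batch_size = 32

# TensorFlow's "8098 MB" is the byte limit divided by 2**20.
memory_mib = memory_limit_bytes / 2**20

# One epoch is ceil(samples / batch size) steps.
steps_per_epoch = math.ceil(train_samples / batch_size)

# Rough epoch time implied by the per-step time.
epoch_seconds = steps_per_epoch * ms_per_step / 1000

print(f"GPU memory available to TensorFlow: {memory_mib:.0f} MiB")
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Approximate epoch time: {epoch_seconds:.1f} s")
```

So the 1g.10gb slice exposes roughly 8 GiB to TensorFlow, and the 1875 steps at about 9 ms each line up with the roughly 17 s per epoch shown in the log.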

The age of running GPUs flat out, efficiently, with MIG has arrived (whatever that means).