Continued from the previous post.
Previous post: Using Nvidia MIG on Azure AKS #1
This time, we use MIG from Python inside a pod on AKS.
Prepare the following files.
1. kerastest_migcompat.py: Python code that uses the GPU
import os
# Set GPU environment before importing TensorFlow
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
# TensorFlow and keras
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.client import device_lib
print(tf.__version__)
# GPU available?
print(device_lib.list_local_devices())
# load the dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
train_images = train_images / 255.0
test_images = test_images / 255.0
# define and train the model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(8192, activation=tf.nn.relu),
    keras.layers.Dense(4096, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)
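To get a feel for how much work this model hands to the GPU, here is a rough back-of-the-envelope sketch in plain Python (no TensorFlow needed). It only counts what the script above defines; the helper `dense_params` is a name made up for this illustration, and the steps-per-epoch figure assumes the 60,000 Fashion-MNIST training images and Keras' default `batch_size` of 32. Note that Fashion-MNIST has only 10 classes, so the 4096-unit output layer is much wider than the labels strictly need.

```python
def dense_params(inputs: int, units: int) -> int:
    """Weights plus biases of one fully connected (Dense) layer."""
    return inputs * units + units


flatten_out = 28 * 28                     # Flatten turns each 28x28 image into 784 values
hidden = dense_params(flatten_out, 8192)  # first Dense layer
output = dense_params(8192, 4096)         # second Dense layer
total = hidden + output

print(f"hidden layer params: {hidden:,}")
print(f"output layer params: {output:,}")
print(f"total params:        {total:,}")
print(f"float32 weights:     {total * 4 / 2**20:.1f} MiB")

# Assumption: 60,000 training images and the default batch_size of 32.
steps_per_epoch = -(-60_000 // 32)  # ceiling division
print(f"steps per epoch:     {steps_per_epoch}")
```

The steps-per-epoch value is the number that should appear as the progress counter when `model.fit` runs with its defaults.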
2. Dockerfile: builds an image with the Nvidia tools, CUDA, and so on installed
# Use the NVIDIA CUDA 11.8 base image (Python 3.11 is installed below)
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV TF_CPP_MIN_LOG_LEVEL=2
# Install Python 3.11 and system dependencies
RUN apt-get update && apt-get install -y \
    software-properties-common \
    wget \
    curl \
    git \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-distutils \
    python3-pip \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
# Set Python 3.11 as default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1
# Upgrade pip
RUN python3.11 -m pip install --upgrade pip setuptools wheel
# Create working directory
WORKDIR /app
# Copy requirements file first (for better Docker layer caching)
COPY requirements.txt /app/
# Install Python packages from requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Copy the Python script and entrypoint
COPY kerastest_migcompat.py /app/
COPY entrypoint.sh /app/
# Make entrypoint script executable
RUN chmod +x /app/entrypoint.sh
# Create a non-root user
RUN useradd -m -u 1000 mluser && \
    chown -R mluser:mluser /app
USER mluser
# Use entrypoint to keep container running
ENTRYPOINT ["/app/entrypoint.sh"]
3. entrypoint.sh
#!/bin/bash
echo "Container started successfully!"
echo "Python version:"
python --version
echo ""
echo "TensorFlow GPU check:"
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('GPU devices:', tf.config.list_physical_devices('GPU'))"
echo ""
echo "NVIDIA GPU information:"
nvidia-smi -L
echo ""
echo "Container is ready for interactive use."
echo ""
echo "Keeping container alive..."
# Keep the container running
tail -f /dev/null
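The same `nvidia-smi -L` check can also be done from Python, which is handy if you want the script itself to bail out when no MIG device is visible. This is only a minimal sketch: `list_gpu_devices` is a hypothetical helper, and the filter assumes that a MIG slice shows up as a line containing the text "MIG". On a machine without `nvidia-smi` it simply returns an empty list.

```python
import shutil
import subprocess


def list_gpu_devices() -> list[str]:
    """Return the non-empty lines printed by `nvidia-smi -L`, or [] if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


devices = list_gpu_devices()
# Assumption: a MIG slice is listed on a line that mentions "MIG".
mig_devices = [d for d in devices if "MIG" in d]
print(f"{len(devices)} device line(s), {len(mig_devices)} mentioning MIG")
```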
4. requirements.txt
tensorflow[and-cuda]
keras
numpy
matplotlib
scipy
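None of these packages are pinned, so what actually lands in the image depends on the build date. A small sketch like the following, run inside the container, records the installed versions using only the standard library. `installed_versions` is a hypothetical helper name; packages that are not installed are reported as `None` rather than raising.

```python
from importlib import metadata

PACKAGES = ["tensorflow", "keras", "numpy", "matplotlib", "scipy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(PACKAGES)
for name, version in versions.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

The `name==version` lines it prints can be pasted back into requirements.txt if you want reproducible builds.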
5. keras-mig-deployment.yaml: YAML for deploying the Pod
apiVersion: v1
kind: Pod
metadata:
  name: keras-mig-test
  labels:
    app: keras-mig-test
spec:
  restartPolicy: Never
  containers:
  - name: keras-container
    image: orenoacr.azurecr.io/keras-mig-test:latest # Replace with your registry
    imagePullPolicy: Always
    resources:
      limits:
        "nvidia.com/mig-1g.10gb": 1
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 2Gi
  nodeSelector:
    node.kubernetes.io/instance-type: standard_nc24ads_a100_v4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
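If you want to try the same Pod against a different MIG profile, it can be easier to generate the manifest than to hand-edit it. The sketch below builds the same structure as a Python dict and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. `build_mig_pod` and its parameters are hypothetical names for this illustration, and any profile other than `1g.10gb` has to match a resource name that your nodes actually advertise.

```python
import json


def build_mig_pod(name, image, mig_profile="1g.10gb"):
    """Build a Pod manifest that requests one MIG slice of the given profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "keras-container",
                "image": image,
                "imagePullPolicy": "Always",
                # The MIG slice is requested as an extended resource limit.
                "resources": {"limits": {f"nvidia.com/mig-{mig_profile}": 1}},
                "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
            }],
            "volumes": [{
                "name": "dshm",
                "emptyDir": {"medium": "Memory", "sizeLimit": "2Gi"},
            }],
            "nodeSelector": {
                "node.kubernetes.io/instance-type": "standard_nc24ads_a100_v4",
            },
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }


pod = build_mig_pod("keras-mig-test", "orenoacr.azurecr.io/keras-mig-test:latest")
print(json.dumps(pod, indent=2))
```

Redirect the output to a file such as keras-mig-deployment.json and apply it the same way as the YAML version.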
Now build and deploy.
First, log in to ACR.
az acr login --name orenoacr
Next, build the Docker image, push it, and deploy.
# Build Docker image
docker build -t keras-mig-test:latest .
# Tag for your registry
docker tag keras-mig-test:latest orenoacr.azurecr.io/keras-mig-test:latest
# Push to registry
docker push orenoacr.azurecr.io/keras-mig-test:latest
# Deploy to Kubernetes
kubectl apply -f keras-mig-deployment.yaml
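The build, tag, push, and apply steps are easy to get out of sync when the registry or tag changes, so here is a small dry-run sketch that derives all four commands from one set of inputs. `release_commands` is a hypothetical helper; it only builds and prints the command lines and does not run anything.

```python
import shlex


def release_commands(registry, image, tag="latest",
                     manifest="keras-mig-deployment.yaml"):
    """Return the build/tag/push/apply commands as argv lists (nothing is executed)."""
    local = f"{image}:{tag}"
    remote = f"{registry}/{image}:{tag}"
    return [
        ["docker", "build", "-t", local, "."],
        ["docker", "tag", local, remote],
        ["docker", "push", remote],
        ["kubectl", "apply", "-f", manifest],
    ]


commands = release_commands("orenoacr.azurecr.io", "keras-mig-test")
for argv in commands:
    print(shlex.join(argv))
```

To actually execute the steps, each list can be passed to `subprocess.run(argv, check=True)` so that a failed build stops the sequence before the push.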
Check the Pod logs.
kubectl logs keras-mig-test
Then run the Python code inside the Pod.
kubectl exec -it keras-mig-test -- python /app/kerastest_migcompat.py
It detected the GPU's MIG instance and ran on the GPU.
The output looks like this.
Created device /device:GPU:0 with 8098 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8579157921897532563
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 8491368448
locality {
bus_id: 1
links {
}
}
incarnation: 7748880231676912385
physical_device_desc: "device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0"
xla_global_id: 416903419
]
/usr/local/lib/python3.11/dist-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
super().__init__(**kwargs)
I0000 00:00:1749629166.172803 568 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8098 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0
Epoch 1/5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1749629167.419059 672 service.cc:152] XLA service 0x7fee38009730 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1749629167.419087 672 service.cc:160] StreamExecutor device (0): NVIDIA A100 80GB PCIe MIG 1g.10gb, Compute Capability 8.0
I0000 00:00:1749629167.495002 672 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1749629170.363931 672 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 20s 9ms/step - accuracy: 0.7815 - loss: 0.6723
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8616 - loss: 0.3737
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8800 - loss: 0.3215
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8873 - loss: 0.3013
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8953 - loss: 0.2805
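The device description in the log above already tells you which MIG profile the Pod received. As a rough sketch, the string can be pulled apart with a few lines of Python. `parse_device_desc` is a hypothetical helper, and the regular expression assumes the profile appears in the device name as `MIG <n>g.<m>gb`, as it does in this log. The `memory_limit` value is also taken from the log.

```python
import re

# Copied from physical_device_desc in the log above.
DESC = ("device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, "
        "pci bus id: 0001:00:00.0, compute capability: 8.0")
MEMORY_LIMIT_BYTES = 8491368448  # memory_limit reported for /device:GPU:0


def parse_device_desc(desc):
    """Split a 'key: value, key: value' description and extract the MIG profile."""
    fields = {}
    for part in desc.split(", "):
        key, _, value = part.partition(": ")
        fields[key] = value
    # Assumption: the profile is embedded in the name as "MIG <n>g.<m>gb".
    match = re.search(r"MIG (\d+)g\.(\d+)gb", fields.get("name", ""))
    if match:
        fields["mig_compute_slices"] = int(match.group(1))
        fields["mig_memory_gb"] = int(match.group(2))
    return fields


info = parse_device_desc(DESC)
memory_mib = MEMORY_LIMIT_BYTES // 2**20
print(info["name"])
print("compute slices:", info.get("mig_compute_slices"))
print("profile memory (GB):", info.get("mig_memory_gb"))
print("TensorFlow memory limit (MiB):", memory_mib)
```

The memory limit works out to the same 8098 figure that TensorFlow prints in its "Created device" lines, a bit less than the nominal 10 GB in the profile name.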
The era of running GPUs hard and efficiently with MIG has arrived (or something).