Continued from the previous post
Previous: Using Nvidia MIG on Azure AKS #1
This time, we use MIG from Python in a pod on AKS.
Prepare the following files.
1. kerastest_migcompat.py: Python code that uses the GPU
import os

# Set GPU environment before importing TensorFlow
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

# TensorFlow and keras
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.client import device_lib

print(tf.__version__)

# GPU available?
print(device_lib.list_local_devices())

# load the dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
train_images = train_images / 255.0
test_images = test_images / 255.0

# define and train the model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(8192, activation=tf.nn.relu),
    keras.layers.Dense(4096, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)
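As a side note, the "set the environment variables first, then import TensorFlow" order used above can be wrapped in a small helper. This is only a sketch: the function names `configure_gpu_env` and `list_gpus` are made up for this example, and it uses `tf.config.list_physical_devices('GPU')` (the same call as in entrypoint.sh) rather than `device_lib`. When TensorFlow is not installed, `list_gpus` just returns None.

```python
import importlib.util
import os


def configure_gpu_env(device_index="0"):
    """Set the GPU-related variables; call this before importing TensorFlow."""
    os.environ["CUDA_VISIBLE_DEVICES"] = device_index
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return {
        "CUDA_VISIBLE_DEVICES": os.environ["CUDA_VISIBLE_DEVICES"],
        "TF_FORCE_GPU_ALLOW_GROWTH": os.environ["TF_FORCE_GPU_ALLOW_GROWTH"],
    }


def list_gpus():
    """Return the names of the GPUs TensorFlow sees, or None without TensorFlow."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    # Imported here on purpose, so the environment is already configured.
    import tensorflow as tf
    return [device.name for device in tf.config.list_physical_devices("GPU")]
```

Calling `configure_gpu_env()` and then `list_gpus()` inside the pod should list one GPU if the MIG instance is visible; an empty list means TensorFlow fell back to the CPU.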
2. Dockerfile: builds an image with the Nvidia tools, CUDA, and so on installed
# Use NVIDIA CUDA 11.8 base image with Python 3.11
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
ENV TF_CPP_MIN_LOG_LEVEL=2

# Install Python 3.11 and system dependencies
RUN apt-get update && apt-get install -y \
    software-properties-common \
    wget \
    curl \
    git \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update && apt-get install -y \
    python3.11 \
    python3.11-dev \
    python3.11-distutils \
    python3-pip \
    libgl1-mesa-glx \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Set Python 3.11 as default
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 \
    && update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Upgrade pip
RUN python3.11 -m pip install --upgrade pip setuptools wheel

# Create working directory
WORKDIR /app

# Copy requirements file first (for better Docker layer caching)
COPY requirements.txt /app/

# Install Python packages from requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Python script and entrypoint
COPY kerastest_migcompat.py /app/
COPY entrypoint.sh /app/

# Make entrypoint script executable
RUN chmod +x /app/entrypoint.sh

# Create a non-root user
RUN useradd -m -u 1000 mluser && \
    chown -R mluser:mluser /app
USER mluser

# Use entrypoint to keep container running
ENTRYPOINT ["/app/entrypoint.sh"]
3. entrypoint.sh
#!/bin/bash
echo "Container started successfully!"
echo "Python version:"
python --version
echo ""
echo "TensorFlow GPU check:"
python -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__); print('GPU devices:', tf.config.list_physical_devices('GPU'))"
echo ""
echo "NVIDIA GPU information:"
nvidia-smi -L
echo ""
echo "Container is ready for interactive use."
echo ""
echo "Keeping container alive..."

# Keep the container running
tail -f /dev/null
4. requirements.txt
tensorflow[and-cuda]
keras
numpy
matplotlib
scipy
5. keras-mig-deployment.yaml: YAML for deploying the Pod
apiVersion: v1
kind: Pod
metadata:
  name: keras-mig-test
  labels:
    app: keras-mig-test
spec:
  restartPolicy: Never
  containers:
  - name: keras-container
    image: orenoacr.azurecr.io/keras-mig-test:latest  # Replace with your registry
    imagePullPolicy: Always
    resources:
      limits:
        "nvidia.com/mig-1g.10gb": 1
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 2Gi
  nodeSelector:
    node.kubernetes.io/instance-type: standard_nc24ads_a100_v4
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
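If you want to try other MIG profiles without editing the YAML by hand, the same Pod can be generated from Python. The following is a rough sketch, not part of the original setup: `build_pod_manifest` and its parameters are hypothetical names, and it assumes other profiles are exposed under the same `nvidia.com/mig-<profile>` naming pattern as the manifest above. It prints JSON, which kubectl accepts in place of YAML.

```python
import json


def build_pod_manifest(image, mig_profile="1g.10gb", name="keras-mig-test"):
    """Build the same Pod spec as keras-mig-deployment.yaml as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "keras-container",
                "image": image,
                "imagePullPolicy": "Always",
                # Request exactly one MIG instance of the given profile.
                "resources": {"limits": {f"nvidia.com/mig-{mig_profile}": 1}},
                "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
            }],
            "volumes": [{
                "name": "dshm",
                "emptyDir": {"medium": "Memory", "sizeLimit": "2Gi"},
            }],
            "nodeSelector": {
                "node.kubernetes.io/instance-type": "standard_nc24ads_a100_v4",
            },
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }


if __name__ == "__main__":
    manifest = build_pod_manifest("orenoacr.azurecr.io/keras-mig-test:latest")
    print(json.dumps(manifest, indent=2))
```

Saved as, say, make_pod.py (a made-up file name), the output can be piped straight in with `python make_pod.py | kubectl apply -f -`.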
Now build and deploy.
First, log in to ACR.
az acr login --name orenoacr
Next, build the Docker image, push it to the registry, and deploy the Pod.
# Build Docker image
docker build -t keras-mig-test:latest .

# Tag for your registry
docker tag keras-mig-test:latest orenoacr.azurecr.io/keras-mig-test:latest

# Push to registry
docker push orenoacr.azurecr.io/keras-mig-test:latest

# Deploy to Kubernetes
kubectl apply -f keras-mig-deployment.yaml
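The login, build, tag, push, and apply steps above always run in the same order, so they are easy to script. Here is a minimal sketch in Python that only wraps the commands already shown; the helper names `build_and_deploy_commands` and `run_all` are invented for this example, and it defaults to a dry run that prints the commands without executing them.

```python
import subprocess


def build_and_deploy_commands(registry, image, tag="latest",
                              manifest="keras-mig-deployment.yaml"):
    """Return the CLI steps from this post as argv lists, in order."""
    local_ref = f"{image}:{tag}"
    remote_ref = f"{registry}.azurecr.io/{local_ref}"
    return [
        ["az", "acr", "login", "--name", registry],
        ["docker", "build", "-t", local_ref, "."],
        ["docker", "tag", local_ref, remote_ref],
        ["docker", "push", remote_ref],
        ["kubectl", "apply", "-f", manifest],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(argv, check=True)
```

For example, `run_all(build_and_deploy_commands("orenoacr", "keras-mig-test"))` prints the five commands; pass `dry_run=False` to actually run them.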
Check the Pod logs.
kubectl logs keras-mig-test
Then run the Python code inside the Pod.
kubectl exec -it keras-mig-test -- python /app/kerastest_migcompat.py
TensorFlow detected the GPU's MIG instance and ran on the GPU.
The output looks like this.
Created device /device:GPU:0 with 8098 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 8579157921897532563
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 8491368448
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7748880231676912385
physical_device_desc: "device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0"
xla_global_id: 416903419
]
/usr/local/lib/python3.11/dist-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(**kwargs)
I0000 00:00:1749629166.172803 568 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8098 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, pci bus id: 0001:00:00.0, compute capability: 8.0
Epoch 1/5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1749629167.419059 672 service.cc:152] XLA service 0x7fee38009730 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1749629167.419087 672 service.cc:160] StreamExecutor device (0): NVIDIA A100 80GB PCIe MIG 1g.10gb, Compute Capability 8.0
I0000 00:00:1749629167.495002 672 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1749629170.363931 672 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 20s 9ms/step - accuracy: 0.7815 - loss: 0.6723
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8616 - loss: 0.3737
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8800 - loss: 0.3215
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8873 - loss: 0.3013
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 17s 9ms/step - accuracy: 0.8953 - loss: 0.2805
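To confirm from a script that the device really is a MIG slice, the `physical_device_desc` string in the output above can be picked apart. The snippet below is just a sketch using the standard library: the helper names are hypothetical, and the parsing assumes the description keeps the `key: value, key: value` shape seen in this log. It also shows that the reported `memory_limit` of 8491368448 bytes is the "8098 MB" in the "Created device" line.

```python
import re

# Copied from the physical_device_desc field in the output above.
SAMPLE_DESC = ("device: 0, name: NVIDIA A100 80GB PCIe MIG 1g.10gb, "
               "pci bus id: 0001:00:00.0, compute capability: 8.0")


def parse_device_desc(desc):
    """Split 'key: value, key: value' into a dict (format assumed from the log)."""
    fields = {}
    for part in desc.split(", "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


def mig_profile(device_name):
    """Return the MIG profile such as '1g.10gb', or None for a non-MIG name."""
    match = re.search(r"MIG (\d+g\.\d+gb)", device_name)
    return match.group(1) if match else None


def bytes_to_mib(num_bytes):
    """Convert bytes to whole MiB, the unit TensorFlow logs as 'MB'."""
    return num_bytes // (1024 * 1024)


if __name__ == "__main__":
    info = parse_device_desc(SAMPLE_DESC)
    print(info["name"], "->", mig_profile(info["name"]))
    print(bytes_to_mib(8491368448), "MiB")
```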
The era of running GPUs flat out, efficiently, with MIG has arrived (maybe).