Using MIG on an NVIDIA H100 GPU

These are my notes from enabling MIG on a RHEL 8 machine with an NVIDIA H100 GPU.

First, install the NVIDIA driver.

sudo dnf update -y
sudo dnf install kernel-devel kernel-headers gcc make -y
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf -y install nvidia-driver-latest-dkms

Add the following to .bash_profile.

export PATH=/usr/local/bin:$PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export TF_FORCE_GPU_ALLOW_GROWTH=true

Reboot once the installation has finished.

sudo reboot

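Next, install the CUDA packages from the NVIDIA repository. Note that cuda-drivers is the driver meta-package; the CUDA libraries that TensorFlow needs are installed later through pip (tensorflow[and-cuda]).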

sudo dnf install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-keyring-1.0-1.el8.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf install -y cuda-drivers

Then enable MIG mode.

# Enable persistence mode (nvidia-smi setting plus the nvidia-persistenced daemon)
sudo nvidia-smi -pm 1
sudo nvidia-persistenced --persistence-mode

# Enable MIG mode
sudo nvidia-smi -mig 1
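
To confirm that the setting took effect, the MIG mode can be read back through nvidia-smi's mig.mode.current and mig.mode.pending query fields. The following is a minimal sketch of how that could be scripted in Python. The helper names (parse_mig_modes, query_mig_modes) and the sample string are illustrative assumptions, not output captured on this machine.

```python
import subprocess

# Query fields understood by `nvidia-smi --query-gpu`.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,mig.mode.current,mig.mode.pending",
    "--format=csv,noheader",
]


def parse_mig_modes(csv_text):
    """Map GPU index -> (current, pending) MIG mode from csv,noheader output."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, current, pending = [field.strip() for field in line.split(",")]
        modes[int(index)] = (current, pending)
    return modes


def query_mig_modes():
    """Run nvidia-smi (requires the driver) and parse its answer."""
    out = subprocess.run(QUERY, check=True, capture_output=True, text=True).stdout
    return parse_mig_modes(out)


# Illustrative sample for a single GPU (assumed, not captured from the host).
sample = "0, Enabled, Enabled\n"
print(parse_mig_modes(sample))
```

On the GPU host, calling query_mig_modes() instead of parsing the sample would report the real state. If the current and pending values differ, the mode change has not been applied yet.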

Next, list the available MIG GPU instance profiles.

$ sudo nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |
|                                                             2     2     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |
|                                                             4     4     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.94gb        0     1/1        93.12      No     132    7     0   |
|                                                             8     7     1   |
+-----------------------------------------------------------------------------+
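
When choosing a profile it can help to have this table in machine-readable form. The sketch below parses rows shaped like the ones above and prints how much memory each profile yields when all of its instances are created. The Profile type, the regular expression, and the shortened SAMPLE (three rows copied from the table) are my own illustration; the real column layout may differ between driver versions.

```python
import re
from typing import NamedTuple


class Profile(NamedTuple):
    name: str
    profile_id: int
    max_instances: int
    memory_gib: float
    sm_count: int


# Matches only the first line of each profile entry (the one containing "MIG").
ROW = re.compile(
    r"\|\s+\d+\s+MIG\s+(\S+)\s+(\d+)\s+\d+/(\d+)\s+([\d.]+)\s+\S+"
    r"\s+(\d+)\s+\d+\s+\d+\s+\|"
)


def parse_profiles(table_text):
    """Return {profile name: Profile} for every profile row in the table."""
    profiles = {}
    for match in ROW.finditer(table_text):
        name, pid, total, mem, sms = match.groups()
        profiles[name] = Profile(name, int(pid), int(total), float(mem), int(sms))
    return profiles


# Three entries copied from the `nvidia-smi mig -lgip` output above.
SAMPLE = """\
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
|   0  MIG 7g.94gb        0     1/1        93.12      No     132    7     0   |
|                                                             8     7     1   |
"""

profiles = parse_profiles(SAMPLE)
for p in profiles.values():
    total = p.max_instances * p.memory_gib
    print(f"{p.name}: up to {p.max_instances} instances, {total:.2f} GiB in total")
```

The figures make the trade-off visible: seven 1g.12gb instances expose less memory in total than the single 7g.94gb instance.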

For now, create seven 1g.12gb GPU instances. The -C flag also creates the default compute instance inside each one.

$ sudo nvidia-smi mig -cgi 1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb -C
Successfully created GPU instance ID 13 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 13 using profile MIG 1g.12gb (ID  0)
Successfully created GPU instance ID 11 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 11 using profile MIG 1g.12gb (ID  0)
Successfully created GPU instance ID 12 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 12 using profile MIG 1g.12gb (ID  0)
Successfully created GPU instance ID  7 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  7 using profile MIG 1g.12gb (ID  0)
Successfully created GPU instance ID  8 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  8 using profile MIG 1g.12gb (ID  0)
Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID  9 using profile MIG 1g.12gb (ID  0)
Successfully created GPU instance ID 10 on GPU  0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID  0 on GPU  0 GPU instance ID 10 using profile MIG 1g.12gb (ID  0)
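
Typing the profile name seven times is error-prone, so the command line can also be generated. This is a small sketch that only builds and prints the command; build_create_command is a hypothetical helper name, and actually running the result still requires root on the GPU host.

```python
import shlex


def build_create_command(profile, count):
    """Return the argv list that creates `count` GPU instances of `profile`.

    The trailing -C asks nvidia-smi to create the default compute instance
    in each GPU instance, as in the command used above.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return ["sudo", "nvidia-smi", "mig", "-cgi", ",".join([profile] * count), "-C"]


argv = build_create_command("1g.12gb", 7)
print(shlex.join(argv))
```

The list form can be passed directly to subprocess.run(argv, check=True) if the script itself should create the instances.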

Check that they were created. All seven are there.

$ sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 1g.12gb         19        7          0:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19        8          1:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19        9          2:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       10          3:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       11          4:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       12          5:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       13          6:1     |
+-------------------------------------------------------+
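
To point a CUDA program at one of these instances you need its MIG UUID, which nvidia-smi -L prints for each device. The sketch below pulls the UUIDs out of such a listing. The layout of SAMPLE and the all-zero GPU UUID are assumptions made for illustration; the two MIG UUIDs are the first two that appear in the check-script output later in this post.

```python
import re

# MIG device UUIDs look like "MIG-" followed by an 8-4-4-4-12 hex UUID.
MIG_UUID = re.compile(r"MIG-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def extract_mig_uuids(listing):
    """Return MIG device UUIDs in the order they appear in the listing."""
    return MIG_UUID.findall(listing)


# Sample shaped like `nvidia-smi -L` output (layout assumed, GPU UUID is a
# placeholder).
SAMPLE = """\
GPU 0: NVIDIA H100 NVL (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 1g.12gb     Device  0: (UUID: MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf)
  MIG 1g.12gb     Device  1: (UUID: MIG-5bb83066-7217-5c42-b99e-fe75efe6d7e5)
"""

uuids = extract_mig_uuids(SAMPLE)
print(uuids[0])
```

Because the pattern only matches the MIG- prefix, the UUID of the parent GPU is ignored.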

The setup is now complete.
Next, verify it from Python.

Create a venv.

python3.11 -m venv ~/python311_tf_env

# Activate the virtual environment
source ~/python311_tf_env/bin/activate

# Upgrade pip first and install packages
pip install --upgrade pip
pip install tensorflow[and-cuda]
pip install matplotlib
pip install scikit-learn

First, a Python script (checkmig.py) that checks whether TensorFlow can see the MIG device.

import os
import tensorflow as tf

# Enable memory growth to avoid allocating all GPU memory at once
physical_devices = tf.config.list_physical_devices('GPU')
print("Physical devices:", physical_devices)

if physical_devices:
    try:
        for gpu in physical_devices:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Memory growth enabled")
    except Exception as e:
        print(f"Error setting memory growth: {e}")

# Print detailed GPU info
print("TensorFlow version:", tf.__version__)
print("CUDA visible devices:", os.environ.get('CUDA_VISIBLE_DEVICES', 'Not set'))

# Try a simple GPU operation
try:
    with tf.device('/GPU:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
        c = tf.matmul(a, b)
        print("Matrix multiplication result:", c.numpy())
        print("GPU operation successful!")
except Exception as e:
    print(f"GPU operation failed: {e}")

Output:

$ python checkmig.py
Physical devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Memory growth enabled
TensorFlow version: 2.19.0
CUDA visible devices: MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf
MIG-5bb83066-7217-5c42-b99e-fe75efe6d7e5
MIG-1f6a0e31-cc52-5909-b14e-3a742942bd35
MIG-56f1b757-9dfd-58cc-b886-8ebe2c442094
MIG-0da82664-471e-5a6b-9738-4a6acf986175
MIG-2e7ddef9-c2f3-5e0b-965f-f4024769c795
MIG-d3ec2ae4-8d3d-5317-920b-3b8219c3176d
Matrix multiplication result: [[19. 22.]
 [43. 50.]]
GPU operation successful!

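TensorFlow recognizes the MIG instance. It lists only one GPU even though seven UUIDs are visible, because a single CUDA process can use only one MIG compute instance.
Next, run a training script on the first MIG instance only.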

# TensorFlow and keras
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.client import device_lib
print(tf.__version__)

# GPU available?
print(device_lib.list_local_devices())

# load the dataset 
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

train_images = train_images / 255.0
test_images = test_images / 255.0

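# define and train the model
# Note: Fashion-MNIST has 10 classes, so the output layer would normally be
# Dense(10). The 4096-unit output below still trains because the labels 0-9
# are valid class indices.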
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(8192, activation=tf.nn.relu),
    keras.layers.Dense(4096, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5)

Output. Training runs on the MIG instance (the device list reports NVIDIA H100 NVL MIG 1g.12gb) at about 5 ms per step.

$ CUDA_VISIBLE_DEVICES=MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf python gputest.py
2.19.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7443360534658311321
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9541976064
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7883565882498998567
physical_device_desc: "device: 0, name: NVIDIA H100 NVL MIG 1g.12gb, pci bus id: 0001:00:00.0, compute capability: 9.0"
xla_global_id: 416903419
]
Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.7798 - loss: 0.6827
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8595 - loss: 0.3803
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8797 - loss: 0.3253
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8930 - loss: 0.2903
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8960 - loss: 0.2788
<keras.src.callbacks.history.History at 0x7f9cdc376350>
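
The run above pins one process to one instance by setting CUDA_VISIBLE_DEVICES on the command line. The same idea can be scripted to start one process per instance. This is a rough sketch under my own assumptions: env_for_mig and launch_on_each are hypothetical helpers, and I have not benchmarked running all seven instances at once.

```python
import os
import subprocess
import sys


def env_for_mig(mig_uuid, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set to one UUID."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_on_each(uuids, script):
    """Start one Python process per MIG UUID and return their exit codes."""
    procs = [
        subprocess.Popen([sys.executable, script], env=env_for_mig(uuid))
        for uuid in uuids
    ]
    return [proc.wait() for proc in procs]


# Demo that works without a GPU: show the variable each worker would receive.
uuids = [
    "MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf",
    "MIG-5bb83066-7217-5c42-b99e-fe75efe6d7e5",
]
for uuid in uuids:
    print(env_for_mig(uuid, base_env={})["CUDA_VISIBLE_DEVICES"])
```

On the GPU host, launch_on_each(uuids, "gputest.py") would run the training script once per listed instance, each process seeing only its own slice as /GPU:0.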

That's all.