Notes from enabling MIG (Multi-Instance GPU) on a RHEL 8 machine with an NVIDIA H100 GPU.
First, install the NVIDIA driver.
sudo dnf update -y
sudo dnf install kernel-devel kernel-headers gcc make -y
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf -y install nvidia-driver-latest-dkms
Add the following to .bash_profile.
export PATH=/usr/local/bin:$PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export TF_FORCE_GPU_ALLOW_GROWTH=true
Once the install finishes, reboot.
sudo reboot
Next, install CUDA.
sudo dnf install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-keyring-1.0-1.el8.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf install -y cuda-drivers
Then enable MIG.
# Enable persistence mode
sudo nvidia-smi -pm 1
# Enable MIG mode
sudo nvidia-persistenced --persistence-mode
sudo nvidia-smi -mig 1
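To double-check that MIG mode really switched on before creating any instances, something like the sketch below could be used. It assumes the installed driver's nvidia-smi supports the `mig.mode.current` field of `--query-gpu`; `parse_mig_modes` and `query_mig_modes` are hypothetical helper names, not part of any NVIDIA tool.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; mig.mode.current reports "Enabled" or "Disabled"
# (assumption: the installed driver supports this query field).
QUERY = "index,name,mig.mode.current"


def parse_mig_modes(csv_text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader` output into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, mode = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name, "mig_enabled": mode == "Enabled"})
    return gpus


def query_mig_modes():
    """Run nvidia-smi and return the parsed list, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    for gpu in query_mig_modes():
        state = "enabled" if gpu["mig_enabled"] else "disabled"
        print(f"GPU {gpu['index']} ({gpu['name']}): MIG {state}")
```

If a GPU still reports MIG as disabled after `nvidia-smi -mig 1`, the mode change may be pending until the GPU is reset or the machine is rebooted.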
Next, list the available MIG GPU instance profiles.
$ sudo nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |
|                                                             2     2     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |
|                                                             4     4     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.94gb        0     1/1        93.12      No    132     7     0   |
|                                                             8     7     1   |
+-----------------------------------------------------------------------------+
For now, create seven 1g.12gb GPU instances. The -C flag also creates a compute instance inside each one.
$ sudo nvidia-smi mig -cgi 1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb -C
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 10 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 10 using profile MIG 1g.12gb (ID 0)
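Typing the same profile name seven times is easy to get wrong, so the argument list could be generated instead. This is only a sketch: `build_cgi_command` is a hypothetical helper that assembles the same `nvidia-smi mig -cgi ... -C` command used above, and it prints the command without running it.

```python
import shlex


def build_cgi_command(profile, count):
    """Return the argv that creates `count` GPU instances of `profile`.

    The trailing -C asks nvidia-smi to also create the default compute
    instance inside each GPU instance, as in the command above.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return ["sudo", "nvidia-smi", "mig", "-cgi", ",".join([profile] * count), "-C"]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run to execute it.
    print(shlex.join(build_cgi_command("1g.12gb", 7)))
```

The count has to stay within the Free/Total limit that `nvidia-smi mig -lgip` reports for the chosen profile (7 for 1g.12gb on this H100).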
Check that they were created. All seven are there.
$ sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 1g.12gb         19        7          0:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19        8          1:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19        9          2:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       10          3:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       11          4:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       12          5:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       13          6:1     |
+-------------------------------------------------------+
That completes the setup.
Now verify it from Python.
Create a venv.
python3.11 -m venv ~/python311_tf_env
# Activate the virtual environment
source ~/python311_tf_env/bin/activate
# Upgrade pip first and install packages
pip install --upgrade pip
pip install tensorflow[and-cuda]
pip install matplotlib
pip install scikit-learn
First, a Python script (checkmig.py) that checks whether TensorFlow can see the MIG instance.
import os
import tensorflow as tf

# Enable memory growth to avoid allocating all GPU memory at once
physical_devices = tf.config.list_physical_devices('GPU')
print("Physical devices:", physical_devices)
if physical_devices:
    try:
        for gpu in physical_devices:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Memory growth enabled")
    except Exception as e:
        print(f"Error setting memory growth: {e}")

# Print detailed GPU info
print("TensorFlow version:", tf.__version__)
print("CUDA visible devices:", os.environ.get('CUDA_VISIBLE_DEVICES', 'Not set'))

# Try a simple GPU operation
try:
    with tf.device('/GPU:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
        c = tf.matmul(a, b)
    print("Matrix multiplication result:", c.numpy())
    print("GPU operation successful!")
except Exception as e:
    print(f"GPU operation failed: {e}")
Output:
$ python checkmig.py
Physical devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Memory growth enabled
TensorFlow version: 2.19.0
CUDA visible devices: MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf MIG-5bb83066-7217-5c42-b99e-fe75efe6d7e5 MIG-1f6a0e31-cc52-5909-b14e-3a742942bd35 MIG-56f1b757-9dfd-58cc-b886-8ebe2c442094 MIG-0da82664-471e-5a6b-9738-4a6acf986175 MIG-2e7ddef9-c2f3-5e0b-965f-f4024769c795 MIG-d3ec2ae4-8d3d-5317-920b-3b8219c3176d
Matrix multiplication result: [[19. 22.]
 [43. 50.]]
GPU operation successful!
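TensorFlow sees the MIG instance. Only GPU:0 is listed even though seven instances exist, because a single CUDA process can use just one MIG instance at a time.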
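The MIG-... UUIDs that go into CUDA_VISIBLE_DEVICES can be read from `nvidia-smi -L`. Here is a rough sketch for pulling them out in Python. The line format in the regex is an assumption based on typical driver output and may vary between driver versions, and `parse_mig_devices` is a hypothetical helper name.

```python
import re
import shutil
import subprocess

# Assumed line format in `nvidia-smi -L` output, e.g.:
#   MIG 1g.12gb     Device  0: (UUID: MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf)
MIG_LINE = re.compile(r"MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s+(MIG-[0-9a-fA-F-]+)\)")


def parse_mig_devices(listing):
    """Extract (profile, device_index, uuid) tuples from `nvidia-smi -L` text."""
    return [(m.group(1), int(m.group(2)), m.group(3)) for m in MIG_LINE.finditer(listing)]


if __name__ == "__main__":
    # Only call nvidia-smi when it is actually installed.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
        for profile, index, uuid in parse_mig_devices(result.stdout):
            print(f"{index}: {profile} {uuid}")
```

The parent GPU line carries a GPU-... UUID rather than a MIG-... one, so the pattern skips it and only the MIG devices are returned.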
Next, run a Python script (gputest.py) on the first MIG instance.
# TensorFlow and keras
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.client import device_lib
print(tf.__version__)
# GPU available?
print(device_lib.list_local_devices())
# load the dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
train_images = train_images / 255.0
test_images = test_images / 255.0
# define and train the model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(8192, activation=tf.nn.relu),
    # 4096 output units is far more than the 10 classes need; training still
    # works because the labels are 0-9, and the large layers give the GPU more work
    keras.layers.Dense(4096, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)
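Output. The device list shows the H100 MIG 1g.12gb instance, and each epoch takes only about 9 to 10 seconds because the training runs on the GPU.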
$ CUDA_VISIBLE_DEVICES=MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf python gputest.py
2.19.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7443360534658311321
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9541976064
locality {
bus_id: 1
links {
}
}
incarnation: 7883565882498998567
physical_device_desc: "device: 0, name: NVIDIA H100 NVL MIG 1g.12gb, pci bus id: 0001:00:00.0, compute capability: 9.0"
xla_global_id: 416903419
]
Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.7798 - loss: 0.6827
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8595 - loss: 0.3803
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8797 - loss: 0.3253
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8930 - loss: 0.2903
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8960 - loss: 0.2788
<keras.src.callbacks.history.History at 0x7f9cdc376350>
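The run above pins the script to one MIG instance through CUDA_VISIBLE_DEVICES. The same idea could be used to start one copy of a script per instance. This is just a sketch and was not part of the test above: `env_for_mig` and `launch_on_instances` are hypothetical helpers, and the UUIDs have to be supplied by the caller.

```python
import os
import subprocess
import sys


def env_for_mig(uuid, base_env=None):
    """Return a copy of the environment that exposes only one MIG instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = uuid
    return env


def launch_on_instances(script, uuids):
    """Start `script` once per MIG UUID, wait for all, and return the exit codes."""
    procs = [
        subprocess.Popen([sys.executable, script], env=env_for_mig(uuid))
        for uuid in uuids
    ]
    return [p.wait() for p in procs]
```

For example, `launch_on_instances("gputest.py", ["MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf", "MIG-5bb83066-7217-5c42-b99e-fe75efe6d7e5"])` would run two copies side by side, each seeing only its own instance as GPU:0.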
That's all.