Notes on enabling MIG on a RHEL 8 machine with an NVIDIA H100 GPU.
First, install the NVIDIA driver.
sudo dnf update -y
sudo dnf install kernel-devel kernel-headers gcc make -y
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf -y install nvidia-driver-latest-dkms
Add the following to .bash_profile.
export PATH=/usr/local/bin:$PATH
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export TF_FORCE_GPU_ALLOW_GROWTH=true
Reboot once the installation has finished.
sudo reboot
Next, install CUDA.
sudo dnf install -y https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-keyring-1.0-1.el8.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf install -y cuda-drivers
Then enable MIG.
# Enable persistence mode
sudo nvidia-smi -pm 1
sudo nvidia-persistenced --persistence-mode
# Enable MIG mode
sudo nvidia-smi -mig 1
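Before moving on, it can help to confirm that MIG mode actually took effect. The following is a minimal sketch, not part of the original procedure: it assumes the `mig.mode.current` and `mig.mode.pending` query fields of `nvidia-smi --query-gpu`, and the helper names `parse_mig_modes` and `query_mig_modes` are made up for illustration.

```python
import shutil
import subprocess


def parse_mig_modes(csv_text: str) -> list[tuple[str, str]]:
    """Parse 'current, pending' CSV rows into (current, pending) tuples."""
    modes = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        current, pending = (field.strip() for field in line.split(","))
        modes.append((current, pending))
    return modes


def query_mig_modes() -> list[tuple[str, str]]:
    """Ask nvidia-smi for the current and pending MIG mode of every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=mig.mode.current,mig.mode.pending",
            "--format=csv,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_mig_modes(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        try:
            for index, (current, pending) in enumerate(query_mig_modes()):
                print(f"GPU {index}: current={current}, pending={pending}")
        except subprocess.CalledProcessError as exc:
            print(f"nvidia-smi failed: {exc}")
```

If the pending value differs from the current one, the mode change has not taken effect yet.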
Then check the available MIG GPU instance profiles.
$ sudo nvidia-smi mig -lgip
+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.12gb+me    20     1/1        10.75      No     16     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.24gb       15     4/4        21.62      No     26     1     0   |
|                                                             1     1     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.24gb       14     3/3        21.62      No     32     2     0   |
|                                                             2     2     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.47gb        9     2/2        46.38      No     60     3     0   |
|                                                             3     3     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.47gb        5     1/1        46.38      No     64     4     0   |
|                                                             4     4     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.94gb        0     1/1        93.12      No     132    7     0   |
|                                                             8     7     1   |
+-----------------------------------------------------------------------------+
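To pick a profile from a script rather than by eye, the table can be parsed. This is only a sketch under the assumption that the output keeps the column layout shown here (GPU, name, profile ID, free/total instances, memory in GiB); `GpuInstanceProfile` and `parse_lgip` are hypothetical names, not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple


class GpuInstanceProfile(NamedTuple):
    gpu: int
    name: str
    profile_id: int
    free: int
    total: int
    memory_gib: float


# Matches the first line of each profile entry, for example:
# |   0  MIG 1g.12gb       19     7/7        10.75      No     16     1     0   |
_ROW = re.compile(
    r"^\|\s+(\d+)\s+MIG\s+(\S+)\s+(\d+)\s+(\d+)/(\d+)\s+([\d.]+)\s"
)


def parse_lgip(text: str) -> list[GpuInstanceProfile]:
    """Extract one record per profile from `nvidia-smi mig -lgip` output."""
    profiles = []
    for line in text.splitlines():
        match = _ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or the second line of an entry
        gpu, name, profile_id, free, total, memory = match.groups()
        profiles.append(
            GpuInstanceProfile(
                int(gpu), name, int(profile_id), int(free), int(total), float(memory)
            )
        )
    return profiles
```

For example, `{p.name: p.total for p in parse_lgip(output)}` gives the maximum number of instances per profile, which is where the "7" for 1g.12gb comes from.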
For now, create seven 1g.12gb GPU instances; the -C flag also creates a compute instance inside each one.
$ sudo nvidia-smi mig -cgi 1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb,1g.12gb -C
Successfully created GPU instance ID 13 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 13 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 11 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 11 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 12 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 12 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 7 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 7 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 8 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 8 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 9 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 9 using profile MIG 1g.12gb (ID 0)
Successfully created GPU instance ID 10 on GPU 0 using profile MIG 1g.12gb (ID 19)
Successfully created compute instance ID 0 on GPU 0 GPU instance ID 10 using profile MIG 1g.12gb (ID 0)
Check that they were created. All seven are there.
$ sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances:                                        |
| GPU   Name             Profile  Instance   Placement  |
|                          ID       ID       Start:Size |
|=======================================================|
|   0  MIG 1g.12gb         19        7          0:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19        8          1:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19        9          2:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       10          3:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       11          4:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       12          5:1     |
+-------------------------------------------------------+
|   0  MIG 1g.12gb         19       13          6:1     |
+-------------------------------------------------------+
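The Python runs further down address MIG instances by UUID (`MIG-...`), which this listing does not show. One way to collect them is to parse `nvidia-smi -L`. The sketch below assumes the line format printed by recent drivers, `MIG 1g.12gb Device 0: (UUID: MIG-...)`; older drivers used a different MIG identifier format that this regex would not match. The function names are illustrative only.

```python
import re
import subprocess

# Assumed format of a MIG line in `nvidia-smi -L` on recent drivers:
#   MIG 1g.12gb     Device  0: (UUID: MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf)
_MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[0-9a-fA-F-]+)\)")


def parse_mig_uuids(listing: str) -> list[str]:
    """Return MIG device UUIDs in the order they appear in the listing."""
    return _MIG_UUID.findall(listing)


def list_mig_uuids() -> list[str]:
    """Run `nvidia-smi -L` and return the MIG UUIDs it reports."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], check=True, capture_output=True, text=True
    )
    return parse_mig_uuids(result.stdout)
```

Lines for the physical GPU (`UUID: GPU-...`) are skipped because the pattern only accepts the `MIG-` prefix.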
That completes the setup.
Now verify it from Python.
Create a venv.
python3.11 -m venv ~/python311_tf_env
# Activate the virtual environment
source ~/python311_tf_env/bin/activate
# Upgrade pip first and install packages
pip install --upgrade pip
pip install tensorflow[and-cuda]
pip install matplotlib
pip install scikit-learn
First, a Python script to check whether the MIG instances are recognized.
import os
import tensorflow as tf

# Enable memory growth to avoid allocating all GPU memory at once
physical_devices = tf.config.list_physical_devices('GPU')
print("Physical devices:", physical_devices)
if physical_devices:
    try:
        for gpu in physical_devices:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Memory growth enabled")
    except Exception as e:
        print(f"Error setting memory growth: {e}")

# Print detailed GPU info
print("TensorFlow version:", tf.__version__)
print("CUDA visible devices:", os.environ.get('CUDA_VISIBLE_DEVICES', 'Not set'))

# Try a simple GPU operation
try:
    with tf.device('/GPU:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
        c = tf.matmul(a, b)
        print("Matrix multiplication result:", c.numpy())
        print("GPU operation successful!")
except Exception as e:
    print(f"GPU operation failed: {e}")
Output:
$ python checkmig.py
Physical devices: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Memory growth enabled
TensorFlow version: 2.19.0
CUDA visible devices: MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf MIG-5bb83066-7217-5c42-b99e-fe75efe6d7e5 MIG-1f6a0e31-cc52-5909-b14e-3a742942bd35 MIG-56f1b757-9dfd-58cc-b886-8ebe2c442094 MIG-0da82664-471e-5a6b-9738-4a6acf986175 MIG-2e7ddef9-c2f3-5e0b-965f-f4024769c795 MIG-d3ec2ae4-8d3d-5317-920b-3b8219c3176d
Matrix multiplication result: [[19. 22.]
 [43. 50.]]
GPU operation successful!
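The MIG instance is recognized correctly. TensorFlow reports only one GPU even though seven MIG UUIDs are set in CUDA_VISIBLE_DEVICES, because a single CUDA process can enumerate only one MIG instance.
Next, run a Python script on the first MIG instance.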
# TensorFlow and keras
import tensorflow as tf
import keras
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.python.client import device_lib

print(tf.__version__)

# GPU available?
print(device_lib.list_local_devices())

# load the dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
train_images = train_images / 255.0
test_images = test_images / 255.0

# define and train the model
# Note: Fashion-MNIST has only 10 classes. The 4096-unit output layer is kept
# as in the original run; labels 0-9 are still valid indices for the loss.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(8192, activation=tf.nn.relu),
    keras.layers.Dense(4096, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)
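Output. The device list shows the NVIDIA H100 NVL MIG 1g.12gb instance, and training runs on it at about 9 to 10 s per epoch (5 ms/step).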
CUDA_VISIBLE_DEVICES=MIG-a9bb5f16-9a29-5172-9fe5-7af5908664bf python gputest.py
2.19.0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7443360534658311321
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 9541976064
locality {
  bus_id: 1
  links {
  }
}
incarnation: 7883565882498998567
physical_device_desc: "device: 0, name: NVIDIA H100 NVL MIG 1g.12gb, pci bus id: 0001:00:00.0, compute capability: 9.0"
xla_global_id: 416903419
]
Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.7798 - loss: 0.6827
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8595 - loss: 0.3803
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8797 - loss: 0.3253
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8930 - loss: 0.2903
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 9s 5ms/step - accuracy: 0.8960 - loss: 0.2788
<keras.src.callbacks.history.History at 0x7f9cdc376350>
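The run above pins one process to one MIG instance through CUDA_VISIBLE_DEVICES. To keep several instances busy, a natural extension is one process per instance, each with its own UUID. The following is a rough sketch of that idea, not something that was run for this memo; `env_for_mig` and `launch_on_each` are hypothetical helper names, and the assumption that each process should see exactly one MIG UUID reflects my understanding of NVIDIA's MIG documentation.

```python
import os
import subprocess
import sys


def env_for_mig(mig_uuid: str, base_env=None) -> dict[str, str]:
    """Copy the environment and expose only the given MIG instance to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_on_each(mig_uuids: list[str], script: str) -> list[int]:
    """Start `script` once per MIG UUID in parallel and return the exit codes."""
    procs = [
        subprocess.Popen([sys.executable, script], env=env_for_mig(uuid))
        for uuid in mig_uuids
    ]
    return [proc.wait() for proc in procs]
```

For example, `launch_on_each(uuids[:2], "gputest.py")` would start two copies of the training script, one on each of the first two instances, where `uuids` is a list of `MIG-...` strings.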
That's all.