{"id":2257,"date":"2025-06-11T10:34:43","date_gmt":"2025-06-11T01:34:43","guid":{"rendered":"https:\/\/www.kwonline.org\/memo2\/?p=2257"},"modified":"2025-06-26T13:58:22","modified_gmt":"2025-06-26T04:58:22","slug":"run-nvidia-a100-mig-on-azure-aks","status":"publish","type":"post","link":"https:\/\/www.kwonline.org\/memo2\/2025\/06\/11\/run-nvidia-a100-mig-on-azure-aks\/","title":{"rendered":"Using Nvidia MIG on Azure AKS #1"},"content":{"rendered":"<p>&nbsp;<br \/>\nFollowing up on my earlier post, I tried MIG again, so here are my notes.<\/p>\n<p><a href=\"https:\/\/www.kwonline.org\/memo2\/2025\/05\/13\/enable-nvidia-h100-gpu-mig\/\" target=\"_blank\">Using MIG on an Nvidia H100 GPU<\/a><\/p>\n<p>This time I combine Kubernetes and MIG using AKS.<br \/>\nUsing MIG through K8S really is more efficient.<\/p>\n<p>I first tried Standard_NC40ads_H100_v5, which has an H100, but because of the bug below the node could not join the node pool, so I switched to standard_nc24ads_a100_v4, which has an A100.<\/p>\n<p><a href=\"https:\/\/github.com\/Azure\/AKS\/issues\/5045\" target=\"_blank\">[BUG] Standard_NC40ads_H100_v5 node pool fails to start with AKSUbuntu-2204gen2containerd-202505.14.0 image<\/a><\/p>\n<p>Create the AKS cluster with the CLI, attaching an ACR to the cluster while we are at it.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n# Create AKS cluster with ACR attached\r\naz aks create --resource-group rg-oreno \\\r\n    --name orenoAKSCluster --generate-ssh-keys --location japaneast \\\r\n    --attach-acr orenoacr\r\n<\/pre>\n<p>Once the cluster is up, create 
.kube\/config.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n# Get credentials after AKS cluster is up and running\r\naz aks get-credentials --resource-group rg-oreno --name orenoAKSCluster --admin --overwrite-existing\r\n<\/pre>\n<p>Next, create the MIG node.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n# Add a GPU node pool with the MIG1g instance profile\r\naz aks nodepool add --name aksmignode --resource-group rg-oreno \\\r\n    --cluster-name orenoAKSCluster --node-vm-size standard_nc24ads_a100_v4 \\\r\n    --node-count 1 --gpu-instance-profile MIG1g\r\n<\/pre>\n<p>Install the Nvidia device plugin on AKS with helm.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n# Install NVIDIA device plugin\r\nhelm repo add nvdp https:\/\/nvidia.github.io\/k8s-device-plugin\r\nhelm repo update\r\n\r\nhelm install nvdp nvdp\/nvidia-device-plugin --version=0.17.0 --set migStrategy=mixed \\\r\n    --set gfd.enabled=true --namespace nvidia-device-plugin --create-namespace\r\n<\/pre>\n<p>Check that the MIG node has come up.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nkubectl get node\r\n<\/pre>\n<p>You should see a node whose name starts with aks-aksmignode, so describe it.<\/p>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\nkubectl describe node aks-aksmignode-12345678-vmss000000\r\n<\/pre>\n<p>If an entry containing nvidia.com\/mig appears under Allocatable, it is working.<\/p>\n<pre class=\"brush: yaml; title: ; notranslate\" title=\"\">\r\nAllocatable:\r\n    nvidia.com\/mig-1g.10gb:  7\r\n<\/pre>\n<p>As a test, schedule a pod and check whether it sees the MIG device.<\/p>\n<p>mig-test.yaml<\/p>\n<pre class=\"brush: yaml; title: ; 
notranslate\" title=\"\">\r\napiVersion: v1\r\nkind: Pod\r\nmetadata:\r\n  name: mig-test\r\nspec:\r\n  nodeSelector:\r\n    node.kubernetes.io\/instance-type: standard_nc24ads_a100_v4\r\n  containers:\r\n  - name: mig-test\r\n    image: nvidia\/cuda:12.1.1-base-ubuntu22.04\r\n    command: &#x5B;&quot;\/bin\/sh&quot;]\r\n    args: &#x5B;&quot;-c&quot;,&quot;sleep 100&quot;]\r\n    resources:\r\n      limits:\r\n        &quot;nvidia.com\/mig-1g.10gb&quot;: 1\r\n<\/pre>\n<pre class=\"brush: bash; title: ; notranslate\" title=\"\">\r\n$ kubectl apply -f mig-test.yaml\r\npod\/mig-test created\r\n\r\n$ kubectl get pod\r\nNAME       READY   STATUS    RESTARTS   AGE\r\nmig-test   1\/1     Running   0          9s\r\n\r\n$ kubectl exec mig-test -- nvidia-smi -L\r\nGPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-22dd61ea-460c-05ff-9cef-ce1769e463a8)\r\n  MIG 1g.10gb     Device  0: (UUID: MIG-aa23f925-0631-51b1-8754-0bfb379f6e5a)\r\n<\/pre>\n<p>The pod sees the MIG 1g.10gb instance, so the test succeeded.<\/p>\n<p>Reference URL: <a href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/aks\/gpu-multi-instance?tabs=azure-cli\" target=\"_blank\">https:\/\/learn.microsoft.com\/en-us\/azure\/aks\/gpu-multi-instance?tabs=azure-cli<\/a><\/p>\n<p>Continued in Part 2.<\/p>\n<p>Next: <a href=\"https:\/\/www.kwonline.org\/memo2\/2025\/06\/11\/run-nvidia-a100-mig-on-azure-aks-part2\/\">Using Nvidia MIG on Azure AKS #2<\/a><br \/>\n&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; Following up on my earlier post, I tried MIG again, so here are my notes. Using MIG on an Nvidia H100 GPU This time I combine Kubernetes and MIG using AKS. Using MIG through K8S really is more efficient 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[22,32,8],"tags":[],"class_list":["post-2257","post","type-post","status-publish","format-standard","hentry","category-azure","category-kubernetes","category-linux"],"_links":{"self":[{"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/posts\/2257","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/comments?post=2257"}],"version-history":[{"count":12,"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/posts\/2257\/revisions"}],"predecessor-version":[{"id":2289,"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/posts\/2257\/revisions\/2289"}],"wp:attachment":[{"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/media?parent=2257"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/categories?post=2257"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kwonline.org\/memo2\/wp-json\/wp\/v2\/tags?post=2257"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}