Installing the NVIDIA GPU Operator — NVIDIA GPU Operator 24.6.1 documentation

  • Prerequisites

  • Procedure

  • Common Chart Customization Options

  • Common Deployment Scenarios

    • Specifying the Operator Namespace

    • Preventing Installation of Operands on Some Nodes

    • Preventing Installation of NVIDIA GPU Driver on Some Nodes

    • Installation on Red Hat Enterprise Linux

    • Pre-Installed NVIDIA GPU Drivers

    • Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit

    • Pre-Installed NVIDIA Container Toolkit (but no drivers)

    • Running a Custom Driver Image

  • Specifying Configuration Options for containerd

    • Rancher Kubernetes Engine 2

    • MicroK8s

  • Verification: Running Sample GPU Applications

    • CUDA VectorAdd

    • Jupyter Notebook

  • Installation on Commercially Supported Kubernetes Platforms

Prerequisites

  1. You have the kubectl and helm CLIs available on a client machine.

    You can run the following commands to install the Helm CLI:

    $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
        && chmod 700 get_helm.sh \
        && ./get_helm.sh
  2. All worker nodes or node groups that run GPU workloads in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container. Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.

    For worker nodes or node groups that run CPU workloads only, the nodes can run any operating system because the GPU Operator does not perform any configuration or management of nodes for CPU-only workloads.

  3. Nodes must be configured with a container engine such as CRI-O or containerd.

  4. If your cluster uses Pod Security Admission (PSA) to restrict the behavior of pods, label the namespace for the Operator to set the enforcement policy to privileged:

    $ kubectl create ns gpu-operator
    $ kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
  5. Node Feature Discovery (NFD) is a dependency for the Operator on each node. By default, NFD master and worker are automatically deployed by the Operator. If NFD is already running in the cluster, then you must disable deploying NFD when you install the Operator, as shown in the example after this check.

    One way to determine if NFD is already running in the cluster is to check for an NFD label on your nodes:

    $ kubectl get nodes -o json | jq '.items[].metadata.labels | keys | any(startswith("feature.node.kubernetes.io"))'

    If the command output is true, then NFD is already running in the cluster.
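    If NFD is already running, the following is a minimal sketch of disabling the chart's own NFD deployment at install time with the nfd.enabled option (described in the chart customization options later on this page); adjust the namespace and other options for your environment:

    $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --set nfd.enabled=false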

Procedure

  1. Add the NVIDIA Helm repository:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
        && helm repo update
  2. Install the GPU Operator.
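
    A default installation with no customization typically looks like the following; this is a minimal sketch that installs the chart into the gpu-operator namespace, and the sections that follow describe common variations:

    $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator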

Common Chart Customization Options

The following options are available when using the Helm chart. These options can be used with --set when installing with Helm.

The following list identifies the most frequently used options. To view all the options, run helm show values nvidia/gpu-operator. Each entry shows the parameter, followed by its description and its default value.

ccManager.enabled

When set to true, the Operator deploys NVIDIA Confidential Computing Manager for Kubernetes. Refer to GPU Operator with Confidential Containers and Kata for more information.

Default: false

cdi.enabled

When set to true, the Operator installs two additional runtime classes, nvidia-cdi and nvidia-legacy, and enables the use of the Container Device Interface (CDI) for making GPUs accessible to containers. Using CDI aligns the Operator with the recent efforts to standardize how complex devices like GPUs are exposed to containerized environments.

Pods can specify spec.runtimeClassName as nvidia-cdi to use the functionality or specify nvidia-legacy to prevent using CDI to perform device injection.

Default: false

cdi.default

When set to true, the container runtime uses CDI to perform device injection by default.

Default: false

daemonsets.annotations

Map of custom annotations to add to all GPU Operator managed pods.

Default: {}

daemonsets.labels

Map of custom labels to add to all GPU Operator managed pods.

Default: {}

devicePlugin.config

Specifies the configuration for the NVIDIA Device Plugin as a config map.

In most cases, this field is configured after installing the Operator, such as to configure Time-Slicing GPUs in Kubernetes.

Default: {}

driver.enabled

By default, the Operator deploys NVIDIA drivers as a container on the system. Set this value to false when using the Operator on systems with pre-installed drivers.

Default: true

driver.repository

The images are downloaded from NGC. Specify another image repository when using custom driver images.

Default: nvcr.io/nvidia

driver.rdma.enabled

Controls whether the driver daemon set builds and loads the legacy nvidia-peermem kernel module.

You might be able to use GPUDirect RDMA without enabling this option. Refer to GPUDirect RDMA and GPUDirect Storage for information about whether you can use DMA-BUF or you need to use legacy nvidia-peermem.

Default: false

driver.rdma.useHostMofed

Indicates whether MLNX_OFED (MOFED) drivers are pre-installed on the host.

Default: false

driver.startupProbe

By default, the driver container has an initial delay of 60s before starting liveness probes. The probe runs the nvidia-smi command with a timeout duration of 60s. You can increase the timeoutSeconds duration if the nvidia-smi command runs slowly in your cluster.

Default: 60s

driver.useOpenKernelModules

When set to true, the driver containers install the NVIDIA Open GPU Kernel module driver.

Default: false

driver.usePrecompiled

When set to true, the Operator attempts to deploy driver containers that have precompiled kernel drivers. This option is available as a technology preview feature for select operating systems. Refer to the precompiled driver containers page for the supported operating systems.

Default: false

driver.version

Version of the NVIDIA datacenter driver supported by the Operator.

If you set driver.usePrecompiled to true, then set this field to a driver branch, such as 525.

Default: Depends on the version of the Operator. See the Component Matrix for more information on supported drivers.

gdrcopy.enabled

Enables support for GDRCopy. When set to true, the GDRCopy Driver runs as a sidecar container in the GPU driver pod. For information about GDRCopy, refer to the gdrcopy page.

You can enable GDRCopy if you use the NVIDIA GPU Driver Custom Resource Definition.

Default: false

kataManager.enabled

The GPU Operator deploys NVIDIA Kata Manager when this field is true. Refer to GPU Operator with Kata Containers for more information.

Default: false

mig.strategy

Controls the strategy to be used with MIG on supported NVIDIA GPUs. Options are either mixed or single.

Default: single

migManager.enabled

The MIG manager watches for changes to the MIG geometry and applies reconfiguration as needed. By default, the MIG manager only runs on nodes with GPUs that support MIG (for example, A100).

Default: true

nfd.enabled

Deploys the Node Feature Discovery plugin as a daemonset. Set this variable to false if NFD is already running in the cluster.

Default: true

nfd.nodefeaturerules

Installs node feature rules that are related to confidential computing. NFD uses the rules to detect security features in CPUs and NVIDIA GPUs. Set this variable to true when you configure the Operator for Confidential Containers.

Default: false

operator.labels

Map of custom labels that will be added to all GPU Operator managed pods.

Default: {}

psp.enabled

The GPU Operator deploys PodSecurityPolicies if enabled.

Default: false

sandboxWorkloads.defaultWorkload

Specifies the default type of workload for the cluster, one of container, vm-passthrough, or vm-vgpu.

Setting vm-passthrough or vm-vgpu can be helpful if you plan to run all or mostly virtual machines in your cluster. Refer to KubeVirt, Kata Containers, or Confidential Containers.

Default: container

toolkit.enabled

By default, the Operator deploys the NVIDIA Container Toolkit (nvidia-docker2 stack) as a container on the system. Set this value to false when using the Operator on systems with pre-installed NVIDIA runtimes.

Default: true
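
As an illustration of how these options combine, the following is a minimal sketch (not a recommended configuration) that installs the Operator for a cluster with pre-installed drivers, enables the CDI runtime classes, and selects the mixed MIG strategy; pick only the options that apply to your cluster:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set cdi.enabled=true \
    --set mig.strategy=mixed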

Common Deployment Scenarios

The following common deployment scenarios and sample commands apply best to bare metal hosts or virtual machines with GPU passthrough.

Specifying the Operator Namespace

Both the Operator and operands are installed in the same namespace. The namespace is configurable and is specified during installation. For example, to install the GPU Operator in the nvidia-gpu-operator namespace:

$ helm install --wait --generate-name \
    -n nvidia-gpu-operator --create-namespace \
    nvidia/gpu-operator

If you do not specify a namespace during installation, all GPU Operator components are installed in the default namespace.

Preventing Installation of Operands on Some Nodes

By default, the GPU Operator operands are deployed on all GPU worker nodes in the cluster. GPU worker nodes are identified by the presence of the label feature.node.kubernetes.io/pci-10de.present=true. The value 0x10de is the PCI vendor ID that is assigned to NVIDIA.

To disable operands from getting deployed on a GPU worker node, label the node with nvidia.com/gpu.deploy.operands=false.

$ kubectl label nodes $NODE nvidia.com/gpu.deploy.operands=false

Preventing Installation of NVIDIA GPU Driver on Some Nodes

By default, the GPU Operator deploys the driver on all GPU worker nodes in the cluster. To prevent installing the driver on a GPU worker node, label the node as shown in the following sample command.

$ kubectl label nodes $NODE nvidia.com/gpu.deploy.driver=false

Installation on Red Hat Enterprise Linux

In this scenario, use the NVIDIA Container Toolkit image that is built on UBI 8:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set toolkit.version=v1.16.1-ubi8

Replace the v1.16.1 value in the preceding command with the version that is supported with the NVIDIA GPU Operator. Refer to the GPU Operator Component Matrix on the platform support page.

When using RHEL8 with Kubernetes, SELinux must be enabled either in permissive or enforcing mode for use with the GPU Operator.Additionally, network restricted environments are not supported.
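
As a quick check on each worker node, you can confirm the current SELinux mode with the standard getenforce utility; this is a suggested sanity check rather than a required installation step:

$ getenforce

The command prints Enforcing, Permissive, or Disabled; the first two modes are supported with the GPU Operator on RHEL 8.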

Pre-Installed NVIDIA GPU Drivers

In this scenario, the NVIDIA GPU driver is already installed on the worker nodes that have GPUs:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false

Pre-Installed NVIDIA GPU Drivers and NVIDIA Container Toolkit

In this scenario, the NVIDIA GPU driver and the NVIDIA Container Toolkit are already installed on the worker nodes that have GPUs.

Tip

This scenario applies to NVIDIA DGX Systems that run NVIDIA Base OS.

Before installing the Operator, ensure that the default runtime is set to nvidia. Refer to Configuration in the NVIDIA Container Toolkit documentation for more information.
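
On a containerd-based node, one way to confirm the setting is to search the containerd configuration for the default runtime name; this is a hedged sketch that assumes the configuration file is at its default path:

$ sudo grep -n 'default_runtime_name' /etc/containerd/config.toml

The value should be nvidia. Follow the NVIDIA Container Toolkit documentation for the authoritative configuration steps for your container runtime.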

Install the Operator with the following options:

$ helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false \
    --set toolkit.enabled=false

Pre-Installed NVIDIA Container Toolkit (but no drivers)

In this scenario, the NVIDIA Container Toolkit is already installed on the worker nodes that have GPUs.

  1. Configure the toolkit to use /run/nvidia/driver as the root directory of the driver installation, because this is the path that the driver container mounts:

    $ sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
  2. Install the Operator with the following options (the driver container is still deployed because only the toolkit is disabled):

    $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --set toolkit.enabled=false
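
To confirm the change, you can check that the root option is no longer commented out; the command below assumes the default configuration path, and the entry should point to /run/nvidia/driver:

$ grep '^root' /etc/nvidia-container-runtime/config.toml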

Running a Custom Driver Image

If you want to use a custom driver container image, such as version 465.27, you can build a custom driver container image. Follow these steps:

  • Rebuild the driver container by specifying the $DRIVER_VERSION argument when building the Docker image. For reference, the driver container Dockerfiles are available on the Git repository at https://gitlab.com/nvidia/container-images/driver.

  • Build the container using the appropriate Dockerfile. For example:

    $ docker build --pull \
        --build-arg DRIVER_VERSION=455.28 \
        -t nvidia/driver:455.28-ubuntu20.04 \
        --file Dockerfile .

    Ensure that the driver container is tagged as shown in the example by using the driver:<version>-<os> schema.

  • Specify the new driver image and repository by overriding the defaults inthe Helm install command. For example:

    $ helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator \
        --set driver.repository=docker.io/nvidia \
        --set driver.version="465.27"

These instructions are provided for reference and evaluation purposes. Not using the standard releases of the GPU Operator from NVIDIA would mean limited support for such custom configurations.

Specifying Configuration Options for containerd

When you use containerd as the container runtime, the following configuration options are used with the container-toolkit deployed with the GPU Operator:

toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

If you need to specify custom values, refer to the following sample command for the syntax:

helm install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator $HELM_OPTIONS \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/etc/containerd/config.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true

These options are defined as follows:

CONTAINERD_CONFIG

The path on the host to the containerd config you would like to have updated with support for the nvidia-container-runtime. By default this will point to /etc/containerd/config.toml (the default location for containerd). It should be customized if your containerd installation is not in the default location.

CONTAINERD_SOCKET

The path on the host to the socket file used to communicate with containerd. The operator will use this to send a SIGHUP signal to the containerd daemon to reload its config. By default this will point to /run/containerd/containerd.sock (the default location for containerd). It should be customized if your containerd installation is not in the default location.

CONTAINERD_RUNTIME_CLASS

The name of the Runtime Class you would like to associate with the nvidia-container-runtime. Pods launched with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will always run with the nvidia-container-runtime. The default CONTAINERD_RUNTIME_CLASS is nvidia.

CONTAINERD_SET_AS_DEFAULT

A flag indicating whether you want to set nvidia-container-runtime as the default runtime used to launch all containers. When set to false, only containers in pods with a runtimeClassName equal to CONTAINERD_RUNTIME_CLASS will be run with the nvidia-container-runtime. The default value is true.
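
If you prefer a values file to repeated --set flags, you can save the toolkit.env YAML shown at the start of this section to a file and pass it with -f; the file name values.yaml below is only an example:

helm install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator $HELM_OPTIONS \
    -f values.yaml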

Rancher Kubernetes Engine 2

For Rancher Kubernetes Engine 2 (RKE2), set the following in the ClusterPolicy.

toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

These options can be passed to the GPU Operator at install time as shown below.

helm install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator $HELM_OPTIONS \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true

MicroK8s

For MicroK8s, set the following in the ClusterPolicy.

toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/snap/microk8s/current/args/containerd-template.toml
  - name: CONTAINERD_SOCKET
    value: /var/snap/microk8s/common/run/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"

These options can be passed to the GPU Operator at install time as shown below.

helm install gpu-operator -n gpu-operator --create-namespace \
    nvidia/gpu-operator $HELM_OPTIONS \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value=true

Verification: Running Sample GPU Applications

CUDA VectorAdd

In the first example, let’s run a simple CUDA sample, which adds two vectors together:

  1. Create a file, such as cuda-vectoradd.yaml, with contents like the following:

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda-vectoradd
        image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
        resources:
          limits:
            nvidia.com/gpu: 1
  2. Run the pod:

    $ kubectl apply -f cuda-vectoradd.yaml

    The pod starts, runs the vectorAdd command, and then exits.

  3. View the logs from the container:

    $ kubectl logs pod/cuda-vectoradd

    Example Output

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done
  4. Remove the stopped pod:

    $ kubectl delete -f cuda-vectoradd.yaml

    Example Output

    pod "cuda-vectoradd" deleted
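
If the pod stays in Pending instead of running, a quick way to confirm that the GPU Operator has advertised GPU resources on your nodes is to inspect the node allocatable resources; this check is a suggestion and not part of the official verification steps:

$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'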

Jupyter Notebook

You can perform the following steps to deploy Jupyter Notebook in your cluster:

  1. Create a file, such as tf-notebook.yaml, with contents like the following example:

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      type: NodePort
      ports:
      - port: 80
        name: http
        targetPort: 8888
        nodePort: 30001
      selector:
        app: tf-notebook
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: tf-notebook
      labels:
        app: tf-notebook
    spec:
      securityContext:
        fsGroup: 0
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:latest-gpu-jupyter
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8888
          name: notebook
  2. Apply the manifest to deploy the pod and start the service:

    $ kubectl apply -f tf-notebook.yaml
  3. Check the pod status:

    $ kubectl get pod tf-notebook

    Example Output

    NAMESPACE   NAME          READY   STATUS    RESTARTS   AGE
    default     tf-notebook   1/1     Running   0          3m45s
  4. Because the manifest includes a service, get the external port for the notebook:

    $ kubectl get svc tf-notebook

    Example Output

    NAME          TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
    tf-notebook   NodePort   10.106.229.20   <none>        80:30001/TCP   4m41s
  5. Get the token for the Jupyter notebook:

    $ kubectl logs tf-notebook

    Example Output

    [I 21:50:23.188 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
    [I 21:50:23.390 NotebookApp] Serving notebooks from local directory: /tf
    [I 21:50:23.391 NotebookApp] The Jupyter Notebook is running at:
    [I 21:50:23.391 NotebookApp] http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
    [I 21:50:23.391 NotebookApp]  or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
    [I 21:50:23.391 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 21:50:23.394 NotebookApp]
        To access the notebook, open this file in a browser:
            file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
        Or copy and paste one of these URLs:
            http://tf-notebook:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9
         or http://127.0.0.1:8888/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9

The notebook should now be accessible from your browser at this URL:http://your-machine-ip:30001/?token=3660c9ee9b225458faaf853200bc512ff2206f635ab2b1d9.
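
When you are finished with the notebook, you can remove the pod and service with the same manifest; this mirrors the cleanup step in the CUDA sample above:

$ kubectl delete -f tf-notebook.yaml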

Installation on Commercially Supported Kubernetes Platforms

Product                                              Documentation
---------------------------------------------------  ------------------------------------------------------------
Red Hat OpenShift 4 using RHCOS worker nodes          NVIDIA GPU Operator on Red Hat OpenShift Container Platform
VMware vSphere with Tanzu and NVIDIA AI Enterprise    NVIDIA AI Enterprise VMware vSphere Deployment Guide
Google Cloud Anthos                                   NVIDIA GPUs with Google Anthos
