Configure Kubernetes Cluster to Use Nvidia GPU


As an open source platform for automating the deployment, scaling, and management of containerized applications, Kubernetes is often chosen for deploying web services in general, and that includes applications that serve ML models. On that basis, in this article we will look at how Kubernetes components interact with Nvidia GPUs and then practice on a simple k8s cluster.

Nvidia GPU on Kubernetes

A quick recap first: Kubernetes manages its workloads in units called Pods, which are designed to model an application-specific "logical host" and can contain one or more relatively tightly coupled application containers. Containers in a Pod share an IP address and port space; they are always co-located, co-scheduled, and run in a shared context on the same Node.

[Image: Kubernetes Pods and Nodes, used from https://kubernetes.io/en/docs/tutorials/kubernetes-basics/explore/explore-intro/]

Building a Kubernetes cluster for testing is quite simple, as we can use the k3s installation via a single command as follows:
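Something along these lines (I disable traefik at install time, as explained right below; adjust the flags to your needs):

```bash
# install k3s on the first node, skipping the bundled traefik ingress controller
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable traefik" sh -
```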

For those of you wondering, I'm not deploying traefik because this Ingress controller is not very popular; after testing other tools such as Seldon Core I had to remove it anyway, so I just don't install it from the beginning ┐( ̄ヮ ̄)┌ . In this demo I will have a cluster with 2 nodes, because my machine does not have a GPU while the other machine does (¬_¬;). Installing a new node and adding it to the cluster is done easily via the command curl -sfL https://get.k3s.io | K3S_URL=https://10.0.37.144:6443 K3S_TOKEN=$K3S_TOKEN sh - , with the value of K3S_TOKEN taken from /var/lib/rancher/k3s/server/node-token on the master node. If you are not too unlucky, the result will look like this:
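After both nodes have joined, a quick check should show something like this (output is illustrative; extra columns omitted):

```bash
kubectl get nodes
# NAME          STATUS   ROLES                  AGE
# b122436-pc    Ready    control-plane,master   15m
# b120639-pc3   Ready    <none>                 2m
```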

Here, node b120639-pc3 will be the node with a GPU and node b122436-pc is mine (;⌣̀_⌣́). To see whether a default k8s cluster can already use an Nvidia GPU, we try deploying a pod with textbook-standard content; we have to include the GPU resource in the manifest because the GPU is a special resource, you can't just use it without asking:
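A minimal pod like the one below is enough (the CUDA sample image is just an example, any GPU-capable image works):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # ask for one whole GPU
```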

It would be nice if it just worked, but life is not that easy =)))))))) (otherwise why would I write this article), and we will get the following result:
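Since nothing in the cluster advertises the nvidia.com/gpu resource yet, the pod just sits in Pending with an event roughly like this:

```bash
kubectl describe pod gpu-test
# Events:
#   Warning  FailedScheduling  default-scheduler
#   0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
```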

To rescue this situation, let's dig into a few things below.

Kubernetes Device Plugin for GPU

Currently, Kubernetes supports the management of GPUs (not only Nvidia but also AMD and Intel) through the Device Plugin framework, but these plugins are not pre-installed: we need to install the GPU driver ourselves and configure the corresponding device plugin based on the GPU manufacturer's instructions. Once the plugin is installed successfully, our cluster will expose a custom schedulable resource such as amd.com/gpu or nvidia.com/gpu. There are a few caveats when using them:

  • You can specify GPU limits without specifying requests, because Kubernetes will use the limits as the request value by default.
  • You can specify GPU in both limits and requests, but the two values must be equal.
  • You cannot specify GPU requests without specifying limits.

A sample manifest for a pod that uses the GPU would then look like this:
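For example (the image name is only an example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1   # requests defaults to the same value
```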

If different nodes in the cluster have different types of GPUs, we can use node labels and node selectors to schedule pods onto the appropriate nodes, like this:
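Roughly like this (the accelerator label value is just an example, label your nodes however you like):

```bash
# label each node with the GPU model it carries
kubectl label nodes b120639-pc3 accelerator=nvidia-rtx-2080ti
```

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-rtx-2080ti   # only schedule onto nodes carrying this label
```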

NVIDIA device plugin for Kubernetes

To support customers using NVIDIA GPUs in Kubernetes clusters, NVIDIA has made the NVIDIA device plugin available at https://github.com/NVIDIA/k8s-device-plugin . Essentially, the NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically:

  • Expose the number of GPUs on each node of the cluster
  • Keep track of GPU health
  • Run GPU-enabled containers in the cluster

Installation can be done through the steps listed at https://github.com/NVIDIA/k8s-device-plugin#quick-start , though you won't have to do them by hand here: these steps will be handled by the NVIDIA GPU Operator.

NVIDIA GPU Operator on Kubernetes Cluster


As mentioned in the previous section, Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and others through the device plugin framework. However, configuring and managing nodes with these hardware resources requires configuring many software components such as drivers, the container runtime or other libraries, which is difficult and error prone.

To get around this, the NVIDIA GPU Operator is built on the Kubernetes operator framework to automate the management of all the NVIDIA software components needed to provision the GPU. These components include:

  • the NVIDIA GPU driver
  • the NVIDIA Container Toolkit (container runtime configuration)
  • the NVIDIA Kubernetes Device Plugin
  • GPU Feature Discovery for automatic node labelling
  • DCGM-based monitoring

The NVIDIA GPU Operator is delivered as a Helm chart and can be installed easily by following these steps:

First, we need to install helm if not already available via the command:
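For example with the official install script:

```bash
# download and run the helm 3 install script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh
```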

And then add the NVIDIA Helm repository as follows:
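Per the GPU Operator instructions, that is:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update
```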

The GPU Operator's chart will automatically install the necessary components, but before using it we should review its values at https://github.com/NVIDIA/gpu-operator/tree/master/deployments/gpu-operator . If nothing needs to be changed, we create a release from that chart with the following command:
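With the default values (the release name is auto-generated here):

```bash
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```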

For machines with NVIDIA driver available, we can skip the driver configuration to avoid problems by setting driver.enabled to false as follows:
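That is, passing the value when creating the release:

```bash
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false   # the NVIDIA driver is already installed on the host
```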

Similarly, if the runtime environment is already configured to use the NVIDIA Container Toolkit , you can customize the value of the chart release as follows:
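For instance, when both the driver and the container toolkit are already present on the host:

```bash
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set toolkit.enabled=false   # NVIDIA Container Toolkit already configured
```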

Finally, after a billion years of waiting, we will have a new namespace named gpu-operator created and a billion other things in it like this:
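Listing the pods in that namespace should show components roughly like these (names carry random suffixes):

```bash
kubectl get pods -n gpu-operator
# gpu-operator-...                        the operator itself
# nvidia-device-plugin-daemonset-...      exposes nvidia.com/gpu on GPU nodes
# nvidia-container-toolkit-daemonset-...  configures the container runtime
# nvidia-dcgm-exporter-...                GPU metrics
# gpu-feature-discovery-...               node labels such as nvidia.com/gpu.product
# nvidia-operator-validator-...           sanity checks
```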

To check if the gpu-operator is working, we can use the manifest to create a Deployment as follows:
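A small Deployment such as the one below will do; the CUDA base image and the sleep command are just placeholders to keep the pod alive:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda
          image: "nvidia/cuda:11.8.0-base-ubuntu22.04"
          command: ["sh", "-c", "nvidia-smi && sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1   # one whole GPU per pod for now
```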

The results obtained will be as follows:
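A quick check:

```bash
kubectl get pods -o wide
# the gpu-test pod should be Running and placed on the GPU node (b120639-pc3)
```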

This means that the pod requesting an Nvidia GPU has been scheduled, and fortunately my setup was error free.

☆*:.。.o(≧▽≦)o.。.:*☆ ☆*:.。.o(≧▽≦)o.。.:*☆ ☆*:.。.o(≧▽≦)o.。.:*☆

MIG Support on Kubernetes

So, as above, we can configure the components on Kubernetes to be able to use the GPU. However, the nvidia.com/gpu resource is by default counted according to the number of physical GPUs, and since we are not allowed to write something like nvidia.com/gpu: 0.5, two pods cannot share the same GPU. That leads to wasted resources, because modern GPUs often have large VRAM (e.g. 11GB on a 2080Ti) and not all machine learning models use up that much.

To make this more concrete, when we increase replicas to 2 in the manifest above, we immediately hit the error 0/2 nodes are available: 2 Insufficient nvidia.com/gpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod. and the new pod cannot be scheduled. To solve this problem, NVIDIA introduced the Multi-Instance GPU solution, which partitions the GPU into up to seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This gives you the ability to support any workload, from the smallest to the largest, with guaranteed quality of service (QoS), and extends the reach of accelerated computing resources to all users.


However, life is not so easy: Multi-Instance GPU only supports GPUs based on the Ampere architecture and newer, more specifically the H100, A100 and A30, so we need another solution when we don't have money. (눈_눈)(눈_눈)(눈_눈)

Time-Slicing GPUs in Kubernetes

As introduced above, the latest generations of NVIDIA GPUs offer an operating mode known as Multi-Instance GPU or MIG . MIG allows us to partition the GPU into many smaller, predefined instances, each of which looks like a mini GPU providing memory and fault isolation at the hardware layer. With such a split, you can share access to the GPU by running the workload on one of these predefined instances instead of the full native GPU.

However if:

  • You don’t like to use MIG because it’s abbreviated as your ex’s name
  • You are willing to trade the isolation provided by MIG for the ability to share the GPU among a larger number of users
  • You don’t have money to buy a new GPU

To address this, the NVIDIA GPU Operator enables GPU oversubscription through an extended set of options for the NVIDIA Kubernetes Device Plugin, allowing workloads placed on an oversubscribed GPU to interleave with one another. This GPU "time sharing" mechanism in Kubernetes lets us define a set of "replicas" for a GPU, each of which can be handed out independently to a pod to run workloads on. Unlike MIG, there is no memory or fault isolation between replicas, but for some people sometimes nobody cares, and the GPU Time-Slicing mechanism is simply used to multiplex workloads across replicas of the same underlying GPU.

To configure shared access to the GPU with GPU Time-Slicing, we need to provide a time-slicing configuration to the NVIDIA Kubernetes Device Plugin through a ConfigMap.

A sample ConfigMap provided by the textbook is as follows:
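Assuming we name the ConfigMap time-slicing-config (any name works, it just has to match the Operator setting later):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise 4 slices per physical T4
```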

With the above configuration, we allow up to 4 pods to use the same T4. Similarly, the config for an A100 would look like this:
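For example (the replica counts here are just an illustration):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  a100-40gb: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
          - name: nvidia.com/mig-1g.5gb   # MIG-backed resources can also be sliced
            replicas: 2
```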

I’m not that rich, so the config I used to experiment with the 2080Ti will be as follows:
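Six slices per GPU, to match the 1/6 unit used later (the rtx-2080ti key name is arbitrary):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  rtx-2080ti: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 6   # the GPU will be advertised as 6 schedulable units
```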

To enable time-slicing with the NVIDIA GPU Operator, we pass the name of the ConfigMap created above to the devicePlugin.config.name parameter as follows:
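Assuming the Helm release is called gpu-operator (check the actual name with helm list -n gpu-operator):

```bash
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config
```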

The time-slicing configuration can be applied at the cluster level or per node. By default, the GPU Operator does not apply the time-slicing configuration to any GPU node in the cluster; we have to specify it explicitly with devicePlugin.config.default, and we can update it with the following command:
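For example, making the rtx-2080ti entry the cluster-wide default:

```bash
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=rtx-2080ti   # key inside the ConfigMap to apply by default
```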

To check if the configuration is ok, we can check the node information as follows:
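Something like this (the values shown are illustrative for my node):

```bash
kubectl describe node b120639-pc3 | grep -i nvidia.com/gpu
# expected, roughly:
#   nvidia.com/gpu.product=NVIDIA-GeForce-RTX-2080-Ti-SHARED
#   nvidia.com/gpu.replicas=6
#   nvidia.com/gpu: 6        (under Capacity and Allocatable)
```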

The above output shows that our configuration was successful: the SHARED suffix has been added to the nvidia.com/gpu.product label.

And now the basic unit of resource management on this node is 1/6 of the GPU, so we can reconfigure the Deployment above as follows:
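For instance, scaling the same Deployment to six replicas, each still requesting one nvidia.com/gpu unit:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
spec:
  replicas: 6                      # six pods can now share the single time-sliced GPU
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
        - name: cuda
          image: "nvidia/cuda:11.8.0-base-ubuntu22.04"
          command: ["sh", "-c", "nvidia-smi && sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu: 1    # one unit = 1/6 of the physical GPU now
```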

The results obtained will be:
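A final check:

```bash
kubectl get pods -l app=gpu-test
# all six replicas should end up Running on the GPU node instead of staying Pending
```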

Summary

This article covered how Kubernetes components interact with Nvidia GPUs and discussed how to use GPUs on a Kubernetes cluster more efficiently through Time-Slicing. Configuring a Kubernetes cluster to use Nvidia GPUs is a fundamental step toward deploying services that rely on them, such as Triton Inference Server, and in the next article (if I get around to it) we will explore such a model serving solution together. This is the end of the article; thank you all for taking the time to read.

References


Source: Viblo