Running a vLLM GPU Workload on k3s

pnu-code-place/code-place: Repository for Code Place, Pusan National University's coding practice platform

To deploy the vLLM-based AI assistant for Code Place, the inference server had to run reliably on a GPU node. On a local machine or a single Docker host, things tend to feel mostly fine once the GPU is recognized. On k3s, there were many more conditions to check.

Even when nvidia-smi worked on the host, the Pod sometimes failed to find CUDA libraries. In some cases, the scheduler could not see the GPU resource. In other cases, even after specifying a RuntimeClass, the Pod did not seem to run with the runtime I expected.

In the end, the question was not simply "Does this server have a GPU?" It was "Is Kubernetes actually passing that GPU all the way to the Pod?"

This post is a record of how I checked the conditions needed to run the Code Place AI feature on k3s.

Problem

The errors I encountered looked like separate issues at first.

libcuda.so.1 not found
Triton is installed but 0 active driver(s) found
No CUDA runtime is found
Insufficient nvidia.com/gpu
A Pod not running in the expected environment even with RuntimeClass set

At first glance, some of these looked like library problems, some like driver problems, and some like scheduling problems. But when I checked them step by step, they pointed in a similar direction: the GPU on the host might not be reaching the container properly.

Cause Check

The first thing I did was avoid assuming that all GPU-related errors had one cause. The symptom may look the same (the GPU is not visible), but the stage where it gets blocked can differ.

flowchart TB
    Host["Host Driver\nnvidia-smi"]
    Runtime["NVIDIA Container Runtime"]
    Containerd["k3s containerd\nruntime handler"]
    RuntimeClass["RuntimeClass"]
    DevicePlugin["NVIDIA Device Plugin\nnvidia.com/gpu"]
    Pod["vLLM Pod"]

    Host --> Runtime --> Containerd --> RuntimeClass --> Pod
    DevicePlugin --> Pod

Host

I first checked whether the driver was working, whether nvidia-smi ran on the host, and whether the CUDA and driver versions matched the requirements. If this part is broken, the workload is blocked before Kubernetes is even involved.
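As a rough sketch of the host check, the following assumes the output of `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` has already been captured as text. The helper name `parse_gpu_query` and the sample line are made up for illustration; they are not output from the actual node.

```python
def parse_gpu_query(output: str) -> list[dict[str, str]]:
    """Parse `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`."""
    gpus = []
    for line in output.splitlines():
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank or unexpected lines
        # Split from the right: the driver version never contains a comma.
        name, driver = (part.strip() for part in line.rsplit(",", 1))
        gpus.append({"name": name, "driver_version": driver})
    return gpus


# Illustrative sample line, not captured from a real machine.
sample = "NVIDIA Example GPU, 550.00.00\n"
gpus = parse_gpu_query(sample)
```

An empty result here means the problem is on the host itself, so there is little point in looking at Kubernetes yet.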

Container runtime

Next, I checked whether nvidia-container-runtime was installed and whether NVIDIA Container Toolkit was configured correctly. The host had to be able to pass GPU devices into containers before moving on to the Kubernetes side.

k3s / containerd

Then I checked whether the NVIDIA runtime was registered in the containerd configuration used by k3s. I also checked whether the RuntimeClass pointed to the actual runtime handler. k3s can have different runtime configuration paths and reload behavior compared to a typical Kubernetes setup, so where the setting is placed matters.
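One way to make this check concrete is to scan the containerd configuration for registered runtime handlers. The sketch below is a lightweight regex scan, not a full TOML parser, and the sample config is illustrative. On k3s the generated file usually lives at /var/lib/rancher/k3s/agent/etc/containerd/config.toml, but the exact plugin section name depends on the containerd version, so treat both as assumptions to verify on your own node.

```python
import re

# Matches TOML table headers such as
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
_RUNTIME_TABLE = re.compile(
    r"""^\s*\[.*\.containerd\.runtimes\.["']?([A-Za-z0-9_-]+)["']?(?:\.options)?\]\s*$""",
    re.MULTILINE,
)


def registered_handlers(config_toml: str) -> set[str]:
    """Return the runtime handler names declared in a containerd config."""
    return set(_RUNTIME_TABLE.findall(config_toml))


# Illustrative config fragment, not copied from the real node.
sample_config = '''
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
'''

handlers = registered_handlers(sample_config)
```

If the handler name that the RuntimeClass refers to is not in this set, the Pod cannot be launched through the NVIDIA runtime no matter what the Pod spec says.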

Kubernetes resources

I checked whether the device plugin was advertising the GPU as nvidia.com/gpu. The scheduler has to see that resource for the Pod's GPU request and limit to mean anything.
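A small sketch of this check, assuming the JSON from `kubectl get nodes -o json` has been loaded into a dict. The node names in the sample are hypothetical.

```python
def gpu_allocatable(node_list: dict, resource: str = "nvidia.com/gpu") -> dict[str, int]:
    """Map node name -> allocatable count of `resource`.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    A node that does not advertise the resource is reported as 0.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        result[name] = int(allocatable.get(resource, "0"))
    return result


# Hypothetical sample: one GPU node and one node without the device plugin.
sample_nodes = {
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
}

counts = gpu_allocatable(sample_nodes)
```

A count of 0 on the node that physically has the GPU suggests looking at the device plugin before touching the workload.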

Workload

Finally, I checked whether the Pod used the correct RuntimeClass, whether the container image and execution method matched the expected CUDA environment, and whether vLLM recognized the GPU and CUDA libraries at startup.

After separating the problem this way, it became easier to tell which stage each error was pointing to.
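The separation above can be sketched as a small lookup from an error message to the stages worth checking first. The mapping is only my reading of the symptoms in this post, not an authoritative diagnosis table, and the function name `stages_for` is made up for illustration.

```python
# Stage names follow the sections above. Several messages share the same
# starting points because they all hint that the GPU is not reaching the
# container.
RUNTIME_PATH = ["container runtime", "k3s / containerd", "workload"]

FIRST_STAGES_TO_CHECK = {
    "libcuda.so.1": RUNTIME_PATH,
    "0 active driver(s) found": RUNTIME_PATH,
    "No CUDA runtime is found": RUNTIME_PATH,
    "Insufficient nvidia.com/gpu": ["Kubernetes resources"],
}


def stages_for(message: str) -> list[str]:
    """Return the stages to inspect first for a log or event message."""
    for needle, stages in FIRST_STAGES_TO_CHECK.items():
        if needle in message:
            return stages
    return []  # unknown message: no shortcut, walk the stages in order


hint = stages_for("0/1 nodes are available: 1 Insufficient nvidia.com/gpu.")
```

The host stage is left out of the lookup on purpose: it is checked first regardless of the message.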

Error Check

An error like libcuda.so.1 not found can look like the CUDA library is missing from the container image. But in a Kubernetes GPU environment, it does not always mean the image is the only problem.

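It can also mean that the NVIDIA libraries and devices from the host were not mounted into the container properly. libcuda.so.1 is part of the host driver rather than the CUDA toolkit, so it normally reaches the container through the NVIDIA container runtime, not through the image.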

So when I saw this message, I checked the following points.

Is this Pod really running with the NVIDIA runtime?
Is the NVIDIA runtime handler registered in the k3s/containerd configuration?
Does the RuntimeClass name match the actual handler name?
Can the container see the GPU device?

After this, I stopped treating GPU-related errors as application library problems right away.
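The first three questions can be sketched as one consistency check across the Pod, the RuntimeClass objects, and the handlers containerd knows about. The inputs are plain dicts as returned by `kubectl get ... -o json`; the function name and the sample values (including the deliberately mismatched handler) are hypothetical.

```python
def check_runtime_class(pod: dict, runtime_classes: list[dict], handlers: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the chain is consistent."""
    rc_name = pod.get("spec", {}).get("runtimeClassName")
    if rc_name is None:
        return ["Pod sets no runtimeClassName, so the default runtime is used"]

    by_name = {rc["metadata"]["name"]: rc for rc in runtime_classes}
    rc = by_name.get(rc_name)
    if rc is None:
        return [f"RuntimeClass {rc_name!r} does not exist"]

    handler = rc.get("handler")
    if handler not in handlers:
        return [f"RuntimeClass {rc_name!r} points to handler {handler!r}, "
                "which containerd does not list"]
    return []


# Hypothetical sample with a mismatched handler name.
sample_pod = {"spec": {"runtimeClassName": "nvidia"}}
sample_classes = [{"metadata": {"name": "nvidia"}, "handler": "nvidia-runtime"}]
problems = check_runtime_class(sample_pod, sample_classes, handlers={"runc", "nvidia"})
```

The last question, whether the container can see the GPU device, still has to be answered from inside the running container.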

Approach

RuntimeClass and the device plugin were both needed, but their roles were different.

RuntimeClass determined which runtime handler the Pod would be launched with. Even if the host had the NVIDIA runtime installed, the Pod might not receive GPU devices and related libraries correctly unless it was launched through that runtime.

The device plugin was what made Kubernetes recognize the GPU as a schedulable node resource. If the device plugin was missing or not working correctly, the scheduler could fail to see nvidia.com/gpu.
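To show where each piece appears in a manifest, here is a minimal sketch that builds a RuntimeClass and a Pod as Python dicts and serializes them to JSON, which `kubectl apply -f` accepts. The names `nvidia` and `vllm` and the image tag are placeholders I chose for illustration, and the vLLM model arguments are omitted; this is not the actual Code Place manifest.

```python
import json

RUNTIME_CLASS_NAME = "nvidia"  # assumed name; must match what the Pod references

runtime_class = {
    "apiVersion": "node.k8s.io/v1",
    "kind": "RuntimeClass",
    "metadata": {"name": RUNTIME_CLASS_NAME},
    # Must equal a runtime handler registered in containerd.
    "handler": "nvidia",
}

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm"},
    "spec": {
        # Role 1: RuntimeClass selects the runtime the Pod is launched with.
        "runtimeClassName": RUNTIME_CLASS_NAME,
        "containers": [
            {
                "name": "vllm",
                "image": "vllm/vllm-openai:latest",  # placeholder image tag
                # Role 2: the GPU limit only works if the device plugin
                # advertises nvidia.com/gpu on some node.
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

manifest_json = json.dumps(
    {"apiVersion": "v1", "kind": "List", "items": [runtime_class, pod]}, indent=2
)
```

Dropping either piece fails differently: without the GPU limit the scheduler never accounts for the GPU, and without `runtimeClassName` the Pod starts with the default runtime.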

I also tried not to read Insufficient nvidia.com/gpu as just a simple lack of GPU capacity. I checked whether the device plugin was failing to advertise the resource, whether another Pod was already using the GPU, and whether node constraints or taints and tolerations were involved.
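Two of those questions can be sketched over the JSON from `kubectl get pods -A -o json` and `kubectl get node <name> -o json`: how many GPUs are already claimed on the node, and which taints the pending Pod does not tolerate. This is a simplified sketch: it ignores init containers, and the node name and taint key are hypothetical.

```python
GPU = "nvidia.com/gpu"


def gpus_in_use(pod_list: dict, node_name: str) -> int:
    """Sum GPU limits of unfinished Pods bound to `node_name`."""
    total = 0
    for pod in pod_list.get("items", []):
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        if pod.get("status", {}).get("phase") in ("Succeeded", "Failed"):
            continue  # finished Pods no longer hold the GPU
        for container in pod["spec"].get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            total += int(limits.get(GPU, 0))
    return total


def tolerates(toleration: dict, taint: dict) -> bool:
    """Apply the Kubernetes toleration matching rules to a single taint."""
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def untolerated_taints(node: dict, pod: dict) -> list[dict]:
    """Taints that would keep `pod` off `node`."""
    taints = node.get("spec", {}).get("taints", [])
    tolerations = pod.get("spec", {}).get("tolerations", [])
    return [t for t in taints
            if t.get("effect") in ("NoSchedule", "NoExecute")
            and not any(tolerates(tol, t) for tol in tolerations)]


# Hypothetical sample data.
sample_node = {
    "metadata": {"name": "gpu-node-1"},
    "spec": {"taints": [{"key": "example.com/gpu-only", "value": "true",
                         "effect": "NoSchedule"}]},
    "status": {"allocatable": {GPU: "1"}},
}
sample_pods = {"items": [{
    "spec": {"nodeName": "gpu-node-1",
             "containers": [{"name": "other", "resources": {"limits": {GPU: "1"}}}]},
    "status": {"phase": "Running"},
}]}
pending_pod = {"spec": {"containers": [], "tolerations": []}}

free = int(sample_node["status"]["allocatable"][GPU]) - gpus_in_use(sample_pods, "gpu-node-1")
blocked_by = untolerated_taints(sample_node, pending_pod)
```

In this sample the GPU is advertised but already claimed, and the taint would block the Pod as well, so fixing only one of the two would not be enough.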

After Applying

Once the container recognized the GPU and vLLM started running, it felt like the problem was solved. But in reality, that was only the start of the next set of checks. Recognizing the GPU and operating an inference server reliably are different problems.

After that, I still had to deal with vLLM settings, memory usage, concurrent requests, and context length. In this post, however, I focused on organizing the conditions required before a GPU inference workload can run on Kubernetes.

Checkpoints

After this experience, my checklist for GPU inference servers became longer. Before, I thought that if a model ran with Docker on a GPU server, one big step was done. Now I check more items before treating it as ready.

The first things I check are:

Can the GPU workload be scheduled onto the intended node?
Does the runtime configuration remain after restart?
Can the scheduler see the GPU resource?
Can I explain what conditions the Pod needs in order to use the GPU?

Using a GPU on Kubernetes rests on several assumptions. Having a GPU node, nvidia-smi working on the host, and the Pod reaching the Running state are all necessary, but they may still not be enough.

Takeaway

This experience was meaningful not just because I ran vLLM, but because I checked the path one step at a time and organized the conditions needed for a GPU inference workload to work.

After this, when a GPU-related issue happens, I try not to look only at the application logs. I check the host, runtime, k3s containerd, device plugin, and Pod spec in order. Having that order made similar problems easier to narrow down later.
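That order can be sketched as a list of named checks that stops at the first failure. The check functions below are fixed placeholders standing in for real probes, so the example only demonstrates the control flow.

```python
from typing import Callable, Optional


def first_failing_stage(checks: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order and return the first stage that fails, or None."""
    for stage, check in checks:
        if not check():
            return stage
    return None


# Placeholder checks; real ones would probe the host, containerd config,
# node resources, and the Pod spec.
checks = [
    ("host", lambda: True),
    ("runtime", lambda: True),
    ("k3s containerd", lambda: False),  # pretend the handler is not registered
    ("device plugin", lambda: True),
    ("pod spec", lambda: True),
]

blocked_at = first_failing_stage(checks)
```

Stopping at the first failure matters because later stages depend on earlier ones: a device plugin result is hard to interpret while the runtime handler is still missing.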