How to manage GPU instances using Karpenter and Bottlerocket

Run GPU workloads on EKS with Karpenter and Bottlerocket by configuring provisioners, node templates, and the NVIDIA device plugin.

Eduardo Lugo

AI tools are everywhere now, and to run with acceptable performance they need the power of GPU-enabled instances. In this article I’ll explain how we achieve this.

The Omniarcs team is building and supporting The little AI assistant that could - Locus Extension. To see the setup described in this post running in production, please download and try Locus!


What is Karpenter?

“Karpenter is an open-source node provisioning project built for Kubernetes. Adding Karpenter to a Kubernetes cluster can dramatically improve the efficiency and cost of running workloads on that cluster.”

In short, the idea of Karpenter is that instead of maintaining a node group for your Kubernetes cluster, you let Karpenter manage the nodes: you tell it which instance types and resources it may use, and it scales nodes up and down depending on the requirements of your pods.

Here is a Terraform example of setting up an EKS cluster with Karpenter enabled.

The interesting part of Karpenter is the provisioners; we’ll get into those later.

What is Bottlerocket?

Bottlerocket is a Linux-based open-source operating system that is purpose-built by Amazon Web Services for running containers.

We use Bottlerocket as the AMI family for our Karpenter provisioners. You can read about it here, but basically it is the AMI that Karpenter uses to create the instances your pods will run on.

How to enable GPU?

You have some options here: you can use the NVIDIA GPU Operator or the NVIDIA device plugin. For now we have chosen the plugin.

Putting everything together

Now let’s get into the details of each bit.

Karpenter provisioner

This is the blueprint of what your instances will look like: here we can define instance size, CPU, family, generation, architecture, etc. You can be as specific or as generic as you like, but check the Karpenter docs to make sure your requirements do not contradict each other, like an architecture that is not available on a specific instance type (it happens!).

resource "kubectl_manifest" "karpenter_provisioner_models" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: models
    spec:
      labels:
        compute: models
        gpu-type: nvidia
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["g"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4"]
        - key: "karpenter.k8s.aws/instance-size"
          operator: In
          values: ["2xlarge","xlarge"]
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["g4dn"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: In
          values: ["4"]
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ${jsonencode(local.azs)}
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
          operator: In
          values: ["spot", "on-demand"]
      kubeletConfiguration:
        containerRuntime: containerd
        maxPods: 110
      limits:
        resources:
          cpu: 1000
      consolidation:
        enabled: true
      providerRef:
        name: models
      ttlSecondsUntilExpired: 604800 # 7 Days = 7 * 24 * 60 * 60 Seconds
  YAML

  depends_on = [
    module.eks_blueprints_kubernetes_addons
  ]
}
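To make the point about contradicting requirements concrete, here is a minimal Python sketch of how a set of In requirements narrows down the candidate instance types. It is not how Karpenter is implemented; the catalog is a small hand-typed table (an assumption for illustration, not fetched from AWS), and the short key names stand in for the full label keys used in the provisioner.

```python
# Illustrative only: a hand-typed catalog of a few instance types and the
# values their Karpenter labels would carry. Not fetched from AWS.
CATALOG = {
    "g4dn.xlarge": {"category": "g", "family": "g4dn", "generation": "4",
                    "size": "xlarge", "cpu": "4", "arch": "amd64"},
    "g4dn.2xlarge": {"category": "g", "family": "g4dn", "generation": "4",
                     "size": "2xlarge", "cpu": "8", "arch": "amd64"},
    "g4dn.4xlarge": {"category": "g", "family": "g4dn", "generation": "4",
                     "size": "4xlarge", "cpu": "16", "arch": "amd64"},
    "g5.xlarge": {"category": "g", "family": "g5", "generation": "5",
                  "size": "xlarge", "cpu": "4", "arch": "amd64"},
}

# The In requirements from the "models" provisioner, keyed by the short
# names used in CATALOG (hypothetical shorthand for the full label keys).
REQUIREMENTS = {
    "category": {"g"},
    "cpu": {"4"},
    "size": {"2xlarge", "xlarge"},
    "family": {"g4dn"},
    "generation": {"4"},
    "arch": {"amd64"},
}


def matching_types(catalog, requirements):
    """Return the instance types that satisfy every requirement (logical AND)."""
    return sorted(
        name
        for name, labels in catalog.items()
        if all(labels.get(key) in allowed for key, allowed in requirements.items())
    )


# Only types that pass all requirements at once remain.
print(matching_types(CATALOG, REQUIREMENTS))

# A contradictory set: g4dn is x86-only, so asking for arm64 leaves nothing.
print(matching_types(CATALOG, {**REQUIREMENTS, "arch": {"arm64"}}))
```

With the values above only g4dn.xlarge survives: g4dn.2xlarge has 8 vCPUs, so the instance-cpu value of 4 filters it out even though its size is listed. An empty result, as in the second call, is the kind of self-contradicting provisioner to watch out for.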

Another handy resource is the node template. It is most likely that you are using Python and that your Docker image is huge, so you’ll need to add more disk to the instance. This is how you do that:

resource "kubectl_manifest" "karpenter_provisioner_models_node_template" {
  yaml_body = <<-YAML
    apiVersion: karpenter.k8s.aws/v1alpha1
    kind: AWSNodeTemplate
    metadata:
      name: models
    spec:
      amiFamily: Bottlerocket
      subnetSelector:
        karpenter.sh/discovery: "${module.eks.cluster_name}"
      securityGroupSelector:
        karpenter.sh/discovery: "${module.eks.cluster_name}"
      blockDeviceMappings:
        # On Bottlerocket, /dev/xvdb is the data volume that holds container images
        - deviceName: /dev/xvdb
          ebs:
            volumeType: gp3
            volumeSize: 60Gi
            deleteOnTermination: true
  YAML

  depends_on = [
    module.eks_blueprints_kubernetes_addons
  ]
}

Stay tuned for other articles on how we moved a lot of our code, where appropriate, from Python to Rust to make our containers a lot smaller and faster.

Enabling GPU

If you are using the EKS Blueprints, you’ll need to add this:

enable_nvidia_device_plugin = true
nvidia_device_plugin_helm_config = {
  values = [
    <<-EOT
      nodeSelector:
        gpu-type: nvidia
    EOT
  ]
}

This tells the plugin to run only on instances that carry the gpu-type: nvidia label, which is the label the provisioner above sets on its nodes.
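As a rough illustration of what that nodeSelector does, the following Python sketch mimics the matching rule: a node qualifies only if it carries every key/value pair in the selector. The label sets are assumptions for the example (the GPU node mirrors the labels set by the provisioner, the other node is hypothetical), and the function is a simplification, not the Kubernetes scheduler itself.

```python
def node_selector_matches(node_labels, node_selector):
    """True when every key/value pair in the selector is present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# The selector passed to the device plugin through the Helm values.
PLUGIN_NODE_SELECTOR = {"gpu-type": "nvidia"}

# Labels as the "models" provisioner would set them on a GPU node.
gpu_node = {"compute": "models", "gpu-type": "nvidia", "kubernetes.io/arch": "amd64"}

# Hypothetical node launched by some other provisioner, without the GPU label.
other_node = {"kubernetes.io/arch": "amd64"}

print(node_selector_matches(gpu_node, PLUGIN_NODE_SELECTOR))    # True
print(node_selector_matches(other_node, PLUGIN_NODE_SELECTOR))  # False
```

This is why the plugin pods only appear once Karpenter has launched a node from the GPU provisioner.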

Making sure it all works

At this point your cluster is running: you have a bunch of pods in kube-system and the Karpenter pods in the karpenter namespace, but there are no instances and no plugin pods yet.

If you are using the blueprints, your setup uses Fargate for the base cluster and instances for the other pods. Depending on your needs, this may or may not be a good idea.

Let’s launch a pod that helps us check that everything is OK, using the following one-liner:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

We are launching a container that will help us debug whether the instance has the GPU enabled or not.

It is very important to add the GPU limit to the container section of the pod or deployment. That limit is what makes the pod request a GPU in the first place.
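Since a forgotten limit is easy to miss, a small pre-flight check can help. The sketch below is only an illustration: it takes a pod manifest that has already been loaded into a Python dict (how you load it is up to you) and lists the containers without an nvidia.com/gpu limit. The function and helper names are made up for this example.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def containers_missing_gpu_limit(pod):
    """Return the names of containers in a pod manifest (dict) with no GPU limit."""
    missing = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        if GPU_RESOURCE not in limits:
            missing.append(container["name"])
    return missing


def make_pod(limits):
    """Build a dict shaped like the gpu-pod manifest above, with the given limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "gpu-pod"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda-container",
                    "image": "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2",
                    "resources": {"limits": limits},
                }
            ],
        },
    }


gpu_pod = make_pod({"nvidia.com/gpu": 1})
no_limit_pod = make_pod({})

print(containers_missing_gpu_limit(gpu_pod))       # []
print(containers_missing_gpu_limit(no_limit_pod))  # ['cuda-container']
```

A check like this could run in CI against rendered manifests, but that is a design suggestion rather than part of our setup.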

Now let’s see the output:

kubectl describe pods gpu-pod

Your logs should look something like this:

➜  ~ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

You can also check the logs of the NVIDIA device plugin pod, which should look like this:

➜  ~ kubectl logs nvidia-device-plugin-m92hq -n nvidia-device-plugin
2023/08/03 14:37:13 Starting FS watcher.
2023/08/03 14:37:13 Starting OS watcher.
2023/08/03 14:37:13 Starting Plugins.
2023/08/03 14:37:13 Loading configuration.
2023/08/03 14:37:13 Initializing NVML.
2023/08/03 14:37:13 Updating config with default resource matching patterns.
2023/08/03 14:37:13
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/08/03 14:37:13 Retreiving plugins.
2023/08/03 14:37:13 Starting GRPC server for 'nvidia.com/gpu'
2023/08/03 14:37:13 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/08/03 14:37:13 Registered device plugin for 'nvidia.com/gpu' with Kubelet

In the Karpenter pod logs you should also see the instance being provisioned and, after the job completes, deprovisioned:

2023-08-03T14:35:42.015Z INFO controller.provisioner launching machine with 1 pods requesting {"cpu":"155m","memory":"120Mi","nvidia.com/gpu":"1","pods":"6"} from types g4dn.xlarge {"commit": "5a2fe84-dirty", "provisioner": "models"}
...
2023-08-03T14:35:44.531Z INFO controller.provisioner.cloudprovider launched instance {"commit": "5a2fe84-dirty", "provisioner": "models", "id": "i-XXXXXXXXX", "hostname": "ip-172-31-XX-X.us-east-X.compute.internal", "instance-type": "g4dn.xlarge", "zone": "us-east-Xb", "capacity-type": "spot", "capacity": {"cpu":"4","ephemeral-storage":"60Gi","memory":"15155Mi","nvidia.com/gpu":"1","pods":"110"}}
...
2023-08-03T14:37:18.043Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-172-31-16-11.us-east-2.compute.internal/g4dn.xlarge/spot {"commit": "5a2fe84-dirty"}
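If you want to pull the interesting fields out of those log lines, a few lines of Python are enough. This is a minimal sketch that assumes the line shape shown in the sample above (a "launched instance" message followed by a single JSON object); log formats differ between Karpenter versions, so treat the parsing rule and the function name as assumptions.

```python
import json


def parse_launch(line):
    """Extract instance details from a 'launched instance' log line, else None."""
    marker = "launched instance "
    if marker not in line:
        return None
    # Everything after the marker is expected to be one JSON object.
    payload = json.loads(line.split(marker, 1)[1])
    return {
        "instance_type": payload["instance-type"],
        "capacity_type": payload["capacity-type"],
        "gpus": int(payload["capacity"].get("nvidia.com/gpu", "0")),
    }


# The "launched instance" line from the sample logs above.
SAMPLE = (
    '2023-08-03T14:35:44.531Z INFO controller.provisioner.cloudprovider '
    'launched instance {"commit": "5a2fe84-dirty", "provisioner": "models", '
    '"id": "i-XXXXXXXXX", "hostname": "ip-172-31-XX-X.us-east-X.compute.internal", '
    '"instance-type": "g4dn.xlarge", "zone": "us-east-Xb", "capacity-type": "spot", '
    '"capacity": {"cpu":"4","ephemeral-storage":"60Gi","memory":"15155Mi",'
    '"nvidia.com/gpu":"1","pods":"110"}}'
)

print(parse_launch(SAMPLE))
# {'instance_type': 'g4dn.xlarge', 'capacity_type': 'spot', 'gpus': 1}
```

Here the launched node is a spot g4dn.xlarge advertising one GPU, which is what the test pod asked for.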

You should now have a cluster that can:

  • Run applications that require GPU compute.
  • Launch spot or on-demand instances and terminate them once they are no longer needed, based on the provisioner configuration.

Conclusion

GPU-enabled clusters will be required for a lot of the solutions being built today. There are a ton of options for how to do this; hopefully this post points you in the right direction.

Even better, Karpenter offers a way to add flexibility without losing control of the resources and spend associated with a Kubernetes cluster with these requirements.

We will probably see easier ways to accomplish this in the future!
