How to manage GPU instances using Karpenter and Bottlerocket


Eduardo Lugo

AI tools are everywhere now, and for them to do their thing with acceptable performance they need the power of GPU-enabled instances. In this article I'll tell you how we are achieving this.

The Omniarcs team is building and supporting the little AI assistant that could: the Locus Extension. To see how the setup described in this post runs in production, please download and try Locus!


What is Karpenter?

“Karpenter is an open-source node provisioning project built for Kubernetes. Adding Karpenter to a Kubernetes cluster can dramatically improve the efficiency and cost of running workloads on that cluster”

The idea of Karpenter, in short, is that instead of managing a node group for your Kubernetes cluster yourself, you let Karpenter handle it: you give it options for which instance types and resources it can use, and it scales nodes up and down depending on the requirements of your pods.

Here is a Terraform example of setting up an EKS cluster with Karpenter enabled.
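A minimal sketch of what that can look like, assuming the terraform-aws-modules/eks module together with the EKS Blueprints kubernetes-addons module; the module version, cluster name and VPC references below are illustrative and may differ in your setup.

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "models-cluster" # illustrative name
  cluster_version = "1.27"

  vpc_id     = module.vpc.vpc_id          # assumes a separate VPC module
  subnet_ids = module.vpc.private_subnets

  # Run the base cluster components on Fargate, like the blueprints examples do
  fargate_profiles = {
    karpenter = {
      selectors = [{ namespace = "karpenter" }]
    }
    kube_system = {
      selectors = [{ namespace = "kube-system" }]
    }
  }

  # Karpenter discovers security groups (and subnets) by this tag,
  # so make sure your private subnets carry it too
  node_security_group_tags = {
    "karpenter.sh/discovery" = "models-cluster"
  }
}

module "eks_blueprints_kubernetes_addons" {
  source = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/kubernetes-addons"

  eks_cluster_id = module.eks.cluster_name
  # Depending on the module version you may also need to pass the
  # cluster endpoint, cluster version and OIDC provider outputs from module.eks

  # Installs the Karpenter controller via its Helm chart
  enable_karpenter = true
}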

The interesting part of Karpenter is the provisioner; we'll get into that later.

What is Bottlerocket?

Bottlerocket is a Linux-based open-source operating system that is purpose-built by Amazon Web Services for running containers.

We use Bottlerocket as the AMI family for our Karpenter provisioners (you can read about it here); basically, it is the AMI that Karpenter will use to create the instances the pods run on.

How to enable GPUs?

You have some options here: you can use the NVIDIA GPU Operator or the NVIDIA device plugin. For now, we have chosen the plugin.

Putting everything together

Now let's get into the details of each bit.

Karpenter provisioner

This is the blueprint of what your instances will look like, so we can define instance size, CPU, family, generation, architecture, etc. You can be as specific or as generic as you like; just make sure you use the Karpenter docs so that your requirements don't contradict each other, like an architecture that is not available on a specific instance type (it happens!).

resource "kubectl_manifest" "karpenter_provisioner_models" {
  yaml_body = <<-YAML
    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: models
    spec:
      labels:
        compute: models
        gpu-type: nvidia
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["g"]
        - key: "karpenter.k8s.aws/instance-cpu"
          operator: In
          values: ["4"]
        - key: "karpenter.k8s.aws/instance-size"
          operator: In
          values: ["2xlarge","xlarge"]
        - key: "karpenter.k8s.aws/instance-family"
          operator: In
          values: ["g4dn"]
        - key: "karpenter.k8s.aws/instance-generation"
          operator: In
          values: ["4"]
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ${jsonencode(local.azs)}
        - key: "kubernetes.io/arch"
          operator: In
          values: ["amd64"]
        - key: "karpenter.sh/capacity-type" # If not included, the webhook for the AWS cloud provider will default to on-demand
          operator: In
          values: ["spot", "on-demand"]
      kubeletConfiguration:
        containerRuntime: containerd
        maxPods: 110
      limits:
        resources:
          cpu: 1000
      consolidation:
        enabled: true
      providerRef:
        name: models
      ttlSecondsUntilExpired: 604800 # 7 Days = 7 * 24 * 60 * 60 Seconds
  YAML

  depends_on = [
    module.eks_blueprints_kubernetes_addons
  ]
}

Another handy resource is the node template. It is very likely that you are using Python and that your Docker image is huge, so you'll need to add more disk to the instance; this is how you do that:

resource "kubectl_manifest" "karpenter_provisioner_models_node_template" {
  yaml_body = <<-YAML
    apiVersion: karpenter.k8s.aws/v1alpha1
    kind: AWSNodeTemplate
    metadata:
      name: models
    spec:
      amiFamily: Bottlerocket
      subnetSelector:
        karpenter.sh/discovery: "${module.eks.cluster_name}"
      securityGroupSelector:
        karpenter.sh/discovery: "${module.eks.cluster_name}"
      blockDeviceMappings:
        - deviceName: /dev/xvdb
          ebs:
            volumeType: gp3
            volumeSize: 60Gi
            deleteOnTermination: true
  YAML

  depends_on = [
    module.eks_blueprints_kubernetes_addons
  ]
}

Stay tuned for other articles we'll put out on how we moved a lot of our code, where appropriate, from Python to Rust to make our containers a lot smaller and faster.

Enabling GPU

If you are using the EKS Blueprints, you'll need to add this to the add-ons module:

enable_nvidia_device_plugin = true
nvidia_device_plugin_helm_config = {
  values = [
    <<-EOT
      nodeSelector:
        gpu-type: nvidia
    EOT
  ]
}

This tells the device plugin to run only on the instances carrying the gpu-type: nvidia label that we set in the provisioner.
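
If you are not using the blueprints, a rough equivalent with the upstream Helm chart might look like this (the release name is just an example; the namespace matches the one used later in this post):

# Manual install of the NVIDIA device plugin, pinned to the same label
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set nodeSelector.gpu-type=nvidia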

Making sure it all works

At this point your cluster is running: you have a bunch of pods in kube-system and Karpenter pods in the karpenter namespace, but there are no GPU instances and no device plugin pods yet.

If you are using the blueprints, your setup runs the base cluster on Fargate and uses EC2 instances for the other pods; depending on your needs this might or might not be a good idea.
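
A quick way to confirm that starting state (the namespaces assume the blueprints defaults used in this post):

# Karpenter should be up, but no GPU nodes or device plugin pods yet
kubectl get pods -n karpenter
kubectl get nodes
kubectl get pods -n nvidia-device-plugin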

Now let's launch a pod that helps us check everything is OK, with the following one-liner:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

We are launching a container that will help us check whether the instance has the GPU enabled or not.

It is very important to add the GPU limit in the container section of your pod or deployment: the nvidia.com/gpu request is what tells Karpenter a GPU node is needed and what makes the device plugin expose a GPU to the container. For a real workload it looks something like the sketch below.
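
A minimal sketch of a Deployment (name and image are placeholders) that targets the provisioner labels and requests one GPU:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      nodeSelector:
        compute: models  # labels applied by the "models" provisioner
        gpu-type: nvidia
      containers:
        - name: model
          image: my-registry/my-model:latest # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1 # requesting 1 GPU

The nodeSelector pins the workload to nodes created by the models provisioner, and the GPU limit makes the scheduler and Karpenter treat it as a GPU workload.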

Now, back to our test pod, let's see the output.

kubectl describe pod gpu-pod

The describe output shows the scheduling events, and the pod logs should look something like this:

➜  ~ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

You can also check the logs of the nvidia-device-plugin pod; they should look like this:

 ~ kubectl logs nvidia-device-plugin-m92hq -n nvidia-device-plugin
2023/08/03 14:37:13 Starting FS watcher.
2023/08/03 14:37:13 Starting OS watcher.
2023/08/03 14:37:13 Starting Plugins.
2023/08/03 14:37:13 Loading configuration.
2023/08/03 14:37:13 Initializing NVML.
2023/08/03 14:37:13 Updating config with default resource matching patterns.
2023/08/03 14:37:13
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/08/03 14:37:13 Retreiving plugins.
2023/08/03 14:37:13 Starting GRPC server for 'nvidia.com/gpu'
2023/08/03 14:37:13 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/08/03 14:37:13 Registered device plugin for 'nvidia.com/gpu' with Kubelet

In the Karpenter pod logs you should also see the instance being provisioned and, after the job completes, deprovisioned:

2023-08-03T14:35:42.015Z INFO controller.provisioner launching machine with 1 pods requesting {"cpu":"155m","memory":"120Mi","nvidia.com/gpu":"1","pods":"6"} from types g4dn.xlarge {"commit": "5a2fe84-dirty", "provisioner": "models"}
...
2023-08-03T14:35:44.531Z INFO controller.provisioner.cloudprovider launched instance {"commit": "5a2fe84-dirty", "provisioner": "models", "id": "i-XXXXXXXXX", "hostname": "ip-172-31-XX-X.us-east-X.compute.internal", "instance-type": "g4dn.xlarge", "zone": "us-east-Xb", "capacity-type": "spot", "capacity": {"cpu":"4","ephemeral-storage":"60Gi","memory":"15155Mi","nvidia.com/gpu":"1","pods":"110"}}
...
2023-08-03T14:37:18.043Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-172-31-16-11.us-east-2.compute.internal/g4dn.xlarge/spot {"commit": "5a2fe84-dirty"}

You should now have a cluster configured so that it can:

  • Run applications that require GPU compute
  • Launch spot or on-demand instances and terminate them once they are no longer needed, based on the provisioner configuration

Conclusion

GPU-enabled clusters will be required for a lot of the solutions being built right now. There are a ton of options for how to do this; hopefully this points you in the right direction.

What's more, Karpenter offers a way to add flexibility without losing control of the resources and spend associated with a Kubernetes cluster that has these requirements.

We will probably see easier ways to accomplish this in the future!
