Kubernetes | Cloud Native Infrastructure | DevOps Engineering | June 19, 2026 | 4 min read

Implementing Kubernetes Node Auto-Provisioning: Karpenter and Bottlerocket

Master Kubernetes node provisioning with Karpenter and Bottlerocket. Learn to optimize your cloud native infrastructure for speed, cost, and security.

Tags: Kubernetes, Karpenter, Bottlerocket, AWS, EKS, Infrastructure as Code, DevOps

During a recent sprint, our EKS cluster hit a wall. We were running a large batch processing job that required spinning up 400 pods, but our standard Cluster Autoscaler was lagging behind by roughly 8 minutes. By the time the nodes actually joined the cluster, the job had already timed out, costing us around $1,200 in wasted compute and a missed SLA. We realized the traditional autoscaler was too slow at reconciling node groups, so we moved to Karpenter to handle our dynamic workloads (for a comparison of the two, see Kubernetes Autoscaling: Karpenter vs Cluster Autoscaler Guide).

Streamlining Kubernetes Node Provisioning

The move to Karpenter wasn't just about speed; it was about granular control. Karpenter doesn't rely on pre-defined node groups. Instead, it observes the aggregate resource requests of unschedulable pods, picks instance types that fit them, and calls the EC2 Fleet API directly. To secure the underlying OS, we paired it with Bottlerocket, an AWS-provided, Linux-based OS purpose-built for hosting containers.
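To make "aggregate resource requests" concrete, here is a minimal sketch of a batch Job like the one described above. The Job name, image, and request sizes are hypothetical placeholders, not our real workload; only the 400-pod count comes from the incident.

```yaml
# Hypothetical batch Job; name, image, and request sizes are illustrative only
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-example
spec:
  parallelism: 400        # 400 pods running at once
  completions: 400
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/batch-worker:latest  # placeholder image
          resources:
            requests:
              cpu: "1"      # per pod
              memory: 2Gi   # per pod
```

With these assumed requests, the pending pods add up to 400 vCPU and 800Gi of memory. Karpenter sizes nodes against that total rather than adding one fixed-size node-group instance at a time.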

We first attempted to use standard Amazon Linux 2 AMIs with Karpenter. That approach fell apart because our security team required strict CIS benchmarks, and maintaining that hardening across thousands of ephemeral nodes turned into a configuration drift nightmare. Switching to Bottlerocket simplified this: its root filesystem is read-only and it has no traditional package manager, which pushed us to enforce security at the cluster level with policies instead (see Kubernetes Security: Implementing Zero-Trust with Kyverno and Policies).

The Configuration Setup

To get started, you need to define an EC2NodeClass and a NodePool. Here is how we configured our initial setup:


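```yaml
# EC2NodeClass: AWS-specific settings for the nodes Karpenter launches
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  role: "KarpenterNodeRole"  # IAM role assumed by the nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  # v1beta1 also requires security group selection; this assumes the
  # security groups carry the same discovery tag as the subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
---
# NodePool: scheduling constraints for nodes built from the class above
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default  # refers to the EC2NodeClass above
      requirements:
        # Allow both Spot and On-Demand capacity
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
```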

Once we applied these manifests, the difference was immediate. Pods that previously waited roughly 8 minutes for capacity were now scheduling in under 90 seconds.

Why Bottlerocket Matters

Bottlerocket matters for cloud native infrastructure because it is stripped down to what is needed to run containers, so the attack surface is significantly smaller than that of a general-purpose distribution. When we need to update, we don't patch nodes in place; we roll out nodes on a new OS version and drain the old ones. If you're looking for further isolation, you can pair this with sandboxed runtimes (see Kubernetes Security: Hardening Runtimes with gVisor and Kata) to limit what a compromised container can reach on the host.

We encountered one significant hurdle during implementation: our logging agent (Fluent Bit) required specific kernel parameters that Bottlerocket doesn't set by default. We had to supply custom user data to inject these settings during the node bootstrap phase. Note that Bottlerocket user data is a TOML settings document rather than a shell script. It wasn't the clean "plug-and-play" experience I expected, but it forced us to be more deliberate about our node configurations.
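As a rough sketch of what that looks like, the EC2NodeClass accepts a userData field, and for the Bottlerocket AMI family Karpenter merges it with the settings it generates itself. The two sysctl keys below are assumptions chosen for illustration (inotify limits are a common requirement for log tailers); they are not the exact parameters we set.

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Bottlerocket
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"
  # Bottlerocket user data is TOML, embedded here as a YAML block scalar
  userData: |
    # Hypothetical kernel parameters; replace with what your agent needs
    [settings.kernel.sysctl]
    "fs.inotify.max_user_instances" = "8192"
    "fs.inotify.max_user_watches" = "524288"
```

Keeping the settings in the node class means every node Karpenter launches from it boots with the same kernel parameters, which avoids the drift problem that pushed us away from Amazon Linux 2.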

Frequently Asked Questions

Q: Is Karpenter compatible with standard EKS managed node groups? A: Yes, you can run both simultaneously. We use managed node groups for our core cluster services (on EKS the control plane itself is run by AWS, not on our nodes), and Karpenter for our bursty, ephemeral workloads.

Q: Does Bottlerocket require special management tools? A: Not strictly. It supports standard Kubernetes APIs, but you should use the Bottlerocket API for host-level tasks if you really need to drill down into the node.

Q: What happens if Karpenter fails to provision a node? A: The affected pods stay Pending, and Karpenter records the error in the controller pod's logs. We've set up Prometheus alerts to notify us if the karpenter_provisioner_scheduling_duration_seconds metric exceeds our internal threshold, which usually happens when we hit AWS account service quotas.
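For reference, an alert along those lines could be sketched as the Prometheus rule file below. The metric is a histogram, so the rule queries its _bucket series. The alert name, the 95th percentile, the 30-second threshold, and the 10-minute hold are all assumptions for illustration, not our actual values.

```yaml
groups:
  - name: karpenter
    rules:
      - alert: KarpenterSchedulingSlow   # hypothetical alert name
        # p95 scheduling duration over the last 5 minutes, in seconds
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
            )
          ) > 30
        for: 10m   # must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Karpenter scheduling is slow; check controller logs and AWS service quotas"
```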

The Reality of Production

I'm still not entirely convinced that our current TTL (time-to-live) settings for nodes are optimal. We currently terminate underutilized nodes after 30 minutes, but I suspect we might be incurring unnecessary churn during periods of low traffic. Next time, I'd like to experiment with a more aggressive consolidation policy, but for now, we're focusing on stability. If you're managing complex StatefulSets, remember that Karpenter doesn't magically solve data persistence issues. You'll still need a robust backup solution (see Kubernetes Backup Strategies: Implementing Velero and MinIO) to snapshot your volumes before the nodes disappear.
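For readers wondering where such a setting lives, here is one way a 30-minute rule could be expressed in the v1beta1 NodePool's disruption block. This is a sketch under assumptions, not our exact manifest: in v1beta1, consolidateAfter is only accepted together with the WhenEmpty policy, so the timed variant applies to empty nodes.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    # Conservative: remove a node once it has been empty for 30 minutes
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    # More aggressive alternative (set instead of the two lines above;
    # v1beta1 does not allow consolidateAfter with this policy):
    # consolidationPolicy: WhenUnderutilized
```

The trade-off is the one described above: WhenUnderutilized lets Karpenter repack pods onto fewer or cheaper nodes sooner, at the cost of more pod evictions.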

Moving our Kubernetes autoscaling to Karpenter and Bottlerocket has fundamentally changed how we view our cloud bill. We no longer over-provision for peak capacity; we provision for reality. It's a tighter loop, a smaller footprint, and frankly, a lot less headache during on-call rotations.
