Why I Started Stopping the Dev Cluster at Night
Our development EKS cluster was running 24/7. But nobody touches a dev environment at 2 AM, and nobody is working on it over the weekend. Yet the node groups kept running the whole time, quietly burning EC2 spend for nothing.
The math is hard to ignore. There are 168 hours in a week, but a dev cluster realistically only needs to be available during business hours - roughly 50 hours a week (10 hours a day, 5 days a week). That means for about 70% of the week we were paying for compute that no one used.
So I built a simple scheduled job: scale the node groups (and the workloads on them) down to zero at 8 PM, and scale them back up at 8 AM on weekdays. The control plane stays up - AWS charges a flat hourly rate for that regardless - but the expensive part, the worker nodes, goes to zero.
💰 The Idea in One Line
Before: node groups running ~168 hours/week.
After: node groups running ~60 hours/week on the 8 AM to 8 PM weekday schedule (closer to ~50 if the window is trimmed to a 10-hour day).
Result: roughly 60-70% lower worker-node compute cost on the dev account, with no impact on developers during the day.
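The savings range is easy to sanity-check. Here is a minimal sketch of the arithmetic, assuming the 8 AM to 8 PM weekday window described above and comparing it with the ~50-hour estimate. The function names are mine and purely illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_uptime_hours(start_hour: int, stop_hour: int, days_per_week: int) -> int:
    """Hours per week the node groups run for a fixed daily window."""
    return (stop_hour - start_hour) * days_per_week


def idle_fraction(uptime_hours: float) -> float:
    """Share of the week during which worker nodes are scaled to zero."""
    return 1 - uptime_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    scheduled = weekly_uptime_hours(start_hour=8, stop_hour=20, days_per_week=5)
    print(f"8 AM-8 PM weekdays: {scheduled} h up, {idle_fraction(scheduled):.0%} idle")
    print(f"10-hour weekdays:   50 h up, {idle_fraction(50):.0%} idle")
```

The 12-hour window works out to 60 hours up and about 64% idle; a 10-hour day gives 50 hours up and about 70% idle. Both land inside the 60-70% range, and the figure only covers worker-node compute, since the control plane keeps billing either way.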
Then the Cluster Woke Up Angry
The automation worked perfectly going down. The problem showed up going up. The first morning the cluster scaled back from zero, a wave of pods got stuck in ContainerCreating, and the events were full of the same message:
Failed to pull image "xxxxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest":
rpc error: code = Unknown desc = failed to pull and unpack image:
failed to copy: httpReadSeeker: failed open:
unexpected status code ... 429 Too Many Requests
Warning Failed kubelet Error: ErrImagePull
Warning Failed kubelet ... QPS limit exceeded / Rate exceeded
At first I blamed the start/stop script. But the script had done its job - the nodes were healthy and pods were being scheduled. The real culprit was timing.
Root Cause: Everything Pulls at Once
When a cluster runs normally, pods start at different times - a deploy here, a restart there. Image pulls are naturally spread out, so nobody notices any limits.
But when you bring a cluster back from zero, that smooth spread collapses into a single spike:
- All node groups scale up together, so a batch of fresh nodes joins within the same minute.
- Every node starts with an empty image cache - nothing is pre-pulled because the nodes are brand new.
- The scheduler places every pending pod from every namespace as fast as it can.
- So every kubelet, on every node, fires off image pull requests to the registry at the same second.
This is a classic thundering herd. Two different rate limits get hit at once:
- Registry-side throttling (ECR): Amazon ECR enforces request rate limits on the API calls used during a pull (GetDownloadUrlForLayer, BatchGetImage, GetAuthorizationToken). Hundreds of simultaneous pulls blow past those limits and ECR returns 429 Too Many Requests / "Rate exceeded".
- Node-side throttling (kubelet): Each kubelet also rate-limits how fast it talks to the registry via registryPullQPS and registryBurst. Under a flood, pulls beyond that budget are throttled on the node itself, which surfaces as "QPS limit exceeded".
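To see why timing alone is enough to trigger the node-side limit, here is a simplified model. It assumes the kubelet limiter behaves like a token bucket that starts with registryBurst tokens and refills at registryPullQPS tokens per second; the class and function names are hypothetical, and ECR's own per-API limits are not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Token bucket: refills at `qps` tokens/second, holds at most `burst`."""
    qps: float
    burst: int
    tokens: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # the bucket starts full

    def try_accept(self, now: float) -> bool:
        """Take one token if available; otherwise reject immediately."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def count_rejected(arrival_times: list[float], qps: float = 5, burst: int = 10) -> int:
    """Number of pull requests rejected for the given arrival times (seconds)."""
    bucket = TokenBucket(qps=qps, burst=burst)
    return sum(not bucket.try_accept(t) for t in sorted(arrival_times))


if __name__ == "__main__":
    herd = [0.0] * 30                      # 30 pulls in the same instant
    spread = [i * 0.2 for i in range(30)]  # the same 30 pulls, one every 200 ms
    print("rejected (all at once):", count_rejected(herd))
    print("rejected (staggered):  ", count_rejected(spread))
```

In this model the same 30 pulls are partly rejected when they arrive together and fully accepted when spread over a few seconds. The workload is identical; only the arrival pattern changes.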
How I Fixed It (4 Steps)
1. Stagger the Scale-Up Instead of Big-Banging It
The single most effective fix was to stop bringing everything back at once. Instead of scaling all node groups to full size in one shot, I brought capacity back in phases - a few nodes first, wait a couple of minutes for their images to land, then the rest.
On the workload side, the same idea applies: don't un-pause every deployment simultaneously. Restoring critical namespaces first and the rest a few minutes later spreads the pulls over time and keeps each burst under the registry's limit.
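The phased approach can be sketched in a few lines. This is not my actual start script, just an illustration of the idea: `plan_waves` and `run_waves` are hypothetical helpers, and `scale` is a callback you would supply yourself (for example, a thin wrapper around the EKS UpdateNodegroupConfig API). The two-minute pause is an assumed value to tune per cluster.

```python
import time
from typing import Callable


def plan_waves(targets: dict[str, int], first_wave_size: int) -> list[dict[str, int]]:
    """Return the desired size of every node group for each phase.

    Phase 1 brings each group up to at most `first_wave_size` nodes;
    phase 2 (only if needed) restores the full target sizes.
    """
    first = {name: min(size, first_wave_size) for name, size in targets.items()}
    waves = [first]
    if first != targets:
        waves.append(dict(targets))
    return waves


def run_waves(
    waves: list[dict[str, int]],
    scale: Callable[[str, int], None],
    pause_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Apply each wave in order, pausing between waves so image pulls can land."""
    for index, wave in enumerate(waves):
        if index > 0:
            sleep(pause_seconds)
        for name, size in wave.items():
            scale(name, size)


if __name__ == "__main__":
    # Hypothetical node group names and sizes; printing stands in for a real API call.
    waves = plan_waves({"general": 6, "batch": 2}, first_wave_size=2)
    run_waves(
        waves,
        scale=lambda name, size: print(f"scale {name} -> {size}"),
        sleep=lambda seconds: print(f"wait {seconds:.0f}s"),
    )
```

The same shape works on the workload side: replace node group names with namespaces and let `scale` restore the replica counts of one batch of deployments at a time.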
2. Tune the kubelet Pull Limits
The kubelet defaults are conservative (registryPullQPS: 5, registryBurst: 10, and serialized pulls, meaning one image at a time per node). For a cluster that intentionally does mass start-ups, I adjusted these so a node can pull a bit more aggressively and fetch several images in parallel:
# kubelet config (via EKS launch template / nodeadm / custom AMI)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false # allow parallel pulls per node
registryPullQPS: 10 # default is 5
registryBurst: 20 # default is 10
A word of caution: turning these up on the node while the registry is the bottleneck can make ECR throttling worse, not better. So I paired this with step 3 - reducing how often we hit ECR at all.
3. Put a Pull-Through Cache in Front of ECR
I set up an ECR pull-through cache and made sure the cluster reaches ECR over a VPC interface endpoint (plus the S3 gateway endpoint, since image layers live in S3). This does two things: it keeps pull traffic inside the VPC instead of going out over a NAT gateway, and the cache means repeated pulls of the same image hit a warm copy instead of re-fetching from the upstream registry every time.
This is especially valuable for public images (e.g. Docker Hub), which have their own aggressive anonymous pull-rate limits that a cold cluster start can trip instantly.
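For Docker Hub images, using the cache mostly comes down to rewriting image references so they point at your ECR registry under the cache rule's prefix. The helper below is a hypothetical sketch: the account ID in the example is a placeholder, `docker-hub` is just an assumed rule prefix, and the function only handles Docker Hub style references.

```python
def to_pull_through_ref(image: str, ecr_host: str, prefix: str = "docker-hub") -> str:
    """Rewrite a Docker Hub image reference to go through an ECR pull-through cache.

    Only Docker Hub references are handled; images hosted on other
    registries need their own cache rule and prefix.
    """
    # Drop an explicit Docker Hub host if one is present.
    for hub_host in ("docker.io/", "index.docker.io/"):
        if image.startswith(hub_host):
            image = image[len(hub_host):]
            break
    # Official images such as "nginx" live under the implicit "library/" namespace.
    repository = image.split(":", 1)[0].split("@", 1)[0]
    if "/" not in repository:
        image = f"library/{image}"
    return f"{ecr_host}/{prefix}/{image}"


if __name__ == "__main__":
    host = "123456789012.dkr.ecr.ap-south-1.amazonaws.com"  # placeholder account ID
    for ref in ("nginx:1.27", "docker.io/bitnami/redis:7.2"):
        print(ref, "->", to_pull_through_ref(ref, host))
```

The first pull of each image still goes to the upstream registry, so the cache helps most from the second morning onward, once the images the cluster actually uses are already sitting in ECR.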
4. Pre-Pull the Hot Images
The deepest fix is to make sure nodes don't start with an empty cache. Two ways to do this:
- Bake images into a custom AMI: pre-pull your most common base and app images into the node image so they're already on disk the moment the node boots.
- Run an image pre-puller DaemonSet: a lightweight DaemonSet that pulls the heavy images onto every node ahead of the real workloads, smoothing the spike.
Either way, fewer cold pulls means a far smaller thundering herd when the cluster comes back to life.
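A pre-puller DaemonSet can be generated rather than hand-written. The sketch below builds a manifest as a plain dictionary and prints it as JSON, which kubectl accepts alongside YAML. Everything here is illustrative: the `image-prepuller` name is made up, the `sh -c "exit 0"` trick assumes each image ships a shell, and the pause image tag is one I picked for the example.

```python
import json


def prepuller_daemonset(images: list[str], name: str = "image-prepuller") -> dict:
    """Build a DaemonSet whose init containers exist only to pull `images`."""
    init_containers = [
        {
            "name": f"pull-{index}",
            "image": image,
            # Exit immediately; pulling the image is the whole point.
            # Assumes the image contains `sh`.
            "command": ["sh", "-c", "exit 0"],
        }
        for index, image in enumerate(images)
    ]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # A tiny long-running container keeps the pod alive afterwards.
                    "containers": [
                        {"name": "pause", "image": "registry.k8s.io/pause:3.9"}
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical image list; substitute the heaviest images in your cluster.
    manifest = prepuller_daemonset(["nginx:1.27", "redis:7.2"])
    print(json.dumps(manifest, indent=2))
```

Init containers run one after another, so each node pulls these images sequentially instead of in a burst. The DaemonSet only smooths the spike if it is restored before the real workloads, for example as part of the first wave of a staggered start-up.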
The Result
- The QPS limit exceeded / 429 errors disappeared on subsequent morning start-ups.
- Pods reached Running faster because pulls were no longer fighting each other.
- We kept the full cost savings of scaling to zero at night - without the painful wake-up.
Frequently Asked Questions
What causes "QPS limit exceeded" when starting an EKS cluster?
It's a concurrency problem. When a scaled-to-zero cluster comes back, every node is fresh with an empty image cache and every pod is scheduled at once, so they all pull images simultaneously. That flood exceeds both Amazon ECR's API rate limits and the kubelet's own pull QPS limits, producing throttling and 429 errors.
Does this mean my images or registry are broken?
No. The images and the registry are fine. The error is purely about too many pulls happening in too short a time window. Spreading those pulls out makes it go away.
How do I reduce image pull throttling on EKS?
Stagger the scale-up so workloads return in phases, tune the kubelet (registryPullQPS, registryBurst, serializeImagePulls), use an ECR pull-through cache with a VPC endpoint, and pre-pull hot images via a custom AMI or a DaemonSet.
Is scaling a dev cluster to zero at night actually worth it?
For non-production environments, almost always. A dev cluster only needs to be up during business hours - roughly 50 of the 168 hours in a week - so scaling worker nodes to zero the rest of the time can cut compute cost by 60-70%. The start-up thundering herd is the main gotcha, and it's entirely solvable.
My Takeaway
Scaling a dev cluster to zero overnight is one of the easiest cost wins in Kubernetes - but "scale to zero" quietly changes the start-up from a trickle into a flood. The "QPS limit exceeded" error was the cluster telling me it was trying to do a whole day's worth of image pulls in one second. Once I stopped big-banging the start-up and gave ECR some breathing room with a cache and pre-pulled images, the mornings got quiet again.
For more detail, see the AWS docs on ECR pull-through cache and the Kubernetes reference for kubelet configuration.