
QPS Limit Exceeded on EKS Start-up: The Image Pull Thundering Herd

Published: June 6, 2026 | Category: Kubernetes & EKS

Quick Summary: To save money I started scaling our dev EKS cluster down to zero nodes outside business hours and back up every morning. The very first morning, the cluster came back up and immediately threw a "QPS limit exceeded" error. The cause wasn't the start/stop automation itself - it was every pod in the cluster trying to pull its container image at the exact same second. Here's what happened and how I fixed the thundering herd.

Why I Started Stopping the Dev Cluster at Night

Our development EKS cluster was running 24/7. But nobody touches a dev environment at 2 AM, and nobody is working on it over the weekend. Yet the node groups kept running the whole time, quietly burning EC2 spend for nothing.

The math is hard to ignore. There are 168 hours in a week, but a dev cluster realistically only needs to be available during business hours - roughly 50 hours a week (10 hours a day, 5 days a week). That means for about 70% of the week we were paying for compute that no one used.

So I built a simple scheduled job: scale the node groups (and the workloads on them) down to zero at 8 PM, and scale them back up at 8 AM on weekdays. The control plane stays up - AWS charges a flat hourly rate for that regardless - but the expensive part, the worker nodes, goes to zero.
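The decision at the heart of that job can be sketched in a few lines. This is a minimal illustration rather than the actual job: the function name and the `daytime_size` parameter are hypothetical, the hours come from the schedule described above, and the call that actually resizes the node group is deliberately left out.

```python
from datetime import datetime

UP_HOUR, DOWN_HOUR = 8, 20   # 8 AM and 8 PM, as in the schedule above
WORKDAYS = range(0, 5)       # Monday=0 .. Friday=4

def desired_node_count(now: datetime, daytime_size: int) -> int:
    """Return the node group size the schedule wants at `now` (local time)."""
    is_workday = now.weekday() in WORKDAYS
    in_hours = UP_HOUR <= now.hour < DOWN_HOUR
    return daytime_size if (is_workday and in_hours) else 0

# Hypothetical usage: a scheduler calls this and passes the result to
# whatever resizes the node group.
print(desired_node_count(datetime(2026, 6, 8, 9, 0), daytime_size=3))   # Monday morning
print(desired_node_count(datetime(2026, 6, 6, 10, 0), daytime_size=3))  # Saturday
```

Keeping the decision separate from the AWS call makes the schedule easy to test without touching a real cluster.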

💰 The Idea in One Line

Before: node groups running ~168 hours/week.

After: node groups running ~50 hours/week.

Result: roughly 60-70% lower worker-node compute cost on the dev account, with no impact on developers during the day.
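To make the arithmetic behind that range explicit, here is a small worked calculation. The function name is my own, and it only measures hours of uptime avoided, so it assumes cost scales linearly with node-hours.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def idle_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week the node groups can stay scaled to zero."""
    needed = hours_per_day * days_per_week
    return 1 - needed / HOURS_PER_WEEK

print(f"{idle_fraction(10, 5):.0%}")  # 50 h/week needed  -> 70%
print(f"{idle_fraction(12, 5):.0%}")  # 60 h/week needed  -> 64%
```

A 10-hour working day gives the top of the range, and a 12-hour window such as 8 AM to 8 PM lands nearer the bottom.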

Then the Cluster Woke Up Angry

The automation worked perfectly going down. The problem showed up going up. The first morning the cluster scaled back from zero, a wave of pods got stuck in ContainerCreating, and the events were full of the same message:

The error I saw:

Failed to pull image "xxxxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest":
rpc error: code = Unknown desc = failed to pull and unpack image:
failed to copy: httpReadSeeker: failed open:
unexpected status code ... 429 Too Many Requests

Warning  Failed   kubelet  Error: ErrImagePull
Warning  Failed   kubelet  ... QPS limit exceeded / Rate exceeded

At first I blamed the start/stop script. But the script had done its job - the nodes were healthy and pods were being scheduled. The real culprit was timing.

Root Cause: Everything Pulls at Once

When a cluster runs normally, pods start at different times - a deploy here, a restart there. Image pulls are naturally spread out, so nobody notices any limits.

But when you bring a cluster back from zero, that smooth spread collapses into a single spike:

All node groups scale up together, so a batch of fresh nodes joins within the same minute.
Every node starts with an empty image cache - nothing is pre-pulled because the nodes are brand new.
The scheduler places every pending pod from every namespace as fast as it can.
So every kubelet, on every node, fires off image pull requests to the registry at the same second.

This is a classic thundering herd. Two different rate limits get hit at once:

Registry-side throttling (ECR): Amazon ECR enforces request rate limits on the API calls used during a pull (GetDownloadUrlForLayer, BatchGetImage, GetAuthorizationToken). Hundreds of simultaneous pulls blow past those limits and ECR returns 429 Too Many Requests / "Rate exceeded".
Node-side throttling (kubelet): Each kubelet also rate-limits how fast it starts image pulls via registryPullQPS and registryBurst. Under a flood, pulls beyond that budget are rejected rather than queued, and the failure surfaces as "QPS limit exceeded" (the kubelet's own wording is "pull QPS exceeded").
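Both limits behave roughly like a token bucket, which is why timing matters more than volume. The toy model below is only an illustration under that assumption, not the kubelet's or ECR's actual implementation. It replays the same 100 requests twice, once all at the same instant and once spread out, against a bucket sized like the kubelet defaults (5 per second, burst of 10).

```python
def count_throttled(arrival_times, rate_per_sec, burst):
    """Replay requests through a token bucket; return how many are rejected."""
    tokens = float(burst)
    last = None
    rejected = 0
    for t in sorted(arrival_times):
        if last is not None:
            # Refill for the time elapsed since the previous request.
            tokens = min(burst, tokens + (t - last) * rate_per_sec)
        last = t
        if tokens >= 1:
            tokens -= 1
        else:
            rejected += 1
    return rejected

herd = [0.0] * 100                       # 100 pulls in the same instant
spread = [i * 0.2 for i in range(100)]   # the same 100 pulls, one every 200 ms

print(count_throttled(herd, rate_per_sec=5, burst=10))    # 90
print(count_throttled(spread, rate_per_sec=5, burst=10))  # 0
```

Same number of pulls, same limits: the only thing that changed is how the requests are spread over time.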

The key insight: The error has nothing to do with your images being broken or your registry being down. It is purely a concurrency problem - too many pulls in too short a window. Fix the concurrency and the error disappears.

How I Fixed It (4 Steps)

1. Stagger the Scale-Up Instead of Big-Banging It

The single most effective fix was to stop bringing everything back at once. Instead of scaling all node groups to full size in one shot, I brought capacity back in phases - a few nodes first, wait a couple of minutes for their images to land, then the rest.

On the workload side, the same idea applies: don't un-pause every deployment simultaneously. Restoring critical namespaces first and the rest a few minutes later spreads the pulls over time and keeps each burst under the registry's limit.
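One way to express that phasing is as a plan of (delay, batch) pairs that the start-up job walks through. The sketch below is an assumption about how such a plan could be built, not the script I actually ran. The namespace names, batch size, and two-minute gap are hypothetical, and the step that actually restores each batch is omitted.

```python
def staggered_plan(namespaces, batch_size, gap_seconds):
    """Group namespaces into phases and give each phase a start offset in seconds."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan = []
    for i in range(0, len(namespaces), batch_size):
        offset = (i // batch_size) * gap_seconds
        plan.append((offset, namespaces[i:i + batch_size]))
    return plan

# Hypothetical namespaces, ordered most critical first.
for offset, batch in staggered_plan(["ingress", "auth", "api", "web", "batch"], 2, 120):
    print(f"t+{offset}s: restore {batch}")
```

The same shape works for node groups: replace the namespace list with node group names and scale each batch before sleeping for the gap.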

2. Tune the kubelet Pull Limits

The kubelet defaults are conservative (registryPullQPS: 5, registryBurst: 10, and serializeImagePulls: true, which pulls one image at a time per node). For a cluster that intentionally does mass start-ups, I adjusted these so a node can pull a bit more aggressively and fetch several images in parallel instead of one after another:

# kubelet config (via EKS launch template / nodeadm / custom AMI)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   # allow parallel pulls per node
registryPullQPS: 10          # default is 5
registryBurst: 20            # default is 10

A word of caution: turning these up on the node while the registry is the bottleneck can make ECR throttling worse, not better. So I paired this with step 3 - reducing how often we hit ECR at all.
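A quick back-of-the-envelope calculation shows why. The kubelet limits are per node, so the ceiling on the whole fleet grows with the node count. The 20-node fleet below is a hypothetical figure, and the numbers count pull starts, not individual ECR API calls (each pull makes several).

```python
def fleet_pull_ceiling(nodes: int, registry_pull_qps: int, registry_burst: int):
    """Upper bounds on pulls the whole fleet may start: (sustained per second, initial burst)."""
    return nodes * registry_pull_qps, nodes * registry_burst

print(fleet_pull_ceiling(20, 5, 10))   # kubelet defaults     -> (100, 200)
print(fleet_pull_ceiling(20, 10, 20))  # the tuned values     -> (200, 400)
```

Doubling the per-node limits doubles what the registry can be asked to absorb in the first second of a cold start.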

3. Add an ECR Pull-Through Cache and VPC Endpoints

I set up an ECR pull-through cache and made sure the cluster reaches ECR over VPC interface endpoints (ecr.api and ecr.dkr), plus the S3 gateway endpoint, since image layers live in S3. This does two things: it keeps pull traffic inside the VPC instead of going out over a NAT gateway, and the cache means repeated pulls of the same upstream image hit a copy stored in our own ECR instead of re-fetching from the upstream registry every time.

This is especially valuable for public images (e.g. Docker Hub), which have their own aggressive anonymous pull-rate limits that a cold cluster start can trip instantly.
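Using the cache means pointing workloads at the ECR URI instead of the upstream one. The helper below is a sketch of that rewrite for Docker Hub references only. The function name is hypothetical, the account ID is a placeholder, and the `docker-hub` prefix is an assumption: it must match whatever repository prefix was configured on the pull-through cache rule.

```python
def to_pull_through_uri(image: str, account_id: str, region: str,
                        prefix: str = "docker-hub") -> str:
    """Rewrite a Docker Hub image reference to an ECR pull-through cache URI."""
    # Drop an explicit Docker Hub registry host if one is present.
    for host in ("docker.io/", "index.docker.io/"):
        if image.startswith(host):
            image = image[len(host):]
    # Official images live under the implicit "library/" namespace on Docker Hub.
    repo = image.split(":", 1)[0].split("@", 1)[0]
    if "/" not in repo:
        image = "library/" + image
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{prefix}/{image}"

print(to_pull_through_uri("nginx:1.27", "111122223333", "ap-south-1"))
# 111122223333.dkr.ecr.ap-south-1.amazonaws.com/docker-hub/library/nginx:1.27
```

The first pull of each image through the cache still goes to the upstream registry, so this helps most from the second morning onward.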

4. Pre-Pull the Hot Images

The deepest fix is to make sure nodes don't start with an empty cache. Two ways to do this:

Bake images into a custom AMI: pre-pull your most common base and app images into the node image so they're already on disk the moment the node boots.
Run an image pre-puller DaemonSet: a lightweight DaemonSet that pulls the heavy images onto every node ahead of the real workloads, smoothing the spike.

Either way, fewer cold pulls means a far smaller thundering herd when the cluster comes back to life.
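For the DaemonSet option, a common pattern is one init container per image that exits immediately, followed by a tiny container that keeps the pod alive. The generator below is a minimal sketch of that pattern, not the manifest I deployed. The DaemonSet name, namespace, and pause image tag are assumptions, and the `sh -c` command assumes each image ships a shell.

```python
import json

def prepuller_daemonset(images, name="image-prepuller", namespace="kube-system"):
    """Build a DaemonSet manifest whose init containers pull each image once per node."""
    init_containers = [
        {
            "name": f"pull-{i}",
            "image": image,
            # Exit immediately; the only goal is to get the image onto the node's disk.
            "command": ["sh", "-c", "exit 0"],
        }
        for i, image in enumerate(images)
    ]
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "initContainers": init_containers,
                    # A small long-running container keeps the pod alive after the pulls.
                    "containers": [{"name": "pause", "image": "registry.k8s.io/pause:3.9"}],
                },
            },
        },
    }

manifest = prepuller_daemonset(["xxxxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest"])
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be saved to a file and applied as-is. Because init containers run one after another, the pulls on each node are naturally serialized.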

The Result

The QPS limit exceeded / 429 errors disappeared on subsequent morning start-ups.
Pods reached Running faster because pulls were no longer fighting each other.
We kept the full cost savings of scaling to zero at night - without the painful wake-up.

Pro Tip: If you only do one thing, stagger the scale-up. Most "QPS limit exceeded" start-up failures vanish the moment you stop bringing the entire cluster back in a single burst.

Frequently Asked Questions

What causes "QPS limit exceeded" when starting an EKS cluster?

It's a concurrency problem. When a scaled-to-zero cluster comes back, every node is fresh with an empty image cache and every pod is scheduled at once, so they all pull images simultaneously. That flood exceeds both Amazon ECR's API rate limits and the kubelet's own pull QPS limits, producing throttling and 429 errors.

Does this mean my images or registry are broken?

No. The images and the registry are fine. The error is purely about too many pulls happening in too short a time window. Spreading those pulls out makes it go away.

How do I reduce image pull throttling on EKS?

Stagger the scale-up so workloads return in phases, tune the kubelet (registryPullQPS, registryBurst, serializeImagePulls), use an ECR pull-through cache with a VPC endpoint, and pre-pull hot images via a custom AMI or a DaemonSet.

Is scaling a dev cluster to zero at night actually worth it?

For non-production environments, almost always. A dev cluster only needs to be up during business hours - roughly 50 of the 168 hours in a week - so scaling worker nodes to zero the rest of the time can cut compute cost by 60-70%. The start-up thundering herd is the main gotcha, and it's entirely solvable.

My Takeaway

Scaling a dev cluster to zero overnight is one of the easiest cost wins in Kubernetes - but "scale to zero" quietly changes the start-up from a trickle into a flood. The "QPS limit exceeded" error was the cluster telling me it was trying to do a whole day's worth of image pulls in one second. Once I stopped big-banging the start-up and gave ECR some breathing room with a cache and pre-pulled images, the mornings got quiet again.

For more detail, see the AWS docs on ECR pull-through cache and the Kubernetes reference for kubelet configuration.

srinun.in