Kubernetes PriorityClass Isn't Enough: Pinning a Pod to AMD Nodes During an ARM Migration

Published: | Category: Kubernetes & EKS
Quick Summary: We started moving our workloads from AMD (x86, amd64) nodes to ARM (Graviton, arm64) nodes for the lower price and better performance. Our pipelines now build images for both architectures, but the frontend's multi-arch build was painfully slow - so we decided to keep the frontend on AMD for now. My first instinct was a PriorityClass. It wasn't enough on its own. Here's why, and the full combination that actually works: nodeSelector + PriorityClass + taints/tolerations.

Why We're Moving to ARM

AWS Graviton (ARM) instances are cheaper than their equivalent x86 instances and, for a lot of workloads, deliver better performance per dollar. For anyone watching their EKS bill, migrating to ARM is one of the better levers you can pull.

The catch: your container images have to be built for the target architecture. An image built only for amd64 won't run on an arm64 node. So step one of any ARM migration is making your build pipelines produce multi-arch images.

The "Tiny" Pipeline Change That Wasn't So Tiny

Building multi-arch images is, on paper, a one-line change with docker buildx:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t myrepo/app:tag \
  --push .

That single --platform linux/amd64,linux/arm64 flag tells buildx to build the image for both architectures and push them under one tag as a multi-arch manifest list. When a node pulls that tag, its container runtime automatically picks the variant that matches the node's CPU. Beautiful.

But there's a cost: you're now building twice. And if your build host is x86, the arm64 half is built under emulation (QEMU), which can be dramatically slower. For most of our services this was fine. For the frontend, though, the build time ballooned - the extra arm64 build turned a quick pipeline into a slow one.

Our decision: Migrate everything else to ARM, but keep the frontend on AMD only for now. No multi-arch frontend build, no slow pipeline. We'd revisit the frontend later (native ARM builders speed this up a lot). The challenge then became: how do we guarantee the frontend always runs on an AMD node?

Attempt 1: Just Use a PriorityClass (Spoiler: Not Enough)

My first thought was a PriorityClass. The idea: make the frontend "more important" than other pods so it always gets a spot on the AMD nodes.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: frontend-high-priority
value: 1000000
globalDefault: false
description: "Frontend wins contention on the limited AMD nodes."

This is useful - but it does not do what I first assumed. Here's the crucial distinction:

What PriorityClass actually does: It controls the order in which pending pods are scheduled and whether a pod can preempt (evict) lower-priority pods to make room. It does NOT pin a pod to a particular node or CPU architecture. With only a PriorityClass, nothing stops the frontend from being scheduled onto an ARM node - where its amd64-only image won't even run.

So a PriorityClass answers "who gets scheduled first?" - not "where does this pod run?" Those are two completely different questions, and I was conflating them.

The Real Fix: Three Pieces That Each Do One Job

Keeping the frontend reliably on AMD takes three mechanisms working together. Each solves a different part of the problem.

1. nodeSelector - Decides WHERE the pod can land

This is the piece that actually pins the frontend to x86. Kubernetes labels every node with its architecture automatically, so you just select for it:

spec:
  template:
    spec:
      priorityClassName: frontend-high-priority
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
        - name: frontend
          image: myrepo/frontend:tag   # amd64-only is fine now

With kubernetes.io/arch: amd64, the scheduler will only ever place the frontend on an AMD node. This was the missing piece - PriorityClass could never have done this.
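The same constraint can also be written as node affinity. A minimal sketch of the equivalent pod-template fragment, assuming you want a hard requirement rather than a preference (this is an alternative to the nodeSelector above, not something you need in addition to it):

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: the pod stays Pending rather than land on a non-amd64 node.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
```

For a single label match, nodeSelector is shorter and does the same job. Affinity is mostly worth the extra lines when you need several allowed values or a soft preference.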

2. PriorityClass - Decides WHO wins when AMD nodes are full

Now that the frontend is restricted to AMD nodes, a new risk appears: those AMD nodes are now a scarce resource (we're shrinking them as we move to ARM). If other pods fill them up, the frontend could be stuck Pending.

This is exactly where the PriorityClass earns its keep. When the high-priority frontend can't fit, the scheduler will preempt lower-priority pods on the AMD nodes to make room. Those evicted pods are recreated by their controllers (a Deployment, for example) and scheduled elsewhere - for us, onto the plentiful ARM nodes. So nodeSelector gets the frontend to the AMD nodes, and PriorityClass makes sure it wins the space there.
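Preemption is relative: the frontend can only evict pods whose priority is lower than its own. Pods with no priorityClassName get priority 0 unless some PriorityClass in the cluster is marked globalDefault, so they already sit below 1000000. If you'd rather make that explicit, a low-priority class for the movable workloads might look roughly like this (the name batch-low-priority and the value 1000 are made up for illustration):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low-priority   # hypothetical name
value: 1000                  # hypothetical value; anything below the frontend's 1000000 works
globalDefault: false
description: "Workloads that may be preempted to make room for the frontend."
```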

3. Taints & Tolerations - Keep everyone else OFF the AMD nodes

Relying on preemption works, but it's reactive - pods get scheduled, then evicted, which causes churn. The cleaner approach is to stop other pods from landing on the AMD nodes in the first place. That's what taints do: they repel pods unless the pod explicitly tolerates the taint.

Taint the AMD nodes (or set it on the AMD node group / Karpenter NodePool):

kubectl taint nodes <amd-node> workload=frontend:NoSchedule
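If the AMD nodes come from Karpenter, the taint belongs on the NodePool so every node it launches is tainted from the start. A rough sketch, assuming the karpenter.sh/v1 API on AWS and an existing EC2NodeClass called default; the NodePool name amd-frontend is hypothetical, and the API version depends on which Karpenter release you run:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: amd-frontend            # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default           # assumes an EC2NodeClass with this name exists
      requirements:
        # Only launch x86 nodes from this pool.
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        # Same taint as the kubectl command above, applied to every new node.
        - key: workload
          value: frontend
          effect: NoSchedule
```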

Then let only the frontend tolerate it:

      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"

Now the AMD nodes are effectively reserved for the frontend: other pods are repelled by the taint, and the frontend lands there thanks to its toleration + nodeSelector. The PriorityClass becomes a safety net rather than the primary mechanism.
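Put together, the three pieces sit side by side in the pod template. Here's a sketch of what the full Deployment could look like; the Deployment name, the app: frontend label, and the replica count are assumptions for illustration, while the scheduling fields are the ones from the snippets above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                # assumed name
spec:
  replicas: 2                   # assumed replica count
  selector:
    matchLabels:
      app: frontend             # assumed label
  template:
    metadata:
      labels:
        app: frontend
    spec:
      # WHO wins: preempts lower-priority pods if the AMD nodes are full.
      priorityClassName: frontend-high-priority
      # WHERE: only amd64 nodes are eligible.
      nodeSelector:
        kubernetes.io/arch: amd64
      # Allowed past the taint that keeps other pods off the AMD nodes.
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"
      containers:
        - name: frontend
          image: myrepo/frontend:tag   # amd64-only image
```

Note that the toleration only permits the frontend to run on the tainted nodes; it doesn't attract it there. Drop the nodeSelector and the pod could still be scheduled onto an untainted ARM node.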

The mental model that finally made it click:
  • nodeSelector / affinity = where a pod is allowed to go (attraction)
  • Taints / tolerations = which pods a node repels (reservation)
  • PriorityClass = who gets scheduled first and who can evict whom (order)
They're three different questions. The reason "just a PriorityClass" failed is that it only answers the third one.

Gotchas Worth Knowing

  • Don't taint your AMD nodes without checking system pods. DaemonSets and critical add-ons need to tolerate the taint or run elsewhere, or you'll break things like logging/monitoring agents on those nodes.
  • Preemption can cause churn. If you lean on PriorityClass-driven preemption instead of taints, expect lower-priority pods to be evicted and rescheduled. Set preemptionPolicy: Never on the PriorityClass if you want priority ordering without evicting others.
  • Keep priority values sane. Kubernetes caps user-defined PriorityClass values at 1,000,000,000, so you can't outrank the built-in system classes (system-cluster-critical, system-node-critical) - but within that range, keep the frontend below anything the cluster itself depends on. You don't want app pods preempting core cluster components.
  • This is a transition state. The end goal is still a native multi-arch frontend on ARM. Pinning to AMD is a bridge, not a destination.
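For the preemptionPolicy: Never option mentioned above, a non-preempting variant of the frontend's class might look like this. It's a sketch: the name frontend-high-priority-nonpreempting is hypothetical, and you'd point priorityClassName at it instead of the original class.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: frontend-high-priority-nonpreempting   # hypothetical name
value: 1000000
# Still scheduled ahead of lower-priority pending pods, but never evicts running pods.
preemptionPolicy: Never
globalDefault: false
description: "Frontend goes first in the queue without preempting other pods."
```

The trade-off: with this class the frontend waits for AMD capacity to free up on its own, so it makes the most sense when taints already keep other workloads off those nodes.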

Frequently Asked Questions

Does a PriorityClass control which node a pod runs on?

No. PriorityClass only affects scheduling order and preemption. To control which node or architecture a pod runs on, use a nodeSelector or node affinity.

How do I keep a pod on amd64 nodes during an ARM migration?

Add nodeSelector: kubernetes.io/arch: amd64 to pin it to x86 nodes. Add a high PriorityClass so it wins contention on the limited AMD capacity, and taint the AMD nodes with a matching toleration on the pod to reserve them.

What's the difference between nodeSelector, taints, and PriorityClass?

nodeSelector/affinity controls where a pod can go. Taints/tolerations repel pods from nodes to reserve them. PriorityClass controls scheduling order and preemption. Different jobs - usually used together.

Why not just build the frontend for ARM too?

Eventually you should. We paused it only because the multi-arch frontend build was too slow under emulation. With a native ARM build runner the build time drops and the frontend can move to ARM like everything else.

My Takeaway

The lesson from this migration was simple but easy to get wrong: scheduling priority and pod placement are not the same thing. A PriorityClass will never keep a pod on a particular architecture - it just decides who goes first. To actually pin our frontend to AMD nodes, the nodeSelector did the placement, taints reserved the capacity, and the PriorityClass was the safety net for contention.

If you're partway through an ARM (Graviton) migration and need certain workloads to stay on x86 for a while, reach for all three - and know which problem each one is solving.

For the official details, see the Kubernetes docs on Pod Priority and Preemption, assigning pods to nodes, and taints and tolerations.