Sometimes the Internet Standardizes Reality Instead of Inventing It

Tony Mattke · 2026.07.02 · 7 min read

This month RFC 10005 quietly became a Proposed Standard. It describes the BGP Link Bandwidth Extended Community, a way for a router to tell its neighbors how much capacity sits behind a given path so they can split traffic in proportion to it instead of evenly.

If that sounds familiar, it should. Cisco has shipped a version of this for well over a decade as “BGP DMZ Link Bandwidth.” The IETF draft it grew out of, draft-ietf-idr-link-bandwidth, was first posted in April 2009. It spent seventeen years as a draft, and operators ran it the whole time.

The mechanism is the simple part; the story is in those seventeen years. A standards body sometimes invents something new and sometimes just writes down what everyone is already doing. RFC 10005 is the second kind.

What it actually does

The problem is old and simple. Plain ECMP assumes every path to a destination is interchangeable, so it splits flows evenly across them. That holds when the paths are equal. When they are not, the smallest path saturates while the larger ones sit half empty. The link bandwidth community carries a number, the bandwidth behind each next hop, so a router can weight the split. A path with twice the capacity gets roughly twice the traffic.

[Figure: Plain ECMP splits 600G evenly and overloads the 100G path, while weighted ECMP splits by capacity so all three paths land at the same utilization.]
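The arithmetic behind that comparison is short enough to sketch. This is a rough illustration, not anyone's implementation. The 600G demand and the 100G path come from the text above, while the 200G and 400G paths are my assumption, picked because they make three paths total 700G. The function names are made up for the example.

```python
def split(demand_gbps, capacities_gbps, weighted):
    """Return the load placed on each path, in Gbps."""
    total_capacity = sum(capacities_gbps)
    loads = []
    for capacity in capacities_gbps:
        if weighted:
            fraction = capacity / total_capacity   # share follows capacity
        else:
            fraction = 1 / len(capacities_gbps)    # plain ECMP: equal shares
        loads.append(demand_gbps * fraction)
    return loads


def utilization(loads, capacities_gbps):
    """Load divided by capacity for each path."""
    return [load / cap for load, cap in zip(loads, capacities_gbps)]


CAPS = [100, 200, 400]   # assumed capacities; only the 100G path is from the text
DEMAND = 600             # Gbps, from the text

even = split(DEMAND, CAPS, weighted=False)
proportional = split(DEMAND, CAPS, weighted=True)

for name, loads in (("plain ECMP", even), ("weighted ECMP", proportional)):
    print(name, [f"{u:.0%}" for u in utilization(loads, CAPS)])
```

Under those assumed capacities, plain ECMP puts 200G on every path, which is double what the 100G path can carry. The weighted split puts every path at 600/700 of its capacity, which is about 86%.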

RFC 10005 registers it as a two-octet AS-specific extended community, sub-type 0x04, in both transitive and non-transitive forms. It also pins down the edge cases the working group argued about for years: a zero-bandwidth value is valid rather than malformed, and if some paths carry the community while others don’t, you fall back to equal balancing unless local policy says otherwise. Boring details. They are exactly the kind of thing that has to be written down once before two vendors can agree on the behavior.
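To make that concrete, here is a minimal sketch of the wire format and the fallback rule. One loud assumption: the layout follows the long-running draft, with a two-octet AS number followed by the bandwidth as a four-octet IEEE float in bytes per second, and I have not checked it byte-for-byte against the final RFC text. The type values 0x00 and 0x40 are the transitive and non-transitive two-octet AS-specific types from RFC 4360. The function names, and the choice to balance equally when every advertised bandwidth is zero, belong to this sketch alone.

```python
import struct

SUBTYPE_LINK_BANDWIDTH = 0x04
TYPE_TRANSITIVE = 0x00       # two-octet AS-specific, transitive
TYPE_NON_TRANSITIVE = 0x40   # two-octet AS-specific, non-transitive


def encode_link_bandwidth(asn, bits_per_second, transitive=False):
    """Pack the 8-octet community: type, sub-type, 2-octet AS, 4-octet float."""
    type_high = TYPE_TRANSITIVE if transitive else TYPE_NON_TRANSITIVE
    bytes_per_second = bits_per_second / 8   # assumed unit, per the draft
    return struct.pack("!BBHf", type_high, SUBTYPE_LINK_BANDWIDTH,
                       asn, bytes_per_second)


def decode_link_bandwidth(raw):
    """Return (asn, bits_per_second, transitive) or raise ValueError."""
    type_high, subtype, asn, bytes_per_second = struct.unpack("!BBHf", raw)
    if subtype != SUBTYPE_LINK_BANDWIDTH:
        raise ValueError("not a link bandwidth community")
    if type_high not in (TYPE_TRANSITIVE, TYPE_NON_TRANSITIVE):
        raise ValueError("unexpected extended community type")
    return asn, bytes_per_second * 8, type_high == TYPE_TRANSITIVE


def multipath_weights(bandwidths):
    """One entry per path; None means the path carried no community."""
    count = len(bandwidths)
    if any(bw is None for bw in bandwidths):
        return [1 / count] * count      # mixed set: fall back to equal balancing
    total = sum(bandwidths)
    if total == 0:
        return [1 / count] * count      # this sketch's choice, not from the text
    return [bw / total for bw in bandwidths]   # a zero entry is valid: weight 0


raw = encode_link_bandwidth(65001, 100e9)
print(raw.hex(), decode_link_bandwidth(raw))
print(multipath_weights([100e9, None, 400e9]))
print(multipath_weights([0, 100e9, 300e9]))
```

A 32-bit float cannot represent every bandwidth exactly, so a decoded value may differ from the encoded one by a tiny fraction. The local-policy override the RFC allows is left out entirely.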

Why it sat in draft limbo for seventeen years

Because it already worked. Vendors implemented it, operators turned it on, and the de-facto behavior was good enough that nobody felt the pain of it being non-standard. A draft you can already deploy is a draft with no deadline.

The pressure to finish a standard usually comes from interop problems, and for most of those seventeen years there weren’t enough of them. The networks that leaned hardest on weighted load balancing tended to be single-vendor, single-operator, and built by people who could read the draft and the vendor docs and move on. The gap between “draft” and “RFC” only hurts when you have two different boxes that need to agree, and apparently not enough networks were in that spot to force the issue.

What changed underneath

The shape of the networks did.

For most of the last twenty years you could pretend ECMP solved load balancing, because in a uniform fabric it nearly did. Every spine ran at the same speed, every uplink matched its neighbor, and “equal cost” did mean equal capacity. That assumption is quietly coming apart.

Modern fabrics run 400G spines alongside 800G uplinks, break a single port into four logical interfaces, and grow by adding capacity in waves instead of all at once. One spine gets upgraded ahead of the others. A DCI lights four lambdas while the circuit next to it still runs two. The new leaf ships with a 400G uplink while the rest of the row stays at 100G. The moment any of that happens, “equal cost” stops meaning “equal capacity,” and ECMP cheerfully shoves the same share of traffic down a path that can’t hold it.

That is a networking problem, and it predates the current moment by years. Heterogeneous capacity has been creeping into fabrics since the first hyperscaler upgraded half a row and left the other half for next quarter.

Where AI comes in

AI inherited the problem and made getting it wrong a lot more expensive. East-west traffic in an AI back-end fabric dwarfs anything enterprise networks ever pushed, and the flows are large and long-lived rather than the short, bursty connections ECMP’s hashing was tuned for. A single congested path can stall a whole collective operation, and a stalled collective idles every GPU waiting on it. Pour that traffic evenly across paths of unequal capacity and you create exactly the hotspot that costs you a rack of expensive silicon sitting on its hands.

Getting the split wrong used to mean a slow afternoon. Now it can mean idle hardware you are paying for by the hour. Weighted multipath went from something only Google and Meta really cared about to something a lot more operators suddenly need.

Weighted ECMP works at the level of hash buckets. It hands out more of them to the fatter links and leans on the law of large numbers to turn that into a proportional split. With thousands of similar-sized flows, it works beautifully. With a handful of enormous ones, it breaks down, because a flow stays pinned to a single path, and one 100G elephant lands wherever its hash sends it no matter what the weights say. The clean ~86% everywhere in the diagram above is the statistical ideal. A real GPU collective lands somewhere far messier.
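A toy simulation shows the difference. Treat it as a sketch under stated assumptions: the three paths and their 100G/200G/400G capacities are illustrative, the 1:2:4 bucket table is the simplest one that matches them, and SHA-256 over a flow number stands in for whatever hash real silicon runs over the packet headers.

```python
import hashlib


def build_buckets(weights):
    """Expand {path: integer weight} into a flat bucket table."""
    table = []
    for path, weight in weights.items():
        table.extend([path] * weight)
    return table


def pick_path(flow_id, table):
    """Hash the flow to a bucket; every packet of the flow lands here."""
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return table[int.from_bytes(digest[:4], "big") % len(table)]


def place_flows(flow_sizes_gbps, table):
    """Pin each flow to one path and add up the load per path."""
    load = {path: 0.0 for path in table}
    for flow_id, size in enumerate(flow_sizes_gbps):
        load[pick_path(flow_id, table)] += size
    return load


CAPACITY = {"A": 100, "B": 200, "C": 400}          # assumed, in Gbps
TABLE = build_buckets({"A": 1, "B": 2, "C": 4})    # 7 buckets, weighted 1:2:4

mice = place_flows([0.1] * 6000, TABLE)       # 600G as many small flows
elephants = place_flows([100.0] * 6, TABLE)   # 600G as six giant flows

for label, load in (("6000 x 0.1G", mice), ("6 x 100G", elephants)):
    print(label, {p: f"{load[p] / CAPACITY[p]:.0%}" for p in CAPACITY})
```

The small flows should land within a few points of the capacity ratio. The six elephants cannot: each one adds a whole 100G to a single path, so no placement of them produces equal utilization across 100G, 200G, and 400G links, whatever the hash does.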

Reality outran the standard

The link bandwidth community became a standard at the exact moment the dominant new workload turned into the one it serves worst. An AI collective is a few giant, long-lived flows, precisely the case where bucket-based weighting can’t deliver the ratios it promises. So those fabrics are already reaching past it. Weighted ECMP is necessary there and nowhere near sufficient. Seventeen years to standardize the tool, and it lands needing a pile of help for the workload that made it urgent.

The fabric is learning to react, with a handful of techniques that share one instinct: stop trusting a static hash and a capacity number, and move traffic on what the network is doing right now.

  • Adaptive and congestion-aware routing. The fabric watches queue depth and link load and steers around hot spots as they form, instead of pinning a flow to whatever path its hash happened to pick.
  • Flowlet switching. Split a flow only at its natural gaps, so you can rebalance a big flow across paths without reordering packets. It’s what makes adaptive routing safe on elephants.
  • Packet spraying, carefully. Spread one flow across every path packet by packet for near-perfect utilization. It also reorders everything, so it only works when the transport and the NICs are built to reassemble it. The “carefully” is load-bearing.
  • In-network telemetry and explicit congestion signals. All of it depends on switches and endpoints constantly telling each other where the pressure is, which is why ECN and its successors keep getting more capable.
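Flowlet switching is the easiest of these to sketch in a few lines. The version below is a simplification under assumptions of my own: the 100-microsecond idle gap is an arbitrary number, cumulative bytes sent stands in for the queue-depth or link-load signal a real fabric would use, and the class and path names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FlowletBalancer:
    paths: list
    gap_seconds: float                                # idle gap that ends a flowlet
    sent_bytes: dict = field(default_factory=dict)    # crude stand-in for load
    flows: dict = field(default_factory=dict)         # flow -> (path, last_seen)

    def forward(self, flow_id, now, size_bytes):
        """Return the path for this packet, re-picking only after an idle gap."""
        entry = self.flows.get(flow_id)
        if entry is None or now - entry[1] > self.gap_seconds:
            # New flowlet: earlier packets have had time to drain, so moving
            # the flow now should not reorder it. Pick the least-loaded path.
            path = min(self.paths, key=lambda p: self.sent_bytes.get(p, 0))
        else:
            path = entry[0]                           # mid-burst: stay put
        self.flows[flow_id] = (path, now)
        self.sent_bytes[path] = self.sent_bytes.get(path, 0) + size_bytes
        return path


balancer = FlowletBalancer(paths=["spine1", "spine2"], gap_seconds=100e-6)
burst_one = [balancer.forward("elephant", t, 4096) for t in (0.0, 10e-6, 20e-6)]
burst_two = [balancer.forward("elephant", t, 4096) for t in (1e-3, 1.01e-3)]
print(burst_one, burst_two)
```

The first burst stays on one path because its packets arrive 10 microseconds apart. The second burst arrives after a pause of nearly a millisecond, so the same flow moves to the path that has carried less. The gap has to be longer than the latency difference between paths, or the move can still reorder packets.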

The throughline is a move from the control plane to the data plane. A BGP community is a hint you compute once and advertise. Adaptive routing is a decision the silicon remakes continuously, in hardware. That’s the path InfiniBand took years ago and the one the Ultra Ethernet effort is chasing now, and it’s a different machine from ECMP, weighted or not.

My Take on RFC 10005

Strip away the seventeen years and you’re left with a small, sensible standard. The link bandwidth community is a tool operators have leaned on for ages, and now there’s a documented, vendor-neutral definition two different boxes can agree on. For a mixed-vendor fabric trying to use uneven capacity, that’s genuinely useful. The Internet wrote down what people were already running.

What I keep turning over is the timing. A draft sits for seventeen years and then crosses the line in 2026, the same year every other conversation in networking is about AI fabrics. If the renewed push came from AI operators wanting this, and I honestly don’t know that it did, then something doesn’t add up. Weighted ECMP is a static, control-plane hint, and AI back-end traffic is the exact workload it serves worst: a few enormous flows a bucket-weighting scheme can’t actually split. The fabrics that need the help most are already past it, reaching for adaptive routing and packet spraying and the rest. If RFC 10005 got finished for the AI crowd, it shipped aimed at the wrong problem.

My best guess is the boring one. This got done for interop on traditional and mixed-vendor networks, the AI timing is mostly coincidence, and weighted ECMP keeps doing its modest, useful job in the places it was always good for. But I’m reading tea leaves from outside the IDR working group, and the people who pushed this over the line in 2026 know something I don’t.

Tell me where I’m wrong. If you run a fabric that leans on the link bandwidth community, or you watched this draft finally get standardized and know why it happened now, I want to hear it. Am I underselling what RFC 10005 buys? Overselling the AI angle? Way off the mark entirely? The comment section is open. (Because I wanna hear from everyone.)
