Upscale AI at NFD40: The Pitch Before the Product

Tony Mattke · 2026.06.02 · 11 min read

“We haven’t announced any products yet.” Aravind Srikumar opened Upscale AI’s sixty-minute NFD40 session with that line, a strange way to start a vendor pitch and exactly why I started paying attention. What followed was the most rigorous engineering tutorial of the week, wrapped around an architecture preview from a company whose silicon doesn’t exist in shipping form. Both halves are real. The technical content is worth your time even if you never buy anything from Upscale AI. The architecture preview is worth tracking because if it ships and works, it lands in a market that’s actively shopping for it.

Quick context. Upscale AI came out of stealth in 2025 as a pure-play AI networking company, closed a $100M seed that fall, and followed with a $200M Series A in January 2026 that pushed them past a $1B valuation. The team logos on the slide tell the story: ex-Cisco, Broadcom, Innovium, Marvell, Palo Alto, HPE (the Juniper side), and multiple hyperscalers. They’ve collectively shipped multiple ASIC generations, multiple system architectures, and meaningful amounts of hyperscale software. They are not a few MBAs and a deck. Aravind leads product and marketing alongside Deepti Chandra (VP product strategy and marketing, ex-Cisco Silicon One and SONiC, ex-Juniper). Aravind handed off to Deepti and back several times during the session, which made the pacing a little uneven but kept the technical depth high.

The session structure was deliberate: a bird’s-eye view of what happens when a user types into a chatbot and how that becomes a network event, then a packet-level comparison of cloud DC traffic versus AI cluster traffic, then the scale-up architecture announcement (SkyHammer), then the scale-out strategy. I’ll skip the bird’s-eye view section. Most readers here don’t need the “tokens per watt is the new dollars per bit” tutorial. The packet-level content is where the session earned its keep.

Packet teardown

Aravind’s strongest move was a side-by-side comparison of client-server cloud traffic and collective-communication AI cluster traffic. This is what senior engineers actually want when they ask why AI networking is different, and most vendors hand-wave past it.

Connection model. Cloud and web traffic is overwhelmingly connection-oriented TCP, lots of small flows between many clients and many servers, with well-defined flow lifetimes. Collective communication is connectionless UDP-encapsulated RDMA, with a small number of GPUs each generating massive flows, all bursting at once because the parallelism algorithms synchronize the exchanges. North-south becomes east-west. Many small flows become few enormous ones.

Packet structure. A web traffic packet is Ethernet, IPv4, TCP, then a tower of application headers leading to payload that is application data. An RDMA scale-out packet is Ethernet, IPv4, UDP, then an RDMA Base Transport Header that names a memory channel, then payload that is memory. The destination GPU writes the payload into its own memory directly. The packet is not an application message. It’s a memory transaction.
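To make that header stack concrete, here is a minimal Python sketch that packs the two InfiniBand transport headers a RoCEv2 RDMA write carries after UDP: the 12-byte Base Transport Header, which identifies the destination queue pair, and the 16-byte RDMA Extended Transport Header, which carries the remote virtual address, key, and length. This is my illustration, not something shown in the session. The field layout follows the InfiniBand transport format as I understand it, the flag bits are simply zeroed, and the queue pair, address, and key values are made up.

```python
import struct

ROCEV2_UDP_DPORT = 4791            # UDP destination port registered for RoCEv2
OPCODE_RC_RDMA_WRITE_ONLY = 0x0A   # Reliable Connection "RDMA WRITE Only"


def pack_bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF) -> bytes:
    """Pack a 12-byte Base Transport Header (flag bits left at zero)."""
    flags = 0  # SE / MigReq / PadCnt / TVer, all zero in this sketch
    return struct.pack(
        "!BBHII",
        opcode,
        flags,
        pkey,
        dest_qp & 0x00FFFFFF,  # top byte reserved, low 24 bits = destination QP
        psn & 0x00FFFFFF,      # top byte AckReq + reserved, low 24 bits = PSN
    )


def pack_reth(virtual_addr: int, rkey: int, dma_len: int) -> bytes:
    """Pack a 16-byte RDMA Extended Transport Header."""
    return struct.pack("!QII", virtual_addr, rkey, dma_len)


# Hypothetical values: queue pair 0x12, a made-up remote address and key.
bth = pack_bth(OPCODE_RC_RDMA_WRITE_ONLY, dest_qp=0x12, psn=1)
reth = pack_reth(0x7F0000000000, rkey=0xABCD, dma_len=4096)
print(len(bth), len(reth))  # 12 16
```

Everything after those 28 bytes is the memory payload itself, which is the point of the comparison: there is no application framing left to parse.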

Endpoint processing. A traditional packet traverses the kernel TCP/IP stack, gets processed up through layer 7, and lands in the application via multiple memory copies, with the CPU orchestrating every step. An RDMA packet bypasses the kernel entirely. There’s no L5-L7 processing because there’s no application to deliver to. The transport header points at a memory location, the GPU memory subsystem takes the write, and the CPU never sees the packet. Zero copies, no stack traversal, GPU-direct memory placement. These are different paths entirely.

Latency tolerance. Client-server traffic absorbs latency variation cheaply. RDMA traffic does not. A single late packet stalls the collective, and a stalled collective stalls every GPU participating in it. Aravind’s point that landed: this isn’t head-of-line blocking in the networking sense. The network might experience just one packet drop and recover instantly. But the GPUs cannot proceed because they need consistent state across the collective. The compute stalls even though the network is healthy, a category of failure that doesn’t exist in cloud fabrics.
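A toy model shows why one straggler is so expensive. The sketch below is mine, not Upscale’s, and assumes the simplest possible collective: nobody proceeds until every participant’s data has arrived, so the finish time is the maximum arrival time. The arrival numbers are invented.

```python
def collective_finish_us(arrivals_us: list[float]) -> float:
    """The collective completes only when the slowest participant's data lands."""
    return max(arrivals_us)


def total_idle_us(arrivals_us: list[float]) -> float:
    """GPU time spent waiting on the straggler, summed over all participants."""
    finish = max(arrivals_us)
    return sum(finish - t for t in arrivals_us)


# Hypothetical: seven GPUs finish at 10 us, one is delayed to 60 us.
arrivals = [10.0] * 7 + [60.0]
print(collective_finish_us(arrivals))  # 60.0
print(total_idle_us(arrivals))         # 350.0
```

One delayed participant costs 50 microseconds of wall-clock time but 350 microseconds of aggregate GPU time, and that multiplier grows with the size of the collective.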

Loss tolerance. Cloud fabrics assume the underlying transport is lossy and TCP recovers. AI fabrics assume the underlying transport is lossless and PFC, ECN, and DCQCN keep it that way. Drop a packet on an AI fabric and you don’t just retransmit. You go-back-N a window’s worth of memory operations and stall the entire collective for the round trip.
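The go-back-N cost is easy to put rough numbers on. This is a crude sketch under my own assumptions: the loss is noticed after one round trip, every packet from the lost one to the end of the window is resent, and the window size, packet size, and round-trip time are invented rather than taken from the session.

```python
def go_back_n_resends(window: int, lost_index: int) -> int:
    """Packets resent when packet `lost_index` (0-based) of a window is lost."""
    if not 0 <= lost_index < window:
        raise ValueError("lost_index must fall inside the window")
    return window - lost_index


def serialization_seconds(packet_bytes: int, link_gbps: float) -> float:
    """Time to put one packet on the wire at the given link rate."""
    return packet_bytes * 8 / (link_gbps * 1e9)


def recovery_seconds(window: int, lost_index: int, packet_bytes: int,
                     link_gbps: float, rtt_seconds: float) -> float:
    """Crude stall estimate: one RTT to notice the loss, plus the resend time."""
    resends = go_back_n_resends(window, lost_index)
    return rtt_seconds + resends * serialization_seconds(packet_bytes, link_gbps)


# Hypothetical: 64-packet window, first packet lost, 4 KiB packets, 400G, 10 us RTT.
stall = recovery_seconds(64, 0, 4096, 400, 10e-6)
print(f"{stall * 1e6:.2f} us")
```

A selective-retransmit scheme would resend one packet in the same scenario. Go-back-N resends all 64, and every GPU in the collective waits for it.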

Telemetry granularity. Cloud monitoring lives at 30-to-60-second SNMP poll intervals. AI fabrics need sub-100-millisecond visibility, with operators starting to ask for microsecond-level telemetry on critical metrics. You can’t react fast enough at human-scale poll intervals when a collective stall is measured in microseconds.
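The arithmetic behind that is worth seeing once. A counter-based poll reports the average over the interval, so a short line-rate burst is diluted by however long the interval is. The sketch below assumes a single burst per interval and uses an invented 500-microsecond burst.

```python
def average_utilization(burst_seconds: float, poll_interval_seconds: float,
                        burst_utilization: float = 1.0) -> float:
    """Utilization a counter poll would report for one burst inside the interval."""
    if burst_seconds > poll_interval_seconds:
        raise ValueError("this sketch assumes the burst fits inside one interval")
    return burst_utilization * burst_seconds / poll_interval_seconds


# Hypothetical 500 us line-rate burst, seen at three different poll intervals.
for interval in (30.0, 0.1, 0.001):
    seen = average_utilization(500e-6, interval)
    print(f"{interval:>6} s poll -> {seen:.4%} utilization")
```

At a 30-second poll the burst is a rounding error. It only looks like the congestion event it was once the sampling interval approaches the duration of the burst itself.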

Anyone trying to figure out why their existing data center playbook doesn’t transfer to AI clusters should watch this section twice. Aravind walked it slowly enough that nobody had to fake comprehension, and the slides showed the actual packet structures and processing paths.

Complications in ECMP for AI traffic flows

Aravind’s other useful insight was about load balancing. In a cloud fabric, you have many small flows. Five-tuple ECMP hashes them across uplinks and statistically the load distributes evenly. Hash collisions don’t matter much because the colliding flows are small and short-lived.

In an AI fabric, you have a small number of GPUs per leaf, each generating one or two enormous flows that burst at line rate during a collective. Five-tuple ECMP can absolutely hash two 400G elephant flows onto the same 400G uplink. That uplink becomes a hot spot, the collective stalls, and the network still thinks it’s healthy, because nothing short of microsecond-resolution queue depth shows the link as saturated. Adaptive load balancing helps, packet spraying helps more, but the fundamental problem is that the statistical assumptions ECMP relies on don’t hold when you have ten flows instead of ten thousand.
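The ten-versus-ten-thousand point is a birthday problem, and a short sketch makes it tangible. The assumptions here are mine: an ideal hash that places each flow on a uniformly random uplink, equal-size flows, and invented flow and uplink counts. Real ECMP hashes and real traffic are messier.

```python
import random
from math import prod


def collision_probability(flows: int, uplinks: int) -> float:
    """Chance that at least two flows land on the same uplink (uniform hash)."""
    if flows > uplinks:
        return 1.0  # pigeonhole: more flows than links guarantees a collision
    no_collision = prod((uplinks - i) / uplinks for i in range(flows))
    return 1.0 - no_collision


def load_imbalance(flows: int, uplinks: int, seed: int = 0) -> float:
    """Busiest uplink's flow count divided by the perfectly even share."""
    rng = random.Random(seed)
    load = [0] * uplinks
    for _ in range(flows):
        load[rng.randrange(uplinks)] += 1
    return max(load) / (flows / uplinks)


# Hypothetical leaf with 8 uplinks.
print(f"{collision_probability(4, 8):.3f}")  # 0.590
print(load_imbalance(10, 8), load_imbalance(10_000, 8))
```

With only four line-rate flows on eight uplinks, a collision is more likely than not. With ten thousand small flows the busiest link sits within a few percent of the average, which is the regime ECMP was designed for.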

This is the kind of thing senior engineers feel in their stomach when they hear it for the first time. Most of us have been hashing flows for twenty years and assuming the math works out. In AI clusters, the math doesn’t work out, and Upscale’s argument is that the silicon, the system, and the operating system all need to know that.

Scale-up vs scale-out is the architectural frame they nailed

Most “AI networking” pitches blur scale-up and scale-out together because the marketing is cleaner that way. Upscale separated them sharply, and the distinction matters.

Scale-up is GPU-to-GPU communication within a rack (sometimes across two racks now), implemented as load/store memory operations against a shared flat memory address space. The GPUs see each other’s memory as if it were local. Round-trip latency targets are sub-microsecond. Current deployed architectures land around 500-700 nanoseconds, with the UALink 1.0 spec aiming lower (100-150ns). Bandwidth requirements are extreme because any-to-any connectivity has to be zero oversubscription. Jitter, not average latency, is the killer because memory operations expect deterministic timing. Topologies are typically single-stage, with six or nine switch ASICs forming a non-blocking plane connecting all GPUs in the rack. Protocols are deliberately lightweight (UALink, ESUN, NVLink) with minimal headers and tight inter-packet gaps because every nanosecond of processing matters.

Scale-out is GPU-to-GPU across scale-up domains, implemented as memory-copy operations over RDMA. Latency targets are millisecond-class. The fabric is multi-tier (leaf-spine, sometimes deeper) with thousands or tens of thousands of switches. Congestion handling falls back on PFC, ECN, and DCQCN. The protocol (RDMA over Converged Ethernet v2, or RoCEv2) is heavier than the scale-up protocols because it needs to survive longer paths and more queuing.

These are different problems with different silicon and operational models. A switch built well for one isn’t automatically a good switch for the other. The pipeline, buffer architecture, and load-balancing requirements all diverge. Upscale’s pitch is that you cannot meaningfully optimize a single switch for both, which is why they’re building two distinct product lines.

SkyHammer is the actual bet

The scale-up ASIC is where Upscale claims clean-sheet design pays off. They announced the architecture, called SkyHammer, with the silicon to follow “very shortly.” Architecture-only announcements are a soft form of vaporware, and I won’t pretend otherwise, but the architectural choices are at least the right ones for the problem they’re trying to solve:

Standards-based and open. UALink and ESUN support, not a proprietary scale-up protocol. The bet is that the market wants optionality and will reward openness over a single-vendor solution.
Flexible semantics. The scale-up protocol space is evolving fast. SkyHammer is designed to support multiple protocols and protocol revisions without ASIC respins.
Performance-first feature set. They explicitly cut features that don’t matter for scale-up. Lightweight headers, minimal packet processing, link-level retry instead of PFC for congestion handling, switch-handled congestion instead of GPU-offloaded.
Single-stage topology assumption. The architecture assumes one switch hop between any two GPUs in the scale-up domain. That radically simplifies what the silicon needs to do.

If SkyHammer ships and hits its claimed performance numbers, it lands in a market where NVLink is the incumbent and dominant, AMD has its own scale-up story, and the UALink consortium is gathering steam. Open scale-up silicon that hits NVLink’s performance numbers is a real product. Slower-but-more-flexible silicon is a harder sell. The execution risk is real and worth naming. The architectural framing is at least correct.

Scale-out is an NVIDIA OEM play

Upscale’s scale-out story is more straightforward than the scale-up story. They leaned hard into “clean sheet, building everything from scratch” framing throughout the session, and the scale-out product does not match that framing. The scale-out boxes are systems built around NVIDIA Spectrum-X silicon, with Upscale’s SONiC-based operating system on top, plus a few additional features:

Smart FRU-ability for sovereignty customers who can’t return memory-bearing modules to the vendor on RMA
Power-aware management circuitry
AI Ops telemetry circuitry
Hitless upgrades, warm reboot, fast reload as defaults
Code signing enforcement

These are real features and the OEM-plus-software-stack model is a legitimate product play. But it is not “clean sheet from scratch.” Upscale is taking NVIDIA’s chip and building a Spectrum-X box around it with their own software. That’s fine, but it’s a different proposition from the SkyHammer story.

The interesting bit on the scale-out side is the operating system. Upscale is a SONiC premium member, contributing code, and their OS is SONiC-derived rather than wrapped around it. If the OS lives up to the AI-optimized claim (deeper telemetry, better congestion behavior, simpler operational ergonomics for AI fabrics specifically) then the Spectrum-X-plus-Upscale-OS combo could differentiate even though the silicon underneath is shared with everyone else’s Spectrum-X boxes.

What they wouldn’t claim

A few exchanges stood out because the company didn’t oversell.

“We haven’t announced any products yet.” The opener. Credit for naming the gap up front instead of pretending the architecture preview was a launch.

“This is just an example, but there are multiple different iterations and improvements that we can do.” Aravind’s hedging on the systems-level claims (cooling, mechanical design, etc.) was appropriately humble. They’re not claiming silver bullets.

“With great power comes great responsibility.” Deepti’s answer when Scott Robohn asked whether the network should really be in a position to decide whether GPUs are making money or sitting idle. Glib, but the underlying point is honest: a network that can preempt cluster-level performance issues is also a network that can cause them.

“It might be it might not be simple looking today.” Aravind on the heterogeneous-compute future where multiple XPU types share a fabric. He didn’t pretend the vision was already here.

Upscale’s market

Upscale AI is for hyperscalers, neo-clouds, and large enterprises building 1,000+ GPU AI clusters and shopping for an alternative to NVIDIA-only stacks. The pitch is openness, standards-based fabric, and silicon optimized only for AI. If you’re a traditional enterprise, you’re not the customer and you don’t need to be tracking this company.

If you’re in the AI infrastructure category, the question is whether the architectural ideas turn into shipping silicon on a useful timeline. This team has shipped before, and the engineering arguments hold up under examination. SkyHammer is the bet to track. Scale-out boxes are the near-term revenue play. The technical content from this session is some of the best on AI networking from any vendor in the cycle, even while the company itself is still selling architecture and intent until the silicon ships.

Watch the session for the traffic-pattern teardown alone, a tutorial worth twenty minutes of any senior engineer’s time. Then keep an eye on whether SkyHammer announces the chip on the timeline they implied, what its actual performance numbers look like, and whether the scale-out boxes ship with software that justifies the OS investment they’re claiming. The pitch is good. Now they need a product…

Disclosure: I attended Networking Field Day 40 as a delegate. My travel and accommodations were covered, but I was not compensated for this post and the opinions are my own. For more, please read my full Tech Field Day disclaimer.
