AI Networks Can't Use SDN. Netris Built What's Next.

Tony Mattke · 2026.05.12 · 11 min read

Alex Saroyan had about ninety seconds onstage before he made the architectural claim that the rest of the AI-networking conversation has been ducking. SDN won the data center virtualization fight a decade ago. NSX, ACI, and the whole “network in software” story carried the day for traditional virtualized workloads, and most of us got used to drawing the abstraction layer in software. None of that works for AI clusters. The bandwidth is wrong by an order of magnitude, the path through the kernel is too slow, and the GPU itself never wanted to talk to a virtual switch in the first place. So the multi-tenant cloud abstraction layer that tenants actually consume cannot live in software networking the way NSX did. It has to sit above a physical-fabric automation layer that’s smart enough to behave like a cloud.

That layer is what Netris is calling NAAM (Network Automation, Abstraction, Multi-tenancy). It’s a clunky acronym they invented and are hoping the industry picks up. The marketing is doing it no favors. The architecture underneath, on the other hand, is the most credible “what comes after SDN” pitch I’ve seen this year, and it’s worth the time to understand, even if you never end up buying the product.

Quick context on who Netris is. Alex is the CEO and co-founder. The company started life automating private-cloud and on-prem data center fabrics, then pivoted hard into the neo-cloud and AI-factory market about eighteen months ago. They’re now NVIDIA-validated, claim 12% of the neo-cloud market by cluster count, launched roughly twenty pure-AI clusters in the last year, and grew revenue 800% doing it. Take the growth numbers with the usual grain of salt, but the customer logos on the slide aren’t POCs. They’re operators running between 1,000 and 50,000 GPUs in production today, with most of them planning the path to 100k and beyond.

Scale forces the architecture

Alex showed a network diagram for a “small” 512-GPU cluster early in the deck and apologized for it being unrealistically tiny. The smallest customer cluster they work with is bigger. That tiny cluster has 1,244 network links. The eighteen-thousand-GPU cluster he referenced later has 256 racks, roughly 2,000 switches, and eight parallel fabrics: a front-end network, four rail-optimized backend planes (quad-plane via optical splitters at the SFP), NVLink rack-scale switching (nine additional switches per rack just to interconnect 72 GPUs), a DPU host fabric, and an edge layer.

Reconfiguring 1,200 links a day across multiple tenants is not a problem you solve with Ansible playbooks. Reconfiguring 50,000 links a day is not a problem you solve with anything that looks like a centralized config-push tool. The architectural choices fall out of the math. You can’t operate this with the tools the rest of us have been using to manage data centers, and pretending otherwise is how operators end up two years into a custom in-house automation project that still doesn’t work.
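
To put numbers on it, here's a rough back-of-the-envelope in Python. The per-change timings are my assumptions, not anything Netris quoted:

```python
# Back-of-the-envelope: why serial config push collapses at AI-fabric scale.
# The timings below are illustrative assumptions, not measured numbers.

LINKS_PER_DAY = 50_000          # link changes per day, per the talk
SECONDS_PER_PUSH = 15           # assumed: render + push + verify one change

# A centralized tool touching devices one at a time:
serial_hours = LINKS_PER_DAY * SECONDS_PER_PUSH / 3600
print(f"serial push: {serial_hours:,.0f} hours of work per 24-hour day")  # ~208

# Per-switch agents reconciling in parallel: wall-clock time is bounded by
# the busiest single device, not by the total change count.
CHANGES_ON_BUSIEST_SWITCH = 50  # assumed
parallel_minutes = CHANGES_ON_BUSIEST_SWITCH * SECONDS_PER_PUSH / 60
print(f"parallel agents: ~{parallel_minutes:.0f} minutes wall-clock")     # ~13
```

Eight-plus days of serial work per calendar day versus minutes in parallel. That's the math the architecture falls out of.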

Multi-tenancy is mandatory even when you only have one tenant

This was the cleanest insight of the session and the one I keep coming back to. When a GPU fails inside a tenant’s allocation, the neo-cloud operator cannot legally access the tenant’s server to swap the GPU. The data on it belongs to the tenant. So the operator has to move that server into a separate service tenant first, replace the GPU, reinstall the OS, prove it works, and move the server back to the original tenant. A single-tenant cluster, therefore, requires at least two tenants. Multi-tenancy isn’t a feature you add when your customer count grows. It’s a structural requirement of running anybody’s GPUs but your own, and any automation layer that bolts multi-tenancy on later has already lost.

This is the kind of insight you only get from the people actually running the racks. It’s also the kind of insight that makes “lift and shift your existing data center automation” pitches look hopeless.
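
For the sake of concreteness, the repair loop looks roughly like this. Every function here is hypothetical, a stand-in for whatever the operator's orchestration actually exposes:

```python
# Hypothetical sketch of the GPU-swap workflow described above. None of
# these functions are a real Netris API; they stand in for operator tooling.

def move_server_to_tenant(server: str, tenant: str) -> None:
    print(f"{server}: now in tenant '{tenant}'")

def swap_gpu(server: str) -> None:
    print(f"{server}: GPU replaced")

def reinstall_os(server: str) -> None:
    print(f"{server}: OS reinstalled")

def run_burn_in_tests(server: str) -> bool:
    print(f"{server}: burn-in passed")
    return True

def replace_failed_gpu(server: str, tenant: str, service_tenant: str) -> None:
    # The operator may not touch a server holding tenant data, so the box
    # is first moved out of the customer's isolation domain entirely.
    move_server_to_tenant(server, service_tenant)
    swap_gpu(server)
    reinstall_os(server)                 # tenant data never leaves with the box
    assert run_burn_in_tests(server), "replacement GPU failed validation"
    # Only a verified, clean server rejoins the customer tenant.
    move_server_to_tenant(server, tenant)

replace_failed_gpu("gpu-node-17", tenant="customer-a", service_tenant="svc-maintenance")
```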

No configs in the controller

This is the architectural choice that matters most, and Alex came back to it several times. The Netris controller stores topology, IPAM, intent, policies, and the desired state of the network in object form. It never stores device configurations. Every switch runs a Netris agent. The agent reads the controller’s high-level state, knows what the switch is supposed to be (its role in the topology, its neighbors, the policies that apply), and generates the right config for whatever NOS and ASIC the box happens to ship with. Replace a Cumulus switch with an Arista switch, and the agent regenerates the config from scratch. There’s no config to migrate because there was never a config to begin with.
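
A minimal sketch of the model, with an invented schema (this is not Netris's actual object model or output): the agent holds no config of its own, just the controller's intent, and renders from scratch every time.

```python
# Sketch of a stateless config-rendering agent. The object model and the
# FRR-style output are invented for illustration; Netris's schema differs.

DESIRED_STATE = {   # what the controller stores: intent, not configs
    "switch": "leaf-12",
    "role": "leaf",
    "asn": 4200000112,
    "loopback": "10.0.0.12/32",
    "uplinks": [
        {"iface": "swp49", "peer": "spine-1", "peer_asn": 4200000001},
        {"iface": "swp50", "peer": "spine-2", "peer_asn": 4200000002},
    ],
}

def render_bgp_config(state: dict) -> str:
    """Regenerate a full config from intent. An empty replacement switch
    gets identical output, so there is nothing to back up or migrate."""
    lines = [
        f"router bgp {state['asn']}",
        f" bgp router-id {state['loopback'].split('/')[0]}",
    ]
    for link in state["uplinks"]:
        lines.append(f" neighbor {link['iface']} interface remote-as {link['peer_asn']}")
    lines.append(" address-family l2vpn evpn")
    for link in state["uplinks"]:
        lines.append(f"  neighbor {link['iface']} activate")
    lines.append(" exit-address-family")
    return "\n".join(lines)

print(render_bgp_config(DESIRED_STATE))
```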

“Switch dies, you replace the switch, and the agent is again, oh my gosh, empty switch. Let me generate the config.”

— Alex Saroyan

That’s a real shift in operating model, not a marketing bullet. It also explains why one of the delegates asked the obvious enterprise-engineer question: “Do you back up your switch configs?” The answer was clean. No. There’s nothing to back up. You back up the controller. The agents will rebuild every config they need from scratch the next time the box boots. That answer would horrify a traditional NOC and is exactly correct for a cluster running thousands of switches.

The other consequence: the agent is a distributed system. There’s no single brain trying to push state to thousands of devices serially. Each agent works in parallel on its own box, and the controller only deals with the high-level intent. That’s the only way the math works at AI scale, and most of the automation tools you’ve used were built around the opposite assumption.

Simulation before hardware

The lifecycle pitch was the most useful operational story in the session. GPUs are far too expensive to let sit idle while the network team figures out the design. The moment the racks are powered on, the company expects them to be earning. So the workflow has to validate the design before the hardware lands.

Netris models the full topology in the controller during the order-to-delivery window, then spins up a digital twin (their own CloudSim or NVIDIA Air) and brings the simulated network all the way up. Real BGP sessions, real EVPN, real config generation by the agents, but on simulated hardware. Architects walk through the topology, find the “oh, we forgot the upstream” or “wait, where does storage attach” problems while there’s still time, and tweak the model. When the physical hardware arrives, the agents come up against the same controller they were already running against in the simulation. Same intent, same configs, less surprise.

In practice, Alex pegged the typical “miswired cable” rate at around 5% in real deployments at scale. With a list of exactly which cables are in the wrong port, the install team has a punch list instead of a mystery. That’s a meaningful operational win that pre-validation makes possible.
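
The punch list falls out of a straightforward diff between the modeled topology and the LLDP neighbors the live switches report. A toy illustration of the idea, not Netris's code:

```python
# Sketch: turn "5% of cables are miswired" into a punch list by diffing the
# intended topology against discovered LLDP neighbors. Illustrative only.

intended = {  # (switch, port) -> expected far end, from the controller's model
    ("leaf-1", "swp1"): ("gpu-node-1", "eth0"),
    ("leaf-1", "swp2"): ("gpu-node-2", "eth0"),
    ("leaf-1", "swp49"): ("spine-1", "swp12"),
}

discovered = {  # same shape, but as reported by LLDP on the live switches
    ("leaf-1", "swp1"): ("gpu-node-1", "eth0"),
    ("leaf-1", "swp2"): ("gpu-node-3", "eth0"),   # wrong server cabled in
    ("leaf-1", "swp49"): ("spine-1", "swp12"),
}

punch_list = [
    (port, want, got)
    for port, want in intended.items()
    if (got := discovered.get(port)) != want
]

for (sw, port), want, got in punch_list:
    print(f"{sw} {port}: expected {want}, found {got}")
# leaf-1 swp2: expected ('gpu-node-2', 'eth0'), found ('gpu-node-3', 'eth0')
```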

One VPC, eight fabrics

The cloud abstraction layer Netris exposes to consumers is unapologetically borrowed from AWS. VPC for tenant isolation. VNet for the segment-equivalent (close to AWS’s subnet, but generalized so it can be layer two or layer three, VLAN or VXLAN, with or without a default gateway). VPC peering for cross-tenant access to shared services like storage. Elastic IPs for one-to-one NAT. Elastic load balancers for one-to-many. Direct Connect for tenant-owned external links via BGP into the tenant’s VPC. VPC-aware ACLs.

Alex’s argument was direct: this object model has been proven at hyperscale for fifteen years, your tenants already know it, do not reinvent it. The interesting part is what happens underneath. A VPC isn’t just a VRF. It’s a coherent abstraction that fans out across every fabric in the cluster. The controller turns it into:

  • VRFs and L2 or L3 VXLANs on the Ethernet front-end and back-end
  • Partition keys and GUID assignments on the InfiniBand fabric (synchronized with NVIDIA’s UFM controller)
  • GPU partitioning on the NVLink rack-scale switches, so 72 GPUs can be carved into 4-GPU and 8-GPU subgroups, isolated from each other

The customer creates one object. The controller decides which physical fabrics need touching and which primitives to configure on each. That’s the abstraction part of NAAM actually doing work, and it’s the difference between selling tenants a switch port and selling them a cloud.
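
To make "one object, many fabrics" concrete, here's a hedged sketch of the fan-out. The field names and derivations are invented, not Netris's API:

```python
# Sketch: one tenant-facing VPC object fanning out to per-fabric primitives.
# Field names and mappings are invented for illustration.

def fan_out_vpc(vpc_id: int) -> dict:
    return {
        # Ethernet front-end and back-end: a VRF plus VXLAN identifiers
        "ethernet": {"vrf": f"vpc-{vpc_id}", "l3vni": 100000 + vpc_id},
        # InfiniBand: a partition key pushed to NVIDIA's UFM controller
        "infiniband": {"pkey": hex(0x1000 + vpc_id)},
        # NVLink rack-scale: which GPU subgroup(s) this tenant owns
        "nvlink": {"partitions": ["rack-7/gpus-0-7"]},
    }

print(fan_out_vpc(42))
```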

SoftGate is bare metal because containers don’t scale

The under-discussed piece of the architecture is what Netris calls SoftGate. It’s their NAT, layer-4 load balancer, and DHCP cluster, and it runs on dedicated bare-metal x86 servers. Not VMs. Not containers. The data plane is XDP-accelerated C code, with their own implementation of Google’s Maglev consistent-hashing algorithm so the SoftGate cluster can scale horizontally without sharing state.
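
Maglev itself is public (Google published it at NSDI 2016), and the table-population loop is small enough to sketch. This is illustrative Python, emphatically not SoftGate's XDP/C data plane:

```python
# Minimal sketch of Maglev consistent hashing (Google, NSDI 2016).
import hashlib

def _h(key: str, salt: str) -> int:
    return int(hashlib.sha256(f"{salt}:{key}".encode()).hexdigest(), 16)

def build_table(backends: list[str], m: int = 65537) -> list[str]:
    """Populate the lookup table; m should be prime and >> len(backends)."""
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    nxt = [0] * len(backends)
    table: list = [None] * m
    filled = 0
    while True:
        for i, b in enumerate(backends):
            # Walk backend i's permutation to its next unclaimed slot.
            c = (offsets[i] + nxt[i] * skips[i]) % m
            while table[c] is not None:
                nxt[i] += 1
                c = (offsets[i] + nxt[i] * skips[i]) % m
            table[c] = b
            nxt[i] += 1
            filled += 1
            if filled == m:
                return table

def pick(table: list[str], flow: str) -> str:
    # Every node computes an identical table, so any node maps the same
    # 5-tuple to the same backend with no shared state.
    return table[_h(flow, "flow") % len(table)]

table = build_table(["sg-1", "sg-2", "sg-3"])
print(pick(table, "203.0.113.7:443->10.0.0.5:55123"))
```

Because the table is a pure function of the backend list, every SoftGate node arrives at the same flow-to-backend mapping independently. That's the "scale horizontally without sharing state" part.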

Their stated reason for going to bare metal: containers will not deliver multi-tenant, overlapping-IP NAT and load balancing at cloud scale, period. They tried. “That will work for a lab, but that will not be performant… you need to deal with low-level development.”

This is the kind of architectural detail that only shows up when somebody actually built the thing and discovered which pieces of the off-the-shelf stack don’t survive contact with production. SoftGate is also the bit that lets a Netris-built neo-cloud function like AWS from the tenant’s perspective. Without it, you have a beautifully orchestrated fabric and no way to give tenants public IPs that don’t collide with each other.

DPUs as full fabric members, not islands

The DPU integration story was the other piece I came in skeptical about and walked out impressed by. The naive way to use DPUs in a multi-tenant cluster is to run an overlay between the DPUs themselves and treat the switch fabric as pure transport. Alex was clear: that approach has real problems. Only DPU-enabled hosts can be VTEPs in that model, which means non-DPU servers, edge gateways, and external direct-connect circuits can’t join the same overlay. It also doesn’t scale.

Netris does the opposite. They run EVPN/BGP between every leaf switch and every DPU. The DPU becomes a first-class citizen of the fabric, not an island connected through it. A single VPC can span DPU virtual functions on GPU hosts, switch ports on storage servers, and external BGP peers, all in one isolation domain. That’s how you build a fabric that behaves like a cloud, and it’s the right architectural answer.
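
In config terms, the leaf's BGP neighbor list simply grows entries for the attached DPUs. A toy rendering with invented interfaces and ASNs:

```python
# Sketch: DPUs as first-class EVPN/BGP fabric members. A leaf peers with
# its spines *and* the DPUs on its attached hosts. Values are invented.

neighbors = [
    {"name": "spine-1",     "iface": "swp49", "asn": 4200000001},
    {"name": "spine-2",     "iface": "swp50", "asn": 4200000002},
    # The island model stops here. The full-membership model keeps going:
    {"name": "dpu-host-07", "iface": "swp7",  "asn": 4200001007},
    {"name": "dpu-host-08", "iface": "swp8",  "asn": 4200001008},
]

for n in neighbors:
    print(f"neighbor {n['iface']} interface remote-as {n['asn']}  ! {n['name']}")
# A VTEP on a DPU, a leaf, or an edge gateway all land in the same EVPN
# control plane, which is why one VPC can span all of them.
```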

Honest moments

A few things landed harder because the company didn’t oversell.

“There’s no magic button between Netris and NetBox.” When delegate Jason Gintert asked about NetBox integration, Alex’s answer was clean: customers write a few lines of code between the two APIs. No productized button, no glossy slide. Refreshing in a market where every vendor claims one-click integration with everything.

“Customers built their own and migrated to us.” Alex noted that some of their customers came in on cluster eight after building their own automation in-house for clusters one through four. That’s the most useful sales signal in this category, because the hyperscalers proved the pattern works internally years ago, and everyone else is now learning they can’t afford to clone the team that built it.

“40% of clusters are still on InfiniBand.” The Ethernet-vs-InfiniBand fight is not over and Netris supports both via UFM integration for the multi-tenancy layer. Worth saying out loud because most of the “Ethernet won” pieces are written by people selling Ethernet.

“You’re never designing one cluster.” If your company is sustainable in AI, you are deploying more clusters every year, on newer GPU generations, with more fabrics each cycle. The architecture has to assume continuous deployment of new infrastructure, not a one-time build. That’s the design constraint that informs everything else.

Who this is for

Netris is a strongly opinionated stack for a specific market: neo-cloud operators, AI factories, and organizations building 1,000-to-100,000-plus-GPU clusters with multi-tenancy as a hard requirement. If you’re a traditional enterprise running a sixteen-rack data center, the ROI math doesn’t work, and you are not the customer. They aren’t pretending otherwise, and I won’t either. Trying to backfit NAAM into a small enterprise DC is the wrong use of the tool.

If you are in the neo-cloud or AI-factory category, this is the most coherent pitch I’ve heard for the layer between your physical fabric and your tenant-facing cloud portal. The hyperscalers built something that looks like NAAM internally a decade ago and have been operating it ever since. Netris is productizing that pattern for the operators who don’t have AWS-scale engineering teams to clone it from scratch. The architectural choices (no configs in the controller, distributed agents, simulation-before-hardware, AWS-shaped consumption model, full fabric-membership for DPUs, bare-metal SoftGate) all hang together.

If you’re sketching out the automation layer for your next cluster, watch this session before you write another sprint of in-house Python.

Disclosure: I attended Networking Field Day 40 as a delegate. My travel and accommodations were covered, but I was not compensated for this post, and the opinions are my own.
