The complexity required for robustness often goes against robustness
In the past few months we have seen major outages at United Airlines, the NYSE, and the Wall Street Journal. With almost 5,000 flights grounded and the NYSE halting trading, the cost of failure is high. When bad things happen, IT personnel everywhere look to increase fault tolerance by adding redundancy mechanisms or protocols for greater robustness. Unfortunately, the complexity that comes with these additional layers often comes with compromise.

The last thing your boss wants to hear is, “The network is down!” Obviously it’s your job to prevent that from happening, but at what cost? Many of us enjoy twisting those nerd knobs, but that tends only to foster an environment with unique problems. I too fear the current trend of adding layer after layer of network duct tape to add robustness, or worse, to try to fix shortcomings in applications. NAT, PBR, GRE, VXLAN, OTV, LISP, SDN… where does it end!?

The greater the complexity of failover, the greater the risk of failure. We often forget the lessons of our mentors, but keeping the network as simple as possible has always been best practice. As Dijkstra said, “Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.” This is a fundamental design principle that is often overlooked by enthusiastic network engineers or, even worse, sales or marketing engineers who are trained to sell ideas that only work in PowerPoint. When planning out your latest and greatest network design, each and every knob that you tweak puts you farther and farther into uncharted territory. While it may work for now, you’ll be the only one running anything close to those features in unison. And when, not if, you have to call TAC, they have to understand the fundamental design of the network BEFORE they can troubleshoot it. Validated designs exist for a reason.

At this point I would encourage you to read up on a couple of infamous network outages, including the one at Beth Israel Deaconess Medical Center in Boston, whose spanning-tree problem took the network down for four days, and the story of how the IT Division of the Food and Agriculture Organization of the United Nations recently had a rather serious, but brief, four-hour outage…

While both of these outages were simple in nature, the complexity of the growing network was key to causing the failure. A lack of continuous design, periodic review, and, most importantly, failover testing inherently nurtures failure.
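The Beth Israel outage mentioned above was a spanning-tree meltdown, and it illustrates how a small amount of well-understood, boring configuration beats piling on another redundancy layer. As a rough sketch only (the interface names and VLAN number here are hypothetical, and the right values depend entirely on your topology), basic Cisco IOS spanning-tree safeguards look something like this:

```
! Hypothetical Cisco IOS sketch: contain spanning-tree failures
! rather than layering more complexity on top of them.

! Pin the root bridge deliberately instead of leaving the
! election to chance.
spanning-tree vlan 10 root primary

! Host-facing access port: a looped or misbehaving device should
! err-disable this one port, not melt the whole layer-2 domain.
interface GigabitEthernet0/1
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable

! Downstream-facing port: refuse superior BPDUs so no device
! below can claim the root role.
interface GigabitEthernet0/24
 spanning-tree guard root
```

None of this is exotic; the point is that a handful of plainly understood safeguards, verified by actual failover testing, does more for robustness than a clever new overlay.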

This article was originally posted on the Cisco Blogs. I am cross posting it here now, after waiting 6 months since the original article.