The complexity required for robustness often goes against robustness

In the past few months we have seen major outages from United Airlines, the NYSE, and the Wall Street Journal. With almost 5,000 flights grounded and the NYSE halting trading, the cost of failure is high. When bad things happen, IT personnel everywhere look at increasing fault tolerance by adding redundancy mechanisms or protocols to increase robustness. Unfortunately, the complexity that comes with these additional layers often comes with compromise.

The last thing your boss wants to hear is, “The network is down!” Obviously it’s your job to prevent that from happening, but at what cost? Many of us enjoy twisting those nerd knobs, but that tends to harbor an environment with unique problems. I too fear the current trend of adding layer after layer of network duct tape to add robustness, or worse, to try and fix shortcomings in applications. NAT, PBR, GRE, VXLAN, OTV, LISP, SDN… where does it end!?

The greater the complexity of failover, the greater the risk of failure. We often forget the lessons of our mentors, but keeping the network as simple as possible has always been best practice. As Dijkstra said, “Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.” This is a fundamental design principle that is often overlooked by enthusiastic network engineers or, even worse, sales or marketing engineers who are trained to sell ideas that only work in PowerPoint. When planning out your latest and greatest network design, each and every knob that you tweak puts you farther and farther into uncharted territory. While it may work for now, you’ll be the only one running anything close to those features in unison. And when, not if, you have to call TAC, they have to understand the fundamental design of the network BEFORE they can troubleshoot it. Validated designs exist for a reason.

At this point I would encourage you to read up on a couple of infamous network outages, including Beth Israel Deaconess Medical Center in Boston, whose spanning tree problem took the network down for four days, and the story of how the IT Division of the Food and Agriculture Organization of the United Nations recently had a rather serious, but brief, four-hour outage…

While both of these outages were simple in nature, the complexity of the growing network was key in causing the failure. A lack of continuous design, periodic review, and, most importantly, failover testing inherently nurtures failure.
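Spanning tree meltdowns like the one at Beth Israel are usually contained not by adding more layers, but by a few simple, well-understood knobs. As a minimal, hypothetical sketch (assuming a Cisco IOS access switch; interface ranges and VLAN numbers are placeholders), edge-port hardening limits how far a single misbehaving device can drag the topology:

```
! Hypothetical access-layer edge ports: skip listening/learning and
! shut the port down if it ever receives a BPDU (i.e., someone plugs
! in a rogue switch)
interface range GigabitEthernet1/0/1 - 48
 spanning-tree portfast
 spanning-tree bpduguard enable
!
! Pin the root bridge deliberately instead of letting an election
! surprise you after a topology change
spanning-tree vlan 1-100 root primary
```

The point is not these specific commands, but that simple, boring protections that you have actually tested tend to fail far more gracefully than clever ones you have not.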

This article was originally posted on the Cisco Blogs. I am cross-posting it here now, after waiting six months since the original article.

