Monday, October 18, 2021

Facebook Down Event - The dilemma of a CTO, Black Swans & Fallacies of IBN


Facebook has published a somewhat lengthy root cause analysis (https://lnkd.in/guptSB3u) of its recent worldwide network outage, which cut Facebook, WhatsApp & Instagram off from the internet completely. It must have raised some concerns in the worldwide CTO and CIO community.


Historically, they have been told, and pretty much every vendor in the networking industry preaches, that the ways & methods (Systems, People & Processes) to avoid such circumstances are bright & magical ideas such as:

- Automation & Orchestration
- Intent Based Networking (IBN)
- Software Defined Networking (SDN)
- Centralized Controllers
- Data Models
- Automated Test & Deployment Pipeline with Unit Tests (aka CICD)
- Reliability & Resiliency Engineering
- AI/ML OPS
- Network Design Principles (Hierarchy, Swim Lanes, Segmentation etc.)
- Streaming Telemetry
- Observability Tools
- Bright Engineers
- Testbed Equipment
- Network Modelling & Simulation Tools with Formal Verification
- Rigorous platforms testing (HW/SW)
- Single Source of Truth
- Chaos Engineering
- BCP Plan
- Correlation Tools & what not

But if you go through this checklist, Facebook would probably tick every one of these items, and so would any of the FAANG companies at this stage.

"So assuming you are a CTO or CIO, what would you suggest as possible next steps to your CEO & board if you were called into a meeting this week to discuss how we ensure such events don't happen in our network?"

So let's park the above question for a while and move on to the reactions we have seen so far.

1. The usual suspect: bad things happen and everything breaks at some point, so focus on the RCA, move on and ensure it doesn't happen again

2. The Network Architect's favorite answer... "it depends"

3. Was it a People or Process issue?

4. The conspiracy theory that FB was under a Cyber attack which they don't want to disclose

5. Blame BGP (the easy suspect)... interestingly we got 10000+ new BGP experts on Twitter and LinkedIn overnight :) despite the fact that 99% of them hardly understand the details of BGP, since none of them looked at the problem from the perspectives of "unintended consequences", "ripple effects", "interaction surfaces", "failure domains" & so forth, on top of all the pointers I listed above. So let's say blaming BGP was an easy pick for the "ghost" network engineers, even though the RCA published by Facebook doesn't cover any technical details either.

6. "The Black Swans" - This is an interesting one, and a less talked about angle of this outage. While some may claim this was just one of those black swan events, I personally doubt that seriously, all the more so in the absence of a detailed RCA.

HTH...

A Network Artist 🎨
