Wednesday, December 22, 2021

The QOS Fallacies & Failures in a Modern Hybrid IT World - Part 1 of 2

 


When I last wrote about QOS, around 7 years ago, I must say I had high hopes for SDN & IBN as both paradigms were still evolving. Since then there have been some unsuccessful attempts by a few vendors to automate QOS by throwing some sort of controller into the mix, while others claim they can solve this problem with the mighty IBN.

As we are about to move into 2022, many still wonder if QOS makes any sense at all in the context of modern networking.

In order to find the answers, let's break the problem into two parts:

1. Why has QOS been so unsuccessful historically?
2. What are our options moving forward?

So let's focus on point 1 to begin with.

1. How do we get started? - Interestingly enough, over a dozen books have been written on QOS in the context of IP networking over the last two decades or so, and they mostly cover details such as congestion management vs. congestion avoidance and so forth. But very few of them actually get into platform specifics in terms of capabilities and dependencies (both HW & SW). More interestingly, I personally haven't yet come across a single QOS book that gives you any practical advice or a framework/methodology for gathering technical requirements from applications in order to plan and craft a QOS policy. I have often seen people struggling to come up with one, and given that most QOS deployments are tactical rather than strategic, they usually have tight time constraints too. That's why people so often end up copying references from recommended design guides etc., which hardly ever work in real life (unless you got lucky!).
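
To make that requirements-gathering step concrete, here is a minimal sketch of an application inventory feeding a traffic-class decision. The thresholds, class names and application list are all illustrative assumptions on my part, not a recommendation:

```python
from dataclasses import dataclass

# A minimal application-requirements inventory from which a QOS policy
# could be derived. Thresholds and class names are illustrative only.

@dataclass
class AppProfile:
    name: str
    max_one_way_latency_ms: int  # tolerated one-way latency
    max_loss_pct: float          # tolerated packet loss
    bandwidth_kbps: int          # expected per-flow bandwidth

def classify(app: AppProfile) -> str:
    """Derive a tentative traffic class (DSCP name) from measured needs."""
    if app.max_one_way_latency_ms <= 150 and app.max_loss_pct <= 1.0:
        return "EF"    # real-time: voice-like
    if app.max_one_way_latency_ms <= 400:
        return "AF41"  # interactive: video/VDI-like
    return "AF21"      # transactional / bulk default

apps = [
    AppProfile("voip", 150, 1.0, 100),
    AppProfile("video-conf", 300, 1.5, 2000),
    AppProfile("erp", 1000, 2.0, 500),
]
policy = {a.name: classify(a) for a in apps}
```

The point isn't the code, it's the exercise: you can't fill in those numbers without actually measuring what each application needs.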

2. Benchmarking & Capacity Mgmt. - Most small, medium and even a couple of the large enterprises that I have worked with, including Telcos, don't seem to have either of these as mature practices. Benchmarking, though, is one of the key exercises you need to get through to craft a good QOS policy, besides being a necessity in any mature capacity mgmt. framework/practice. The other problem you are likely to run into here is that effective benchmarking & capacity mgmt. require investing in additional visibility & performance mgmt. tools to gather the required details. These are usually quite expensive, and on top of that you need to train your team on the tools and the required operating skills (for example statistical analysis, time series, sampling details etc.). At times the given tool may not natively offer the reporting/data you need, which means you are dependent on the tool vendor and on whether their product road-map is aligned to your priorities and timelines. Also check whether the tool allows you to run custom reports or export the required data in the format you need. So if your capacity mgmt. still lives in Excel sheets, you know where you are heading.

3. Measuring latency incorrectly - This is perhaps more common than you might think, and again most QOS books offer no practical advice here either, nor details on how latency needs to be broken down across the spectrum. Can you tell me the breakup of end-to-end latency (server to client), and that too on a hop-by-hop basis?
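
To illustrate that hop-by-hop breakup, a back-of-envelope decomposition might look like this; every number in it is illustrative:

```python
# Back-of-envelope one-way latency decomposition; all numbers illustrative.
# Per hop: serialization (clocking bits onto the wire) + propagation
# (~5 us per km in fiber) + queueing + processing.

def serialization_ms(packet_bytes: int, link_bps: float) -> float:
    return packet_bytes * 8 / link_bps * 1000

def propagation_ms(km: float) -> float:
    return km * 0.005  # ~5 microseconds per km of fiber

hops = [
    # (packet_bytes, link_bps, fiber_km, queueing_ms, processing_ms)
    (1500, 1e9,   0.1, 0.2, 0.05),  # campus switch
    (1500, 100e6, 800, 4.0, 0.10),  # WAN edge router
    (1500, 10e9,  0.1, 0.1, 0.05),  # DC leaf
]

total_ms = sum(
    serialization_ms(b, bps) + propagation_ms(km) + q + p
    for b, bps, km, q, p in hops
)
```

Notice how a single WAN hop dominates the budget; "the network is slow" means nothing until you know which of these four components is eating the milliseconds.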

4. QOS Lifecycle Mgmt. - While this area has improved a bit with modern networking gear, the majority of equipment still out there is older and doesn't offer much when it comes to QOS lifecycle mgmt. across Plan, Design & Implementation. More importantly, what it lacks are capabilities such as real-time QOS monitoring & reporting, along with correlation with network health & events. After all, you don't want to hire someone today to type a couple of show commands every few minutes, and running scripts won't be that helpful either for the most part when you are dealing with scale. Again, there are a couple of commercial and open-source tools available, but it's an exercise that takes time and resources, and you should know exactly what you are looking for in which scenario. BTW... whether that resource comes from the planning team, tools team or ops team is something I leave for you to figure out in real life. :)
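
The monitoring problem above can be reduced to something like the following sketch, i.e. diff per-queue drop counters between polls and flag rising drops instead of eyeballing show-command output. Queue names and counter values are hypothetical:

```python
# QOS monitoring at scale, reduced to its essence: compare per-queue drop
# counters between two polls and flag queues whose drops are rising.
# Queue names and counter values below are hypothetical.

def rising_drop_queues(prev: dict, curr: dict, threshold: int = 0) -> list:
    """Return queues whose drop counter grew by more than threshold."""
    return sorted(q for q in curr if curr[q] - prev.get(q, 0) > threshold)

prev_poll = {"EF": 0, "AF41": 120, "BE": 5000}
curr_poll = {"EF": 0, "AF41": 180, "BE": 5000}
alerts = rising_drop_queues(prev_poll, curr_poll)
```

Correlating those alerts with network health events is the part most tooling still leaves to you.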

Also, with the rise of modern solutions mostly built around magical controllers and overlays, you must think carefully about QOS FCAPS capabilities in such environments. For example, while from the overlay protocol perspective everything might be just a single hop away, the packet eventually still passes through the physical world in the underlay. So it's important to find out early:

- Whether QOS policies are static or dynamic in nature (given you are using a controller of some sort)
- How QOS statistics are correlated between underlay & overlay
- How policy gets propagated
- How the controller interacts with other systems and policies to introduce dynamic behavior
- Whether the system allows time-based QOS policies (interestingly enough, most don't yet)
- Whether dummy-policy dry runs are supported
- How your QOS policy gels with your ISP agreements, and how the systems would talk to each other (if at all) depending upon the SLA, performance & visibility/reporting requirements both parties agree upon

5. Policy Stitching - This is one of the hardest parts to get across, and more so in a multi-vendor environment. As mentioned earlier, most QOS books and vendor QOS courses don't cover much detail around platform specifics and just assume that one would figure it out. Things get pretty complicated pretty quickly the moment you realize that QOS depends on:

- Platform and Specific Model you are using
- NOS version
- ASIC Architecture (ASIC Pipeline, Buffer, Memory type & speed, Over subscription, Queue/Dequeue algorithm etc.)
- Chassis specifics (in case you are using one as opposed to fixed form factor) - example VOQ, Switch Fabric, Fabric Generation, Fabric Modules Count etc...
- Supervisor engine & architecture, including its generation and CEF vs. dCEF style implementation specifics
- Policy Framework supported by NOS - example hierarchical QOS, support for sub-interfaces/Logical interfaces, how policy aggregation works and in which direction etc.
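
A crude way to reason about these dependencies before stitching one policy across vendors is a capability matrix. Everything in this sketch - the platform names, feature flags and policy requirements - is hypothetical:

```python
# Policy stitching sketch: before pushing one "standard" QOS policy
# everywhere, check each platform's capabilities. The capability matrix
# and feature names below are hypothetical, not any vendor's data.

CAPS = {
    "edge-router-A":   {"hqos", "subinterface-policy", "8-queues"},
    "campus-switch-B": {"8-queues"},
    "dc-switch-C":     {"8-queues", "voq"},
}

def unsupported(platform: str, policy_needs: set) -> set:
    """Return the features the policy needs but the platform lacks."""
    return policy_needs - CAPS.get(platform, set())

needs = {"hqos", "8-queues"}
gaps = {p: unsupported(p, needs) for p in CAPS}
```

In real life that matrix lives across datasheets, release notes and ASIC docs, which is exactly why stitching is hard.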

6. Modern App. Architectures - Some of the new buzzwords in the application space these days are Cloud Native Apps, Microservices, Containers & Kubernetes. One might wonder how to go about planning QOS for such environments, which are highly dynamic in nature, with complex topologies, a mix of short- and long-lived flows, and interaction surfaces with other systems and tools (such as distributed tracing) that could feed back into your QOS mgmt. tool.

7. Complexity Induced by Networks - There are some very common choices that every network architect makes at some point which further complicate QOS implementation. These complexities often must exist in order to deliver the desired outcomes and are hardly avoidable, such as:

- MC-LAG aka Port-Channels/Ether-Channels/Bundle Interfaces
- Multi-tenant Networks (Remember you only got few queues in HW)
- Dynamic Network Traffic Patterns (Even more so with TE Controllers) during stable conditions vs. failure conditions
- How TC gets implemented by a given vendor in given platform & NOS
- Inflated throughput & performance numbers by vendors (very common)
- Different SP QOS models (customer facing vs. core facing), as the core usually has no more than 3 bits, i.e. 8 classes, to play with, besides the allocation models
- QOS in Dual Stack Networks vs. IPv6 only Networks
- Some nerd knobs such as QPPB
- Impact of Physical & Logical topology
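
As an example of that 3-bits/8-classes constraint, a DSCP-to-EXP compression at the provider edge might be sketched like this; the mapping is illustrative, not any vendor's default:

```python
# SP-edge sketch: the MPLS core typically exposes only 3 bits (EXP),
# i.e. 8 classes, so a richer enterprise DSCP scheme has to be
# compressed at the PE. This mapping is illustrative, not a default.

DSCP_TO_EXP = {
    46: 5,  # EF   -> real-time class
    34: 4,  # AF41 -> interactive class
    18: 2,  # AF21 -> transactional class
    0:  0,  # BE   -> best effort
}

def to_exp(dscp: int) -> int:
    """Map a 6-bit DSCP to a 3-bit EXP; unmapped values fall back to BE."""
    return DSCP_TO_EXP.get(dscp, 0)
```

The lossy compression is the point: several enterprise classes inevitably share one core class, and you must decide which ones.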

Hope you find this helpful, and let's continue with this in Part 2.

HTH...

A Network Artist 🎨

Wednesday, November 17, 2021

A Simple Routing Protocols Decomposition Model - Part 2 (Peering Mgmt.)

 


So in the first part of this series, we started with a rather simple decomposition model to get a bit more insight into routing protocol internals.

So let's continue the series by expanding on the very first layer in our model - Peering Management.

In any routing protocol, before we get fancy about which features, functions and knobs to use, the very basic requirement is to peer with other devices in the network, since a routing protocol is eventually nothing but a distributed database. Routing protocols are designed to convey different sets of information to their peer devices, such as:

- Topology Information

- Reachability Information

- Policy Information

The type of information that a given routing protocol exchanges with its peers largely depends on the protocol itself (OSPF, IS-IS, BGP) and where that routing protocol is used in the network (Campus, WAN, Metro-E, DC), as implementation specifics do change.






As you may notice, there are a lot of things working behind the scenes when it comes to peering in a routing protocol context. But don't get carried away by the apparent complexity. Once you look closely, all of these pieces make sense.

So let's start with the top row, reading it from left to right. 

Self Identity - Before the routing protocol determines whom it needs to communicate with and what information needs to be exchanged, it must first establish its own identity. The most common way to give an identity to a routing protocol instance is to assign it a router ID (RID). The RID can be configured manually or derived automatically, depending upon the platform and NOS.
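
The automatic derivation commonly follows a preference order - configured RID wins, then the highest loopback address, then the highest interface address. Treat this sketch as an assumption to verify against your platform's documentation:

```python
import ipaddress

# Sketch of a common RID selection order (verify against your NOS docs):
# an explicitly configured RID wins, else the highest loopback address,
# else the highest address on any interface.

def select_rid(configured, loopbacks, interfaces):
    if configured:
        return configured
    pool = loopbacks or interfaces
    return max(pool, key=lambda ip: int(ipaddress.IPv4Address(ip)))
```

This is also why an unplanned loopback can silently change your RID on the next process restart.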

In real life, given that network virtualization is much more common today, you are allowed to configure a unique RID for each routing protocol, as well as for each instance/process under the same protocol.

Protocol Addressing - Once the routing protocol has defined its identity with a RID, the next step is to understand its addressing. Addressing serves many purposes, but the most basic one is to provide location services from the overall network view standpoint. Though protocols such as LISP were an attempt to separate device identity from device location, due to limited use cases (such as mobility) and other problems it never really took off.

Every routing protocol has its own addressing scheme, which may further impact its scaling and expected behavior if not done correctly.

Participation - The next step for the routing protocol is to determine its participating interfaces on a given device, and in certain cases whether the entire device participates. Depending upon the design, you may run into some interesting challenges here though.

Reliable Transport - Every routing protocol needs a reliable transport to communicate effectively with its peers. Reliability itself is an important aspect: while some routing protocols such as BGP rely on the existing TCP stack, OSPF runs directly over IP protocol 89, and EIGRP chose its own transport, namely RTP (Reliable Transport Protocol).

Neighbor Discovery - In the next step the protocol must discover its peer/neighbor/adjacent device (the term depends on which routing protocol you are following). While the protocol should ideally keep track of its neighbor and the relationship state, this may or may not be implemented.

The neighbor can be configured manually or discovered dynamically, while the discovery phase itself may use unicast or multicast as the transport for reaching the next-hop device. Reachability to the neighbor can also be over a layer 2 or layer 3 transport depending upon the protocol: IS-IS and many industrial IoT protocols operate at layer 2, for example. In the case of eBGP, the neighbor may in fact be multiple physical hops away.

Neighbor Identity - While we may have discovered our neighbor, that doesn't mean the neighbor is always an intended or legitimate one. After all, someone might want to spoof, or we may end up discovering somebody completely unintentionally, and sharing information with an unexpected neighbor won't make any sense. To prevent this we have several measures we can put in place, such as authentication, validating the neighbor's identity (remember they also have a RID), and validating the neighbor based on the IP packet's TTL value, as in OSPF and BGP. Modern solutions such as SD-WAN usually use PKI-based authentication over a TLS/DTLS channel.
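
The TTL-based check (GTSM, RFC 5082) can be sketched in a few lines:

```python
# GTSM (RFC 5082) sketch: a directly connected peer originates packets
# with TTL 255, so a packet arriving with a lower-than-expected TTL must
# have crossed more hops than allowed and gets rejected.

def gtsm_accept(received_ttl: int, max_hops: int = 1) -> bool:
    """Accept only packets that could have originated within max_hops."""
    return received_ttl >= 255 - (max_hops - 1)
```

The trick is that an attacker many hops away can never present a TTL of 255, since routers only ever decrement it.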

Establish Session - Once the neighbor is discovered and validated, we finally establish a session with it. Depending upon the protocol, we might have a single session or multiple sessions. A simple example would be a network running IPv4 & IPv6 at the same time under a single routing protocol instance: while some implementations exchange both IPv4 and IPv6 information over a single session, others may use a separate dedicated session for each. A long time ago there was even an attempt to run multi-session BGP for MTR (Multi Topology Routing) for network virtualization use cases.

Capabilities Exchange - This is another interesting step, where the routing protocol instances running on separate devices exchange capabilities with each other to find the lowest common denominator. For example, we know BGP is more like an application that runs on top of TCP than a pure layer 3 routing protocol that only carries routes. While BGP can carry layer 3 routing information, and with most vendors that's the default behavior, it also allows us to carry many other sets of information depending upon the use case, in the form of AFI/SAFI, which is essentially an encoding format. For example, BGP can carry MAC address information under the Layer 2 VPN EVPN address family when enabled. With BGP, though, you have to be cautious about enabling a new capability/address family in a production network, as highlighted here.
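
That lowest-common-denominator idea boils down to a set intersection; the capability names here are illustrative:

```python
# Capabilities-exchange sketch: each side advertises what it supports,
# and the session effectively runs with the intersection - the lowest
# common denominator. Capability names below are illustrative.

local = {"ipv4-unicast", "ipv6-unicast", "l2vpn-evpn", "route-refresh"}
remote = {"ipv4-unicast", "route-refresh"}

negotiated = local & remote  # only what BOTH peers support
```

Anything outside the intersection simply never gets exchanged, which is why a capability mismatch can look like "missing routes" rather than an error.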

Establish Adjacency - The protocol has reached this far, and finally the peers are ready to exchange the information needed to populate the RIB & other details such as the topology graph. An interesting example in a routing protocol context (in case you are wondering in which scenario two devices would be neighbors but not adjacent) is OSPF.

Messages Exchange - Finally we [assuming by now you are thinking like a routing protocol :)] reach the stage where we actually start exchanging information through messages. Messages need to be sent reliably, tracked (to understand which one to prefer when received from multiple sources) and acknowledged. The protocol also has to decide how to queue and dequeue them, at what intervals they should be sent out, and when to hold them back for a while to pack multiple events together for optimization, so that only the latest information goes out.
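
That hold-back-and-pack-events behavior can be sketched as a simple coalescing step:

```python
# Message-exchange sketch: events queued for the same prefix are coalesced
# so that when the pacing timer fires, only the latest state per prefix
# is actually sent.

def coalesce(events):
    """events: (prefix, state) pairs in arrival order -> latest per prefix."""
    latest = {}
    for prefix, state in events:
        latest[prefix] = state  # later events overwrite earlier ones
    return latest

events = [("10.0.0.0/24", "up"), ("10.0.1.0/24", "up"), ("10.0.0.0/24", "down")]
batch = coalesce(events)
```

A flapping prefix thus generates one update per pacing interval instead of one per flap - the optimization the paragraph above alludes to.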

Further Readings:

Cisco IP Routing: Packet Forwarding and Intra-domain Routing Protocols

Network Routing: Algorithms, Protocols, and Architectures

Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices

Inside Cisco IOS Software Architecture

HTH...

A Network Artist 🎨

Tuesday, November 2, 2021

A Simple Routing Protocols Decomposition Model - Part 1

 

You might have heard the term "mental models" before. Mental models are essentially a very simple yet very powerful tool. At their core, the purpose of mental models is to provide you with the necessary tools and building blocks to understand how something is really built and how it works behind the scenes.

Whether we recognize it or not, we all have mental models of some kind that we use and apply to situations in daily life. Some of those are developed consciously to solve certain problems or shape how we approach certain things, while others are genetically coded into our DNA by nature.

A quick example of a consciously developed mental model would be "how you manage your finances", while an unconscious one would be "run fast when you see a lion charging towards you".

Mental models can be simple or complex, layered, or even a blend of multiple models, depending upon how complex the topic at hand is, how deep you want to go down the rat hole, and in some cases whether you need to trace the roots of the real problem into adjacent domains.

Last but not least, your experiments and experiences always help you enrich your mental models, aka "lessons learned the hard way". On the flip side, mental models do create "biases" if followed blindly - one of the tradeoffs of using them. But we will keep that topic, along with the other tradeoffs, for another time.

Now you might be wondering:

1. What then is a Decomposition Model?

2. Are any of those available for IP Networking?

To answer the 1st one - the famous consulting firms were looking for a more glamorous term that would resonate better with business people (after all, technology details look boring to them for the most part), so they started calling these "Decomposition Models" and use the term heavily in various contexts, such as for "Maturity Models".

Coming back to the 2nd one, which is more relevant for this conversation: IP networking actually taught us a few mental models at the very beginning of our careers as network engineers, in the form of the OSI Model, TCP/IP Model, RINA Model, Hourglass Model and so forth. But as we advance further into our journey, the number of models available to us starts to shrink, if not completely disappear, with the ultimate answer to every question being not "the number 42" but the magical two words: "it depends". Standards bodies such as IETF, IEEE, MEF & ONUG etc. don't offer much help either for the most part.

But then nothing stops you from being a little more creative and coming up with your own models.

Let's start with a "Routing Protocol Decomposition Model".



It's a simple yet effective multi-purpose model that you can use:

- As a Routing Protocol Designer, to break down the complex equation into smaller chunks

- As a Network Architect, to break down your design choices into smaller & distinguishable sections/containers and tune the variables to achieve certain outcomes, while having a better view of the "tradeoffs" you are going to make as part of the process

- As an Implementation Engineer, to break down the device configuration into logical containers which are easier to manage, understand & configure, without falling into the trap of order-of-operations issues

- As an Automation Engineer (NetDevOps, NetOps, NetSecDevOps or whatever else you may call it), to break down the protocol in such a way that you can easily write a data model to code it, write test units for each individual block and so forth, besides developing a "YANG Model" or building a "CICD Pipeline"

- As an Operations Engineer or TAC Engineer, to approach the whole process more logically by following the direction of the "arrow", as it lets you understand "dependencies" as part of a layered approach

- As a Systems Architect, to think clearly about "Interaction Surfaces" & "Leaky Abstractions", besides understanding "Dependency Relations"

- To form an "Abstraction Model" that hides complexity & inner workings/details

- To plan "Streaming Telemetry" better, as you can map your use cases and dependencies more easily and feed them into a magical "Intent Based Networking (IBN)" solution

- As a consultant, for "Maturity Modelling" & "Assessment/Observability"

- As a tool to apply "Critical Thinking" to routing protocols

In the next part of this series we will bring this model from the current 10,000-foot view down to a 1,000-foot view by starting to populate sub-blocks under each major block.

Meanwhile, as an exercise, try to think of any feature or knob of your favorite routing protocol and see if there is anything that wouldn't fit into one of those layers. :)

HTH...

A Network Artist 🎨

Monday, October 18, 2021

State of Networking Industry In 2021... a bit of sarcasm with a pinch of salt :)

 
- Network Engineers who are still using the "CLI" are "CLI junkies" & "dated"

- Network Engineers who switched to GUIs in the holy name of SDN/IBN solutions are "cool"

- Network Engineers who often write scripts in YAML & Jinja are the "coolest"... shall we call it "The Holy CLI Mode"?

- Network Engineers using "Machine Learning" are "Gods of Thunder"

And now you can safely forget about - robust network design & implementation, standardization, modularity & failure domains, statistical analysis, people-processes-tools

HTH...

A Network Artist 🎨

Facebook Down Event - The dilemma of a CTO, Black Swans & Fallacies of IBN


While Facebook has just published a somewhat lengthy-ish root cause analysis (https://lnkd.in/guptSB3u) for the public about their recent worldwide network outage, which cut Facebook, WhatsApp & Instagram completely off from the internet, it must have raised some concerns in the worldwide CTO and CIO community.


Since historically they have been told - and pretty much every vendor in the networking industry keeps preaching - that the ways & methods (systems, people & processes) to avoid such circumstances are bright & magical ideas such as:

- Automation & Orchestration
- Intent Based Networking (IBN)
- Software Defined Networking (SDN)
- Centralized Controllers
- Data Models
- Automated Test & Deployment Pipeline with Unit Tests (aka CICD)
- Reliability & Resiliency Engineering
- AI/ML OPS
- Network Design Principles (Hierarchy, Swim Lanes, Segmentation etc.)
- Streaming Telemetry
- Observability Tools
- Bright Engineers
- Testbed Equipment
- Network Modelling & Simulation Tools with Formal Verification
- Rigorous platforms testing (HW/SW)
- Single Source of Truth
- Chaos Engineering
- BCP Plan
- Correlation Tools & what not

But if you go through this checklist, Facebook would probably have checks against all these items, and so would any FAANG company at this stage.

" So assuming you are a CTO or CIO, what would you suggest as possible next steps to your CEO & board if you have been called up this week for a meeting to discuss about how do we ensure such events don't happen in our network ? "

So let's park the above question for a while and look at the reactions we have seen so far.

1. The usual suspect: bad things happen and everything breaks at some point; focus on the RCA... move on and ensure it doesn't happen again

2. Network Architects favorite answer.... " it depends "

3. Was it a People or Process issue ?

4. The conspiracy theory that FB was under a Cyber attack which they don't want to disclose

5. Blame BGP (the easy suspect)... interestingly we got 10,000+ new BGP experts on Twitter and LinkedIn overnight :) despite the fact that 99% of them hardly understand the BGP details, since none of them looked at the problem from the perspectives of "unintended consequences", "ripple effects", "interaction surfaces", "failure domains" and so forth, besides all the pointers I listed above. So let's say blaming BGP was an easy pick for the "ghost" network engineers. Besides, the RCA published by Facebook doesn't cover any technical details either.

6. "The Black Swans" - This is an interesting one and less talked about fact in case of this outage. While some may claim this was just one of those black swan events, I personally seriously doubt that and more so in the absence of a detailed RCA.

HTH...

A Network Artist 🎨