Wednesday, December 22, 2021

The QOS Fallacies & Failures in a Modern Hybrid IT World - Part 1 of 2

 


When I last wrote about QOS around 7 years ago, I must say I had high hopes for SDN & IBN, as both paradigms were still evolving. Meanwhile, a few vendors have made unsuccessful attempts to automate QOS by throwing some sort of controller into the mix, while others claim they can solve this problem with the mighty IBN.

As we are about to move into 2022, many still wonder: does QOS make any sense at all in the context of modern networking?

In order to find the answers, let's break the problem into two parts:

1. Why QOS has been so unsuccessful historically
2. What are our options moving forward

So let's focus on point 1 to begin with.

1. How do we get started? - Interestingly enough, over a dozen books have been written on QOS in the context of IP networking over the last two decades or so, and they mostly talk about details such as congestion management vs. congestion avoidance and so forth. But very few of them actually jump into platform specifics in terms of capabilities and dependencies (both HW & SW). More interestingly, I personally haven't yet come across a single QOS book that gives you any practical advice or a framework/methodology for gathering technical requirements from applications in order to plan and craft a QOS policy. I have often seen people struggle to come up with one, and given that most QOS deployments are tactical rather than strategic, people are usually under time constraints to come up with one quickly. That's why people often end up copying references from recommended design guides etc., which hardly work in real life (unless you are very lucky!).

2. Benchmarking & Capacity Mgmt. - Most small and medium enterprises, and even a couple of the large ones that I have worked with, including Telcos, don't seem to have both of these in place as mature practices. Yet benchmarking is one of the key exercises you need to get through to craft a good QOS policy, besides being a necessity in any mature capacity mgmt. framework/practice. The other problem you are likely to run into here is that effective benchmarking & capacity mgmt. requires investing in additional visibility & performance mgmt. tools to gather the required details. These are usually quite expensive, besides the fact that you need to train your team on the tools and the required operating skills (for example statistical analysis, time series, sampling details etc.). At times the given tool may not natively offer the reporting/data that you need, which means you are dependent on whether the tool vendor's product road-map is aligned with your priorities and timelines. Also check carefully whether the tool allows you to run custom reports or export the required data in the format you need. So if your capacity mgmt. is still on Excel sheets, you know where you are heading.
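
To make the "statistical analysis" part a bit more concrete, here is a minimal Python sketch of why averages alone are a poor benchmark. It turns successive interface octet-counter readings into per-interval utilization and compares the mean with a nearest-rank 95th percentile. The function names, the sample values and the fixed polling interval are my own assumptions for illustration, and counter wrap is deliberately ignored.

```python
import math

def utilization_percent(octet_samples, interval_s, link_bps):
    """Convert successive octet-counter readings into per-interval utilization (%).

    Assumes a fixed polling interval and no counter wrap between readings.
    """
    rates = []
    for prev, curr in zip(octet_samples, octet_samples[1:]):
        delta_bits = (curr - prev) * 8
        rates.append(100.0 * delta_bits / (interval_s * link_bps))
    return rates

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical readings from a 1 Gbps link polled every 10 seconds.
samples = [0, 125_000_000, 250_000_000, 1_250_000_000]
rates = utilization_percent(samples, interval_s=10, link_bps=1_000_000_000)
mean = sum(rates) / len(rates)
p95 = percentile(rates, 95)
print(f"mean={mean:.1f}%  p95={p95:.1f}%")
```

With these made-up samples the mean sits around a third of the link while the 95th percentile is at 80%, which is exactly the kind of gap a QOS policy has to be designed around. The longer your polling interval, the more such bursts get averaged away.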

3. Measuring latency incorrectly - This is perhaps more common than you might think, besides the fact that most QOS books don't offer any practical advice here either, or details on how latency needs to be broken down across the spectrum. Can you tell me the breakup of end-to-end latency (Server - Client), and that too on a hop-by-hop basis?
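
As a rough illustration of what such a breakup could look like, the sketch below splits each hop's delay into the four textbook components: serialization, propagation, processing and queuing. The `Hop` structure, the sample numbers and the 5 microseconds per km figure for fiber are assumptions on my part. Serialization and propagation can be calculated, whereas processing and queuing delay have to come from measurement.

```python
from dataclasses import dataclass

# Assumption: light travels at roughly 200,000 km/s in fiber, i.e. about 5 us per km.
PROPAGATION_US_PER_KM = 5.0

@dataclass
class Hop:
    name: str
    link_bps: float       # egress link speed in bits per second
    distance_km: float    # cable distance to the next hop
    processing_us: float  # lookup/forwarding delay (measured, hypothetical here)
    queuing_us: float     # time spent waiting in the egress queue (measured)

def hop_breakdown(hop, packet_bytes):
    """Return the four delay components of one hop, in microseconds."""
    return {
        "serialization_us": packet_bytes * 8 / hop.link_bps * 1e6,
        "propagation_us": hop.distance_km * PROPAGATION_US_PER_KM,
        "processing_us": hop.processing_us,
        "queuing_us": hop.queuing_us,
    }

def end_to_end_us(hops, packet_bytes):
    """One-way latency: the sum of all components across all hops."""
    return sum(sum(hop_breakdown(h, packet_bytes).values()) for h in hops)

# Hypothetical two-hop path: a campus uplink followed by a WAN link.
path = [
    Hop("access", link_bps=1e9, distance_km=0.1, processing_us=10, queuing_us=20),
    Hop("wan", link_bps=100e6, distance_km=400, processing_us=15, queuing_us=900),
]
for hop in path:
    print(hop.name, hop_breakdown(hop, packet_bytes=1500))
print("total one-way (us):", end_to_end_us(path, packet_bytes=1500))
```

The point of the exercise is that only the queuing component is something QOS can really influence. If propagation dominates a hop, no amount of queue tuning will fix it.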

4. QOS Lifecycle Mgmt. - While this area has improved a bit with modern networking gear, the majority of equipment still out there is presumably older and doesn't offer much when it comes to QOS lifecycle mgmt., which includes Plan, Design & Implementation. More importantly, what it lacks are capabilities such as real-time QOS monitoring & reporting, along with correlation with network health & events. After all, you don't want to hire someone today to type a couple of show commands every few minutes, and running scripts won't be that helpful either for the most part when you are dealing with scale. Again, there are a couple of commercial and open source tools available, but it's an exercise that takes time and resources, besides the fact that you should know exactly what you are looking for in which scenario. BTW...whether that resource will come from the planning team, tools team or ops team is what I leave for you to figure out in real life. :)

Also, with the rise of modern solutions, which are mostly built around the magical controllers and overlays, you must think carefully about QOS FCAPS capabilities in such environments. For example, while from the overlay protocol perspective everything might be just a single hop away, the packet eventually still passes through the physical world in the underlay. So it's important to find out early:

- Whether QOS policies will be static or dynamic in nature (given you are using a controller of some sort)
- How QOS statistics get correlated between underlay & overlay
- How policy gets propagated
- How the controller interacts with other systems and policies to introduce dynamic behavior
- Whether the system allows time-based QOS policies (interestingly enough, most don't yet)
- Whether dummy policy dry runs are supported
- How your QOS policy gels with your ISP agreements, and how the systems would talk to each other, if at all, depending upon the SLA, performance & visibility/reporting requirements you both agree upon

5. Policy Stitching - This is one of the hardest parts to get right, even more so in a multi-vendor environment. As mentioned earlier, most QOS books and vendor QOS courses don't cover much detail on platform specifics and just assume that one will figure it out. Things get pretty complicated pretty quickly the moment you realize that QOS depends on:

- Platform and Specific Model you are using
- NOS version
- ASIC Architecture (ASIC Pipeline, Buffer, Memory type & speed, Oversubscription, Queue/Dequeue algorithm etc.)
- Chassis specifics (in case you are using one as opposed to fixed form factor) - for example VOQ, Switch Fabric, Fabric Generation, Fabric Modules Count etc.
- Supervisor Engine & Architecture, besides its generation and implementation specifics of the CEF vs. dCEF kind
- Policy Framework supported by NOS - for example hierarchical QOS, support for sub-interfaces/logical interfaces, how policy aggregation works and in which direction etc.

6. Modern App. Architectures - These days some of the new buzzwords in the application space are Cloud Native Apps, Microservices, Containers & Kubernetes etc. One might wonder how to go about planning QOS for such environments, which are highly dynamic in nature, with complex topologies, both short & long lived flows, and a mix of interaction surfaces with other systems and tools, such as distributed tracing feeding back into your QOS mgmt. tool.

7. Complexity Induced by Networks - There are some very common network choices that every network architect makes at some point which further complicate the QOS implementation. These are perhaps the complexities that must exist in order to deliver the desired outcomes and can hardly be avoided, such as:

- LAG & MC-LAG aka Port-Channels/Ether-Channels/Bundle Interfaces
- Multi-tenant Networks (remember, you only get a few queues in HW)
- Dynamic Network Traffic Patterns (even more so with TE Controllers) during stable conditions vs. failure conditions
- How TC gets implemented by a given vendor on a given platform & NOS
- Inflated throughput & performance numbers by vendors (very common)
- Different SP QOS Models (Customer Facing vs. Core Facing), as SPs usually have no more than 3 bits, i.e. 8 classes, to play around with, besides the different allocation models
- QOS in Dual Stack Networks vs. IPv6 only Networks
- Some nerd knobs such as QPPB
- Impact of Physical & Logical topology
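
To illustrate the "3 bits or 8 classes" squeeze from the list above: a 6-bit DSCP value has to be folded into a 3-bit field somewhere along the path. The sketch below uses the class-selector style mapping (keep the top 3 bits of the DSCP), which is one common default and not necessarily what your SP or platform does. The application names are made up.

```python
def dscp_to_3bit(dscp):
    """Fold a 6-bit DSCP into a 3-bit class by keeping its top 3 bits.

    Assumption: class-selector style mapping; real platforms and SPs
    may use an explicit, different table.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    return dscp >> 3

def collapse(dscp_by_app):
    """Group applications by the 3-bit class their DSCP ends up in."""
    classes = {}
    for app, dscp in dscp_by_app.items():
        classes.setdefault(dscp_to_3bit(dscp), []).append(app)
    return classes

# Hypothetical enterprise marking plan: EF=46, AF41=34, AF43=38, AF21=18, default=0.
plan = {"voice": 46, "video-conf": 34, "video-stream": 38, "erp": 18, "web": 0}
for cls, apps in sorted(collapse(plan).items(), reverse=True):
    print(cls, apps)
```

Notice that AF41 and AF43 land in the same class, so whatever drop-precedence distinction you designed between them in the enterprise is gone the moment the traffic crosses such a boundary.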

Hope you find this helpful, and let's continue with this in Part-2.

HTH...

A Network Artist 🎨