Type: Feature
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- network-edge
- px-reviewed

Hierarchy Progress Bar:

74% To Do, 0% In Progress, 26% Done

TL;DR: three basic claims; the rest is explanation and one example.

- We cannot improve long-term maintainability solely by fixing bugs.
- Teams should be asked to produce designs for improving maintainability and debuggability.
- Specific maintenance items (or investigations of maintenance items) should be placed into planning as peers to PM requests and explicitly prioritized against them.

While bug counts are an important metric, fixing bugs is different from investing in maintainability and debuggability. Fixing bugs alleviates immediate problems, but it doesn't improve our ability to address future ones. You may get a code base with fewer bugs, but when you add a new feature, problems and interactions will still be hard to debug. This pushes a code base towards stagnation, where it gets harder and harder to add features.

One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability instead of focusing only on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.

I have a concrete example of the outcome of focusing on bugs versus quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We have fixed the individual bugs, but have not improved the code for future debugging. As a result, we keep chasing hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity check capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
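To make the idea of a point-to-point connectivity check concrete, here is a minimal sketch of what a single check could record. It is only an illustration under assumptions: the names `CheckResult`, `check_tcp`, and `check_all` are hypothetical, it probes plain TCP reachability only, and it is not the design from openshift/enhancements#289 or any existing operator code.

```python
import socket
import time
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class CheckResult:
    """Outcome of one point-to-point probe (hypothetical structure)."""
    source: str
    target: str
    reachable: bool
    latency_seconds: Optional[float]
    error: Optional[str]


def check_tcp(host: str, port: int, timeout: float = 2.0,
              source: str = "local") -> CheckResult:
    """Try to open a TCP connection to host:port and record what happened."""
    target = f"{host}:{port}"
    start = time.monotonic()
    try:
        # create_connection resolves the name and connects; failures such as
        # refused connections, timeouts, and DNS errors all derive from OSError.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        return CheckResult(source, target, False, None,
                           f"{type(exc).__name__}: {exc}")
    return CheckResult(source, target, True, time.monotonic() - start, None)


def check_all(targets: Iterable[Tuple[str, int]],
              timeout: float = 2.0) -> List[CheckResult]:
    """Probe every (host, port) pair and return one result per target."""
    return [check_tcp(host, port, timeout) for host, port in targets]
```

If each component ran probes like this against the endpoints it depends on and published the results, a communication failure could be narrowed to a specific source and target pair instead of being rediscovered bug by bug. How and where results are published is left open here.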

We need more investment in our future selves. Simply saying "teams should reserve this" doesn't seem to be universally effective. Perhaps a more direct approach would work better: ask teams for designs and their expected impact, then follow up by placing the items directly in planning and prioritizing them against PM feature requests. This would give teams the confidence to invest in these areas and give broad exposure to systemic problems.

Relevant links:

Documentation:
- Edge Diagnostics Scratchpad, our team's internal diagnostic guide.
- Troubleshooting OCP networking issues - The complete guide, the SDN team's diagnostic guide.
- Linux Performance, Brendan Gregg's guide to analyzing Linux performance issues.
- RFC: A proper feedback loop on Alerts.
- OpenShift Router Reload Technical Overview on Access.
- Performance Scaling HAProxy with OpenShift on Access.
- How to collect worker metrics to troubleshoot CPU load, memory pressure and interrupt issues and networking on worker nodes in OCP 4 on Access.
- OpenShift Performance and Scale Knowledge Base on Mojo, results from OpenShift scalability testing.
- Scalability and performance, OCP 4.5 documentation about the router's currently known scalability limits.
- Scaling OpenShift Container Platform HAProxy Router, OCP 3.11 documentation about the manual performance configuration that was possible in OCP 3.
- Timing web requests with cURL and Chrome from the Cloudflare blog.
- tcpdump advanced filters, some useful tcpdump commands.
- OpenShift SDN - Networking, OCP 3.11 documentation on the SDN (useful background reading).
- Ingress Operator and Controller Status Conditions, design document for improved status condition reporting.
- Observability tips for HAProxy, a slide deck by Willy Tarreau.
- Interesting Traces - Out of Order versus Retransmissions, analysis using tshark.
- The PCP Book: A Complete Documentation of Performance Co-Pilot, by Yogesh Babar.
- Debugging kernel networking bug, brief guide to using SystemTap on RHCOS.
- Troubleshooting throughput issues from the OCP 4.5 documentation.
- Troubleshooting OpenShift Clusters and Workloads.
- Red Hat Enterprise Linux Network Performance Tuning Guide (PDF).
- openshift/enhancements#289 stability: point to point network check, a diagnostic built into the kube-apiserver operator.

Diagnostic tools:
- dropwatch to watch for packet drops.
- ethtool to check NIC configuration.
- iovisor/bcc: BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more; useful for tracing and diagnosing various issues in the networking stack.
- r-curler to gather timing information about HTTP/HTTPS connections.
- route-monitor, to monitor routes for reachability.
- hping(3), a programmable packet generator.
- OpenTracing / Jaeger in OpenShift.
- node-problem-detector, a possible integration point for new diagnostics.
- Using SystemTap by Brendan Gregg.
- DTrace SystemTap cheatsheet (PDF).

Visualization and more sophisticated diagnostic tools:
- eldadru/ksniff, kubectl plugin for tcpdump & Wireshark.
- ironcladlou/ditm, Dan's "Dan in the Middle" tool.
- Skydive, network diagnostic and visualization tool.
- ali, a "load testing tool capable of performing real-time analysis" with visualization.

Testing tools:
- stress-ng, a general stress-loading tool (CPU, filesystem, network, ...).
- mb, the networking benchmarking tool written and used by Jiri Mencak from our Perf+Scale team.

Case studies:
- BZ1763206 is an example of diagnosing DNS latency/timeouts.
- BZ1829779 Investigation details the diagnosis of route latency.
- BZ1845545 is an example of diagnosing misconfigured DNS for an external LB.
- Debugging network stalls on Kubernetes, from the GitHub Blog, about diagnosing Kubernetes performance issues related to ksoftirqd.

impacts account:
- NE-566 [Tech Debt] [Diagnostics] HAProxy Troubleshooting Enhancements (New)
- NE-557 [Tech Debt] [Observability] Router doesn't verify the generated haproxy config on a per-route level (New)
- NE-570 [Tech Debt] [Maint] Canary: Add router's certificate to canary client trust bundle. (New)
- NE-571 [Tech Debt] [Perf+Scale] Load testing dns (dnsblast?) looking for latency and jitter (New)
- NE-575 [Tech Debt] [Observability] Improve cluster-network-operator's status reporting for invalid proxy config (New)
- NE-582 [Tech Debt] [Testing+CI] Review OCP 3.11 documentation and identify gaps in test coverage (New)
- NE-678 [Tech Debt] [Observability] Add status to routes (New)
- NE-680 [Tracking Upstream] Switch OpenShift router to HAProxy 2.6 (Closed)
- NE-554 [Tech Debt] [Maint] Remove go-bindata dep from Ingress and DNS operators (Closed)
- NE-580 [Tracking Upstream] Bump HAProxy to 2.2.15 (Closed)
- NE-581 [Tracking Upstream] Bump openshift/coredns to the latest upstream release (Closed)
- OCPSTRAT-285 Upgrade OpenShift Router to HAProxy 2.6 (Closed)

relates to:
- NE-680 [Tracking Upstream] Switch OpenShift router to HAProxy 2.6 (Closed)
- OCPSTRAT-285 Upgrade OpenShift Router to HAProxy 2.6 (Closed)

(7 impacts account, 2 relates to)
