tldr: three basic claims, the rest is explanation and one example
- We cannot improve long term maintainability solely by fixing bugs.
- Teams should be asked to produce designs for improving maintainability/debuggability.
- Specific maintenance items (or investigations of maintenance items) should be placed into planning as peers to PM requests and explicitly prioritized against them.
While bugs are an important metric, fixing bugs is different from investing in maintainability and debuggability. Fixing bugs alleviates immediate problems, but it doesn't improve our ability to address future ones. You (may) get a code base with fewer bugs, but when you add a new feature, it will still be hard to debug problems and interactions. This pushes a code base towards stagnation, where it gets harder and harder to add features.
One alternative is to ask teams to produce ideas for how they would improve future maintainability and debuggability, instead of focusing only on immediate bugs. This would produce designs that make problem determination, bug resolution, and future feature additions faster over time.
I have a concrete example of the difference between focusing on bugs and focusing on quality. We have resolved many bugs about communication failures with ingress by finding problems with point-to-point network communication. We fixed the individual bugs, but did not improve the code for future debugging. As a result, we keep chasing hard-to-diagnose problems across the stack. The alternative is to create a point-to-point network connectivity capability. This would immediately improve bug resolution and stability (detection) for kuryr, ovs, legacy sdn, network-edge, kube-apiserver, openshift-apiserver, authentication, and console. Bug fixing does not produce the same impact.
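To make the idea of a point-to-point connectivity check more concrete, here is a minimal sketch in Python. It is not the implementation proposed for any OpenShift component; the names `CheckResult`, `check_tcp`, and `check_all` are hypothetical, and it assumes that a completed TCP handshake is an adequate signal of reachability. The point is only that each probe records which target failed and why, so a later investigation starts from data instead of guesswork.

```python
import socket
import time
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class CheckResult:
    """Outcome of one probe (hypothetical structure, for illustration)."""
    target: str            # "host:port" that was probed
    reachable: bool        # True if the TCP handshake completed
    latency_ms: float      # time spent on the attempt, in milliseconds
    error: Optional[str]   # failure reason, or None on success


def check_tcp(host: str, port: int, timeout: float = 1.0) -> CheckResult:
    """Attempt one TCP connection and report whether it succeeded."""
    target = f"{host}:{port}"
    start = time.monotonic()
    error = None
    try:
        # create_connection resolves the host and performs the handshake;
        # the connection is closed as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:
        # Covers refused connections, timeouts, and name resolution errors.
        error = f"{type(exc).__name__}: {exc}"
    latency_ms = (time.monotonic() - start) * 1000.0
    return CheckResult(target, error is None, latency_ms, error)


def check_all(targets: Iterable[Tuple[str, int]],
              timeout: float = 1.0) -> List[CheckResult]:
    """Probe every (host, port) pair in turn and collect the results."""
    return [check_tcp(host, port, timeout) for host, port in targets]
```

Run periodically from each component against the endpoints it depends on, a check like this would turn "communication with ingress failed" into a record of which source could not reach which target, and when.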
We need more investment in our future selves. Saying "teams should reserve this" doesn't seem to be universally effective. Perhaps a more direct approach would work better: ask for designs and their expected impact, then place the resulting items directly in planning and prioritize them against PM feature requests. This would give teams the confidence to invest in these areas and give systemic problems broad exposure.
- Edge Diagnostics Scratchpad, our team's internal diagnostic guide.
- Troubleshooting OCP networking issues - The complete guide, the SDN team's diagnostic guide.
- Linux Performance, Brendan Gregg's guide to analyzing Linux performance issues.
- RFC: A proper feedback loop on Alerts.
- OpenShift Router Reload Technical Overview on Access.
- Performance Scaling HAProxy with OpenShift on Access.
- How to collect worker metrics to troubleshoot CPU load, memory pressure and interrupt issues and networking on worker nodes in OCP 4 on Access.
- OpenShift Performance and Scale Knowledge Base on Mojo, results from OpenShift scalability testing.
- Scalability and performance, OCP 4.5 documentation about the router's currently known scalability limits.
- Scaling OpenShift Container Platform HAProxy Router, OCP 3.11 documentation about the manual performance configuration that was possible in OCP 3.
- Timing web requests with cURL and Chrome from the Cloudflare blog.
- tcpdump advanced filters, some useful tcpdump commands.
- OpenShift SDN - Networking, OCP 3.11 documentation on the SDN (useful background reading).
- Ingress Operator and Controller Status Conditions, design document for improved status condition reporting.
- Observability tips for HAProxy, a slide deck by Willy Tarreau.
- Interesting Traces - Out of Order versus Retransmissions, analysis using tshark.
- The PCP Book: A Complete Documentation of Performance Co-Pilot, by Yogesh Babar.
- Debugging kernel networking bug, brief guide to using SystemTap on RHCOS.
- Troubleshooting throughput issues from the OCP 4.5 documentation.
- Troubleshooting OpenShift Clusters and Workloads.
- Red Hat Enterprise Linux Network Performance Tuning Guide (PDF).
- openshift/enhancements#289 stability: point to point network check, a diagnostic built into the kube-apiserver operator.
- Diagnostic tools:
- dropwatch to watch for packet drops.
- ethtool to check NIC configuration.
- iovisor/bcc: BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more to trace and diagnose various issues in the networking stack.
- r-curler to gather timing information about HTTP/HTTPS connections.
- route-monitor, to monitor routes for reachability.
- hping(3), a programmable packet generator.
- OpenTracing / Jaeger in OpenShift.
- node-problem-detector, a possible integration point for new diagnostics.
- Using SystemTap by Brendan Gregg.
- DTrace SystemTap cheatsheet (PDF).
- Visualization and more sophisticated diagnostic tools:
- Testing tools:
- Case studies:
- BZ1763206 is an example of diagnosing DNS latency/timeouts.
- BZ1829779 Investigation details the diagnosis of route latency.
- BZ1845545 is an example of diagnosing misconfigured DNS for an external LB.
- Debugging network stalls on Kubernetes, from the GitHub Blog, about diagnosing Kubernetes performance issues related to ksoftirqd.