- Bug
- Resolution: Won't Do
- Major
- None
- 4.15, 4.16
- Moderate
- No
- 2
- NE Sprint 257, NE Sprint 264
- 2
- Rejected
- False
Description of problem:
First noticed in a 4.14 to 4.15 CI run:
: [sig-network-edge] ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new should be available throughout the test 0s
{ namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift was unreachable during disruption: for at least 2s (maxAllowed=1s):
Dec 18 09:54:54.536 - 999ms E namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift reason/DisruptionBegan request-audit-id/fa9942cb-cb42-4fd1-8367-8dca72d468c4 namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift stopped responding to GET requests over new connections: Get "https://oauth-openshift.apps.ci-op-229f34b6-caf63.ci2.azure.devcluster.openshift.com/healthz": dial tcp 13.89.116.81:443: i/o timeout
Dec 18 09:55:15.536 - 999ms E namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift reason/DisruptionBegan request-audit-id/4a3b526d-4098-4398-a708-29ee2b238530 namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift stopped responding to GET requests over new connections: Get "https://oauth-openshift.apps.ci-op-229f34b6-caf63.ci2.azure.devcluster.openshift.com/healthz": dial tcp 13.89.116.81:443: i/o timeout}
That 443: i/o timeout disruption is brief, so likely not a huge customer issue, but still something we would like to polish off to deliver zero-disruption updates.
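To make the "at least 2s (maxAllowed=1s)" arithmetic concrete, here is a minimal Python sketch that sums the per-interval durations in a failure message and compares them with the allowed value. The regular expressions, the function names, and the abbreviated sample message are assumptions for illustration; the report does not show how the test suite itself aggregates or rounds the intervals.

```python
import re

# Matches the "<timestamp> - <duration>ms E " prefix of each disruption interval.
INTERVAL_RE = re.compile(r"\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3} - (\d+)ms E ")
# Matches the allowance quoted in the message, e.g. "maxAllowed=1s".
ALLOWED_RE = re.compile(r"maxAllowed=(\d+)s")


def total_disruption_ms(message: str) -> int:
    """Sum the durations of all disruption intervals found in a failure message."""
    return sum(int(ms) for ms in INTERVAL_RE.findall(message))


def exceeds_allowed(message: str) -> bool:
    """Return True when the summed disruption is above the maxAllowed value."""
    allowed = ALLOWED_RE.search(message)
    if allowed is None:
        raise ValueError("no maxAllowed=<n>s found in message")
    return total_disruption_ms(message) > int(allowed.group(1)) * 1000


# Abbreviated (hypothetical) version of the failure message quoted above.
sample = (
    "was unreachable during disruption: for at least 2s (maxAllowed=1s): "
    "Dec 18 09:54:54.536 - 999ms E namespace/openshift-authentication reason/DisruptionBegan "
    "Dec 18 09:55:15.536 - 999ms E namespace/openshift-authentication reason/DisruptionBegan"
)
print(total_disruption_ms(sample))  # 1998
print(exceeds_allowed(sample))      # True
```

Two intervals of 999ms each add up to roughly 2s, which is how a pair of one-second blips is enough to exceed the 1s allowance.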
Version-Release number of selected component (if applicable):
Searching CI:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=dial+tcp+[0-9.]*:443:+i/o+timeout' | grep '[0-9][0-9] runs.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-gcp-ovn-heterogeneous (all) - 30 runs, 20% failed, 17% of failures match = 3% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-heterogeneous (all) - 30 runs, 17% failed, 40% of failures match = 7% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-sdn-serial-aws-arm64 (all) - 29 runs, 97% failed, 4% of failures match = 3% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 80% failed, 8% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 21% failed, 7% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 49 runs, 41% failed, 15% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 50 runs, 54% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 49 runs, 47% failed, 26% of failures match = 12% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 70 runs, 44% failed, 58% of failures match = 26% impact
periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 78 runs, 68% failed, 9% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 79 runs, 15% failed, 17% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 80 runs, 25% failed, 5% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-gcp-ovn-rt-upgrade (all) - 60 runs, 68% failed, 2% of failures match = 2% impact
periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-sdn (all) - 20 runs, 90% failed, 11% of failures match = 10% impact
periodic-ci-redhat-openshift-ecosystem-cvp-ocp-4.14-cvp-common-claim (all) - 47 runs, 2% failed, 100% of failures match = 2% impact
periodic-ci-shiftstack-shiftstack-ci-main-monitor-mecha-central (all) - 476 runs, 3% failed, 6% of failures match = 0% impact
periodic-ci-shiftstack-shiftstack-ci-main-monitor-vexxhost (all) - 465 runs, 3% failed, 14% of failures match = 0% impact
pull-ci-openshift-assisted-service-master-e2e-agent-compact-ipv4 (all) - 28 runs, 64% failed, 6% of failures match = 4% impact
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade (all) - 12 runs, 50% failed, 17% of failures match = 8% impact
release-openshift-origin-installer-launch-azure-modern (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
So this is almost exclusively seen in update jobs, although there are some serial and other hits too. It seems to be mostly minor-version updates such as 4.14 to 4.15 and 4.15 to 4.16, although it is also seen in patch updates within 4.16 and the like.
It is unclear whether the lack of hits from older 4.y releases is because the disruption suite wasn't sophisticated enough to pick this up, because older 4.y releases are actually not affected, or because their two-day run counts are just low enough that they didn't meet my "at least two-digit run count" filter.
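The search output can also be triaged programmatically. The following Python sketch parses one summary line, flags update jobs, and recomputes the impact as "failed percentage times matching percentage". The `JobHit` type, the "upgrade in the job name" heuristic, and the derived-impact formula are assumptions for illustration; the search page presumably computes its impact figure from raw counts, so the derived number can differ slightly from the printed one.

```python
import re
from typing import NamedTuple, Optional

# One summary line as printed by the search page, e.g.
# "<job> (all) - 78 runs, 68% failed, 9% of failures match = 6% impact"
LINE_RE = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


class JobHit(NamedTuple):
    job: str
    runs: int
    failed_pct: int
    match_pct: int
    impact_pct: int

    @property
    def is_update_job(self) -> bool:
        # Assumption: update jobs are recognisable by "upgrade" in the job name.
        return "upgrade" in self.job

    @property
    def derived_impact_pct(self) -> float:
        # Share of all runs that both failed and matched the search string.
        return self.failed_pct * self.match_pct / 100


def parse_line(line: str) -> Optional[JobHit]:
    """Parse one summary line; return None when the line has another shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return JobHit(m["job"], int(m["runs"]), int(m["failed"]), int(m["match"]), int(m["impact"]))


lines = [
    "periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 78 runs, 68% failed, 9% of failures match = 6% impact",
    "periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-sdn (all) - 20 runs, 90% failed, 11% of failures match = 10% impact",
]
for hit in filter(None, map(parse_line, lines)):
    print(hit.job, hit.is_update_job, hit.impact_pct, hit.derived_impact_pct)
```

For the first line, 68% of runs failing and 9% of those failures matching gives about 6.1% of all runs, in line with the reported 6% impact.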
How reproducible:
The 6% impact for periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade is probably a good benchmark for the high-run-count, 4.16-touching jobs.
Steps to Reproduce:
1. Run tens of periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade runs.
2. Check JUnit from those runs for mentions of 443: i/o timeout.
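Step 2 can be sketched in Python as a scan over downloaded JUnit XML. The function names, the directory layout, and the tiny sample document (with a shortened test name) are assumptions for illustration; real CI artifacts may be organised differently.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator, List, Tuple

NEEDLE = "443: i/o timeout"


def failing_tests_mentioning(root: ET.Element, needle: str = NEEDLE) -> List[str]:
    """Return names of test cases whose <failure> text or message contains needle."""
    names = []
    for case in root.iter("testcase"):
        for failure in case.iter("failure"):
            text = (failure.text or "") + (failure.get("message") or "")
            if needle in text:
                names.append(case.get("name", ""))
                break  # one hit per test case is enough
    return names


def scan_directory(directory: str) -> Iterator[Tuple[Path, str]]:
    """Yield (file, test name) for every matching failure under directory."""
    for path in sorted(Path(directory).rglob("*.xml")):
        for name in failing_tests_mentioning(ET.parse(path).getroot()):
            yield path, name


# Made-up minimal JUnit document, only to show the shape the scan expects.
SAMPLE = """<testsuite name="openshift-tests">
  <testcase name="route/oauth-openshift connection/new should be available throughout the test">
    <failure>dial tcp 13.89.116.81:443: i/o timeout</failure>
  </testcase>
  <testcase name="some other test"/>
</testsuite>"""

print(failing_tests_mentioning(ET.fromstring(SAMPLE)))
```

Dividing the number of runs with at least one hit by the total number of runs scanned gives a figure comparable to the impact percentage quoted in this report.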
Actual results:
Around 6% impact.
Expected results:
Zero impact.
Additional info:
Looks like we don't have 4.16 numbers in the origin disruption results yet:
$ jq -r '.[] | .Release' pkg/monitortestlibrary/allowedbackenddisruption/query_results.json | sort | uniq -c
   1868 4.14
   2002 4.15
But checking the 4.15 numbers for highly-available OVN on GCP (selected to get a limited set of matches similar to periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade):
$ jq -r '.[] | select(.Release == "4.15" and .Topology == "ha" and .MasterNodesUpdated == "Y" and .Platform == "gcp" and .Network == "ovn" and (.BackendName | endswith("-new-connections")) and .JobRuns > 20 and .P99 != "0.0") | .P99 + " " + tostring' pkg/monitortestlibrary/allowedbackenddisruption/query_results.json | sort -n
0.45 {"BackendName":"ci-cluster-network-liveness-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"0.45","P75":"0.0","P50":"0.0"}
1.0 {"BackendName":"ingress-to-console-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"1.0","P75":"0.0","P50":"0.0"}
1.0 {"BackendName":"kube-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"1.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"cache-kube-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"cache-kube-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"host-to-service-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":829,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"host-to-service-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"ingress-to-console-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"ingress-to-oauth-server-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"kube-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"pod-to-service-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":829,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
2.0 {"BackendName":"pod-to-service-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
3.0 {"BackendName":"image-registry-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"2.0","P99":"3.0","P75":"0.0","P50":"0.0"}
3.0 {"BackendName":"image-registry-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"2.0","P99":"3.0","P75":"1.0","P50":"0.0"}
3.0 {"BackendName":"ingress-to-oauth-server-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"3.0","P75":"0.0","P50":"0.0"}
3.53 {"BackendName":"service-load-balancer-with-pdb-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":826,"P95":"1.0","P99":"3.53","P75":"0.0","P50":"0.0"}
4.0 {"BackendName":"service-load-balancer-with-pdb-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"4.0","P75":"0.0","P50":"0.0"}
8.0 {"BackendName":"host-to-pod-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"2.0","P99":"8.0","P75":"0.0","P50":"0.0"}
10.0 {"BackendName":"host-to-pod-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"6.0","P99":"10.0","P75":"0.0","P50":"0.0"}
57.0 {"BackendName":"cache-openshift-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"57.0","P75":"0.0","P50":"0.0"}
57.0 {"BackendName":"oauth-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"57.0","P75":"0.0","P50":"0.0"}
61.3 {"BackendName":"cache-oauth-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"61.3","P75":"0.0","P50":"0.0"}
62.5 {"BackendName":"openshift-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"62.5","P75":"0.0","P50":"0.0"}
72.0 {"BackendName":"host-to-host-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"72.0","P75":"0.0","P50":"0.0"}
74.0 {"BackendName":"pod-to-host-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"74.0","P75":"0.0","P50":"0.0"}
118.0 {"BackendName":"host-to-host-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"118.0","P75":"0.0","P50":"0.0"}
118.0 {"BackendName":"pod-to-host-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"118.0","P75":"0.0","P50":"0.0"}
178.9 {"BackendName":"oauth-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"178.9","P75":"0.0","P50":"0.0"}
194.25 {"BackendName":"cache-oauth-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"194.25","P75":"0.0","P50":"0.0"}
208.0 {"BackendName":"openshift-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"208.0","P75":"0.0","P50":"0.0"}
208.25 {"BackendName":"cache-openshift-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"208.25","P75":"0.0","P50":"0.0"}
The higher-disruption backends are probably something else, but the backends whose P99 disruption is in the range of a few seconds could all be because of this DNS instability.
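The split between "a few seconds" and "higher-disruption" backends can be sketched in Python as follows. The 10-second cut-off and the function names are assumptions made for illustration, since the report gives no exact boundary; the three sample records are copied from the jq output above.

```python
import json
from typing import Dict, List

# Assumption: "seconds range" is read here as a P99 of at most 10 seconds.
SECONDS_RANGE_MAX = 10.0


def select_backends(records: List[dict]) -> List[dict]:
    """Mirror the jq filter: 4.15, HA, masters updated, GCP, OVN, new connections."""
    return [
        r for r in records
        if r["Release"] == "4.15"
        and r["Topology"] == "ha"
        and r["MasterNodesUpdated"] == "Y"
        and r["Platform"] == "gcp"
        and r["Network"] == "ovn"
        and r["BackendName"].endswith("-new-connections")
        and r["JobRuns"] > 20
        and r["P99"] != "0.0"
    ]


def bucket_by_p99(records: List[dict]) -> Dict[str, List[str]]:
    """Group backends into 'seconds' and 'higher' by their P99 disruption."""
    buckets: Dict[str, List[str]] = {"seconds": [], "higher": []}
    for r in sorted(records, key=lambda rec: float(rec["P99"])):
        key = "seconds" if float(r["P99"]) <= SECONDS_RANGE_MAX else "higher"
        buckets[key].append(f'{r["BackendName"]} (from {r["FromRelease"]})')
    return buckets


SAMPLE = json.loads("""[
 {"BackendName":"openshift-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"208.0","P75":"0.0","P50":"0.0"},
 {"BackendName":"ingress-to-oauth-server-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"2.0","P75":"0.0","P50":"0.0"},
 {"BackendName":"host-to-pod-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"6.0","P99":"10.0","P75":"0.0","P50":"0.0"}
]""")

for bucket, names in bucket_by_p99(select_backends(SAMPLE)).items():
    print(bucket, names)
```

To run this against the full data set, load `pkg/monitortestlibrary/allowedbackenddisruption/query_results.json` with `json.load` in place of `SAMPLE`.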
- causes
- OCPBUGS-45806 Massive service disruption during OpenShift Container Platform 4.16 upgrade due to OVS table flush
- Closed