  1. OpenShift Bugs
  2. OCPBUGS-25640

Clusters should not have ~1s HTTPS i/o timeout blips during updates

    • Bug
    • Resolution: Won't Do
    • Major
    • None
    • 4.15, 4.16
    • Networking / DNS
    • Moderate
    • No
    • 2
    • NE Sprint 257, NE Sprint 264
    • 2
    • Rejected
    • False

      Description of problem:

      First noticed in a 4.14 to 4.15 CI run:

      : [sig-network-edge] ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new should be available throughout the test	0s
      {  namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift was unreachable during disruption:  for at least 2s (maxAllowed=1s):
      
      Dec 18 09:54:54.536 - 999ms E namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift reason/DisruptionBegan request-audit-id/fa9942cb-cb42-4fd1-8367-8dca72d468c4 namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift stopped responding to GET requests over new connections: Get "https://oauth-openshift.apps.ci-op-229f34b6-caf63.ci2.azure.devcluster.openshift.com/healthz": dial tcp 13.89.116.81:443: i/o timeout
      Dec 18 09:55:15.536 - 999ms E namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift reason/DisruptionBegan request-audit-id/4a3b526d-4098-4398-a708-29ee2b238530 namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections connection/new disruption/openshift-tests route/oauth-openshift stopped responding to GET requests over new connections: Get "https://oauth-openshift.apps.ci-op-229f34b6-caf63.ci2.azure.devcluster.openshift.com/healthz": dial tcp 13.89.116.81:443: i/o timeout}
      

      That 443: i/o timeout disruption is brief, so likely not a huge customer issue, but still something we would like to polish off to deliver zero-disruption updates.
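
      As a rough illustration of why two sub-second blips trip the test, the sketch below sums the interval durations from the two log lines above and compares the total with the maxAllowed=1s budget. The regular expression and the helper name total_disruption_ms are assumptions made for this example; the disruption suite's own accounting is not shown in this report.

```python
import re

# Two interval lines abbreviated from the JUnit output above (prefix only).
LOG = """\
Dec 18 09:54:54.536 - 999ms E namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections
Dec 18 09:55:15.536 - 999ms E namespace/openshift-authentication backend-disruption-name/ingress-to-oauth-server-new-connections
"""

# Matches "<timestamp> - <N>ms " at the start of a line (assumed log layout).
INTERVAL = re.compile(
    r"^(\w{3} \d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) - (\d+)ms ", re.MULTILINE
)


def total_disruption_ms(log: str) -> int:
    """Sum the durations of all disruption intervals found in the log text."""
    return sum(int(ms) for _timestamp, ms in INTERVAL.findall(log))


MAX_ALLOWED_MS = 1000  # maxAllowed=1s from the failure message

total = total_disruption_ms(LOG)
print(f"{total}ms of disruption, allowed {MAX_ALLOWED_MS}ms, exceeded={total > MAX_ALLOWED_MS}")
```

      Each blip on its own stays within the 1s budget; the two together come to roughly the 2s that the failure message reports.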

      Version-Release number of selected component (if applicable):

      Searching CI:

      $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=dial+tcp+[0-9.]*:443:+i/o+timeout' | grep '[0-9][0-9] runs.*failures match' | sort
      periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-gcp-ovn-heterogeneous (all) - 30 runs, 20% failed, 17% of failures match = 3% impact
      periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-heterogeneous (all) - 30 runs, 17% failed, 40% of failures match = 7% impact
      periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-sdn-serial-aws-arm64 (all) - 29 runs, 97% failed, 4% of failures match = 3% impact
      periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 80% failed, 8% of failures match = 7% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 70 runs, 21% failed, 7% of failures match = 1% impact
      periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 49 runs, 41% failed, 15% of failures match = 6% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 50 runs, 54% failed, 4% of failures match = 2% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 49 runs, 47% failed, 26% of failures match = 12% impact
      periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 70 runs, 44% failed, 58% of failures match = 26% impact
      periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 78 runs, 68% failed, 9% of failures match = 6% impact
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 79 runs, 15% failed, 17% of failures match = 3% impact
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 80 runs, 25% failed, 5% of failures match = 1% impact
      periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-gcp-ovn-rt-upgrade (all) - 60 runs, 68% failed, 2% of failures match = 2% impact
      periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-sdn (all) - 20 runs, 90% failed, 11% of failures match = 10% impact
      periodic-ci-redhat-openshift-ecosystem-cvp-ocp-4.14-cvp-common-claim (all) - 47 runs, 2% failed, 100% of failures match = 2% impact
      periodic-ci-shiftstack-shiftstack-ci-main-monitor-mecha-central (all) - 476 runs, 3% failed, 6% of failures match = 0% impact
      periodic-ci-shiftstack-shiftstack-ci-main-monitor-vexxhost (all) - 465 runs, 3% failed, 14% of failures match = 0% impact
      pull-ci-openshift-assisted-service-master-e2e-agent-compact-ipv4 (all) - 28 runs, 64% failed, 6% of failures match = 4% impact
      pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 60% failed, 17% of failures match = 10% impact
      pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade (all) - 12 runs, 50% failed, 17% of failures match = 8% impact
      release-openshift-origin-installer-launch-azure-modern (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
      

      So this is almost exclusively seen in update jobs, although there are some serial and other hits too. It seems to be mostly minor-version updates such as 4.14 to 4.15 and 4.15 to 4.16, although it is also seen in patch updates within 4.16.
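
      To compare jobs more easily, the summary lines can be parsed into records and ranked by reported impact. The following is a minimal sketch: the JobSummary type, the regular expression, and the "upgrade in the job name" heuristic for spotting update jobs are all assumptions for illustration, not part of the CI search tooling.

```python
import re
from typing import NamedTuple


class JobSummary(NamedTuple):
    name: str
    runs: int
    failed_pct: int
    match_pct: int
    impact_pct: int


LINE = re.compile(
    r"^(?P<name>\S+) \(all\) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match = (?P<impact>\d+)% impact$"
)


def parse(line: str) -> JobSummary:
    """Parse one summary line of the search output into a JobSummary."""
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    return JobSummary(
        m["name"], int(m["runs"]), int(m["failed"]), int(m["match"]), int(m["impact"])
    )


# A few lines copied from the search output above.
SAMPLE = [
    "periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 70 runs, 44% failed, 58% of failures match = 26% impact",
    "periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade (all) - 78 runs, 68% failed, 9% of failures match = 6% impact",
    "periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-sdn-serial-aws-arm64 (all) - 29 runs, 97% failed, 4% of failures match = 3% impact",
    "periodic-ci-openshift-release-master-nightly-4.16-e2e-vsphere-sdn (all) - 20 runs, 90% failed, 11% of failures match = 10% impact",
]

jobs = sorted((parse(line) for line in SAMPLE), key=lambda j: j.impact_pct, reverse=True)
# Heuristic (assumption): update jobs carry "upgrade" in their name.
upgrade_jobs = [j for j in jobs if "upgrade" in j.name]

for j in jobs:
    kind = "update" if "upgrade" in j.name else "other"
    print(f"{j.impact_pct:3d}% {kind:6s} {j.name}")
```

      Note that multiplying the rounded percentages does not always reproduce the printed impact (97% of 4% is about 3.9%, yet that line reports 3%), presumably because the search tool works from raw run counts, so the sketch ranks by the reported impact figure instead of recomputing it.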

      It is unclear whether the lack of hits from older 4.y is because the disruption suite wasn't sophisticated enough to pick this up, because older 4.y are actually not affected, or because their 48h run counts are low enough that they didn't meet my "at least two-digit run count" filter.
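
      The effect of the "at least two-digit run count" filter can be shown with a Python equivalent of the grep pattern used in the search command. In this sketch the first line is copied from the output above, while the second is a hypothetical low-volume job invented purely to show what the filter drops.

```python
import re

# Python equivalent of: grep '[0-9][0-9] runs.*failures match'
TWO_DIGIT_RUNS = re.compile(r"[0-9][0-9] runs.*failures match")

lines = [
    # Copied from the search output above.
    "pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn (all) - 10 runs, 60% failed, 17% of failures match = 10% impact",
    # Hypothetical low-volume job, invented to show what the filter drops.
    "hypothetical-older-4.y-upgrade-job (all) - 7 runs, 29% failed, 50% of failures match = 14% impact",
]

kept = [line for line in lines if TWO_DIGIT_RUNS.search(line)]
dropped = [line for line in lines if not TWO_DIGIT_RUNS.search(line)]

print("kept:   ", [line.split(" ")[0] for line in kept])
print("dropped:", [line.split(" ")[0] for line in dropped])
```

      A job with fewer than ten runs in the 48h window never reaches the sorted list, however high its impact percentage, which is why absence from the list says little about low-volume older releases.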

      How reproducible:

      The 6% impact for periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade is probably a good benchmark for the high-run-count, 4.16-touching jobs.

      Steps to Reproduce:

      1. Run tens of periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade runs.
      2. Check JUnit from those runs for mentions of 443: i/o timeout.
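
      Step 2 could be scripted along the following lines. This is a sketch that assumes the JUnit artifacts are plain XML with testcase and failure elements; the sample document is hypothetical and heavily trimmed, and the helper name matching_testcases is invented for this example.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Same pattern as the CI search query: a TCP dial to port 443 timing out.
TIMEOUT = re.compile(r"dial tcp [0-9.]*:443: i/o timeout")


def matching_testcases(junit_xml: str) -> List[str]:
    """Return names of failed test cases whose failure text mentions the timeout."""
    root = ET.fromstring(junit_xml)
    hits = []
    for case in root.iter("testcase"):
        for failure in case.iter("failure"):
            text = (failure.get("message") or "") + (failure.text or "")
            if TIMEOUT.search(text):
                hits.append(case.get("name", ""))
                break  # one hit per test case is enough
    return hits


# Hypothetical, heavily trimmed JUnit document for illustration only.
SAMPLE = """<testsuite>
  <testcase name="disruption/ingress-to-oauth-server connection/new should be available throughout the test">
    <failure>dial tcp 13.89.116.81:443: i/o timeout</failure>
  </testcase>
  <testcase name="some other test">
    <failure>unrelated failure</failure>
  </testcase>
  <testcase name="a passing test"/>
</testsuite>"""

print(matching_testcases(SAMPLE))
```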

      Actual results:

      Around 6% impact.

      Expected results:

      Zero impact.

      Additional info:

      Looks like we don't have 4.16 numbers in the origin disruption results yet:

      $ jq -r '.[] | .Release' pkg/monitortestlibrary/allowedbackenddisruption/query_results.json | sort | uniq -c
         1868 4.14
         2002 4.15
      

      But checking the 4.15 numbers for highly-available OVN on GCP (selected to get a limited set of matches similar to periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade):

      $ jq -r '.[] | select(.Release == "4.15" and .Topology == "ha" and .MasterNodesUpdated == "Y" and .Platform == "gcp" and .Network == "ovn" and (.BackendName | endswith("-new-connections")) and .JobRuns > 20 and .P99 != "0.0") | .P99 + " " + tostring' pkg/monitortestlibrary/allowedbackenddisruption/query_results.json | sort -n
      0.45 {"BackendName":"ci-cluster-network-liveness-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"0.45","P75":"0.0","P50":"0.0"}
      1.0 {"BackendName":"ingress-to-console-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"1.0","P75":"0.0","P50":"0.0"}
      1.0 {"BackendName":"kube-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"1.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"cache-kube-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"cache-kube-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"host-to-service-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":829,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"host-to-service-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"ingress-to-console-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"ingress-to-oauth-server-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"kube-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"pod-to-service-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":829,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      2.0 {"BackendName":"pod-to-service-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"2.0","P75":"0.0","P50":"0.0"}
      3.0 {"BackendName":"image-registry-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"2.0","P99":"3.0","P75":"0.0","P50":"0.0"}
      3.0 {"BackendName":"image-registry-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"2.0","P99":"3.0","P75":"1.0","P50":"0.0"}
      3.0 {"BackendName":"ingress-to-oauth-server-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"3.0","P75":"0.0","P50":"0.0"}
      3.53 {"BackendName":"service-load-balancer-with-pdb-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":826,"P95":"1.0","P99":"3.53","P75":"0.0","P50":"0.0"}
      4.0 {"BackendName":"service-load-balancer-with-pdb-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"4.0","P75":"0.0","P50":"0.0"}
      8.0 {"BackendName":"host-to-pod-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"2.0","P99":"8.0","P75":"0.0","P50":"0.0"}
      10.0 {"BackendName":"host-to-pod-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"6.0","P99":"10.0","P75":"0.0","P50":"0.0"}
      57.0 {"BackendName":"cache-openshift-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"57.0","P75":"0.0","P50":"0.0"}
      57.0 {"BackendName":"oauth-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"57.0","P75":"0.0","P50":"0.0"}
      61.3 {"BackendName":"cache-oauth-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"61.3","P75":"0.0","P50":"0.0"}
      62.5 {"BackendName":"openshift-api-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"1.0","P99":"62.5","P75":"0.0","P50":"0.0"}
      72.0 {"BackendName":"host-to-host-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"72.0","P75":"0.0","P50":"0.0"}
      74.0 {"BackendName":"pod-to-host-new-connections","Release":"4.15","FromRelease":"4.15","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":1212,"P95":"0.0","P99":"74.0","P75":"0.0","P50":"0.0"}
      118.0 {"BackendName":"host-to-host-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"118.0","P75":"0.0","P50":"0.0"}
      118.0 {"BackendName":"pod-to-host-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"118.0","P75":"0.0","P50":"0.0"}
      178.9 {"BackendName":"oauth-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"178.9","P75":"0.0","P50":"0.0"}
      194.25 {"BackendName":"cache-oauth-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"194.25","P75":"0.0","P50":"0.0"}
      208.0 {"BackendName":"openshift-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"208.0","P75":"0.0","P50":"0.0"}
      208.25 {"BackendName":"cache-openshift-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"208.25","P75":"0.0","P50":"0.0"}
      

      The higher-disruption backends are probably something else, but the backends with p99 disruptions in the range of a few seconds could all be because of this DNS instability.
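
      The split between the two groups can be sketched in Python by mirroring the jq filter and bucketing on P99. The three records are copied from the output above; the 10-second cut-off between "seconds range" and "higher" is an assumption chosen for this example, as is the helper name select_backends.

```python
import json


def select_backends(records, release="4.15", platform="gcp", network="ovn", min_runs=20):
    """Mirror the jq filter: HA, masters updated, new-connection backends, nonzero P99."""
    return [
        r for r in records
        if r["Release"] == release
        and r["Topology"] == "ha"
        and r["MasterNodesUpdated"] == "Y"
        and r["Platform"] == platform
        and r["Network"] == network
        and r["BackendName"].endswith("-new-connections")
        and r["JobRuns"] > min_runs
        and r["P99"] != "0.0"
    ]


# Three records copied from the query output above.
SAMPLE = json.loads("""[
{"BackendName":"kube-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"0.0","P99":"1.0","P75":"0.0","P50":"0.0"},
{"BackendName":"ingress-to-oauth-server-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"2.0","P75":"0.0","P50":"0.0"},
{"BackendName":"cache-openshift-api-new-connections","Release":"4.15","FromRelease":"4.14","Platform":"gcp","Architecture":"amd64","Network":"ovn","Topology":"ha","MasterNodesUpdated":"Y","JobRuns":834,"P95":"1.0","P99":"208.25","P75":"0.0","P50":"0.0"}
]""")

THRESHOLD_SECONDS = 10.0  # assumed cut-off, not taken from the source

selected = sorted(select_backends(SAMPLE), key=lambda r: float(r["P99"]))
seconds_range = [r["BackendName"] for r in selected if float(r["P99"]) <= THRESHOLD_SECONDS]
higher = [r["BackendName"] for r in selected if float(r["P99"]) > THRESHOLD_SECONDS]

print("seconds range:", seconds_range)
print("higher:       ", higher)
```

      P99 is stored as a string in query_results.json, so the sketch converts it with float() before comparing, much as the shell pipeline relies on sort -n.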

              mmasters1@redhat.com Miciah Masters
              trking W. Trevor King
              Melvin Joseph Melvin Joseph