OpenShift Bugs / OCPBUGS-35969

'http2: client connection lost' flakes in apiserver calls


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • None
    • 4.17
    • Test Framework
    • None
    • Low
    • No
    • False

      Description of problem:

      This card does not describe a single problem that can be easily fixed, but captures relevant information about a (fortunately quite rare) flake that can show up in different testcases.

      Test cases that call the apiserver through client-go clients (that is, many test cases) can occasionally fail when a call errors out with an http2: client connection lost error. These errors are not retried. I assume there is a good reason why neither the Go HTTP code nor client-go retries them: my guess is that the server-side operation may have actually succeeded, so a retry would repeat the call and possibly hit an ErrNotExists after, e.g., a retried delete. As a result, on rare occasions the error hits a cleanup-related DELETE or an ordinary GET and fails the whole test, which could have succeeded had the test retried the operation.

      This is not an easy problem to solve. Technically it is a disruption bug (apiserver operations should not fail, or at least not often), so making the test suite too robust would suppress a potentially valuable disruption signal. The http2: client connection lost error is essentially a client-side timeout: the connection is killed when no ping response arrives within 15s (by default). My hypothesis is therefore that it is related to the low-performance problems observed on Azure clusters.

      It seems to be rare enough not to cause issues for individual tests, but there are backstop tests that inherit other tests' failures, and in the past we have observed component readiness regressions in such tests:

      [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
      [Jira:"Cluster Version Operator"] monitor test required-scc-annotation-checker collection
      

      Slack conversations

      Examples

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1802942337257574400

      Get "https://api.ci-op-zqpjp81t-6c39f.ci2.azure.devcluster.openshift.com:6443/api/v1/namespaces/e2e-check-for-dns-availability-8309/pods?labelSelector=app%3Ddns-test-a3dcf3f8-0bc4-4ce0-b295-3ccdbaa3f5ed": http2: client connection lost
      

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-azure-ovn-upgrade/1801770961364586496

      Delete "https://api.ci-op-inbxkqhx-6c39f.ci2.azure.devcluster.openshift.com:6443/api/v1/namespaces/e2e-image-pulls-are-fast-1163": http2: client connection lost
      
      Get "https://api.ci-op-rwi91g7v-6c39f.ci2.azure.devcluster.openshift.com:6443/apis/apps/v1/namespaces/e2e-k8s-sig-apps-replicaset-upgrade-8951/replicasets/rs": http2: client connection lost
      

            rhn-engineering-dgoodwin Devan Goodwin
            afri@afri.cz Petr Muller