OpenShift Bugs / OCPBUGS-13543

Disruption for load balancer service that blocks both 4.14 and 4.13 payloads


Details

    • Critical
    • No
    • Rejected
    • False

    Description

      Recently, both 4.14 and 4.13 payloads have been blocked by this disruption variant: "service-load-balancer-with-pdb-new-connections/service-load-balancer-with-pdb-reused-connections xxxxxx should not be worse".

      The following chart shows the problem: https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?from=1681226000299&to=1683818000299&var-platform=azure&var-platform=aws&var-platform=gcp&var-percentile=P50&var-backend=service-load-balancer-with-pdb-new-connections&var-backend=service-load-balancer-with-pdb-reused-connections&var-releases=4.13&var-from_releases=4.12&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&orgId=1

      Our analysis indicates that a change in node lifecycle behavior might be related to this regression. The 4.14 and 4.13 data are very similar, so the detailed description below uses 4.13 data, since there were fewer changes on the 4.13 branch.

      Take 4.13 CI as an example

      Good payload: https://sippy.dptools.openshift.org/sippy-ng/release/4.13/tags/4.13.0-0.ci-2023-05-06-033438

      Problem payload: https://sippy.dptools.openshift.org/sippy-ng/release/4.13/tags/4.13.0-0.ci-2023-05-08-161719

      But we do not see any PRs in the problem payload!

      Analysis of the problem indicates that the reporting of NodeNotReady status seems to be related to the issue. In the following example job runs, if you expand the first interval chart and go to the node-state section, you will see that, for the worker nodes, the yellow bar indicating the node-not-ready window is narrow and lasts only about 20s in the problem job. In comparison, in the good-payload job this bar typically spans about 2 minutes.

      This is a job from the good payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1654691394544996352

      For comparison, here is a job from the problem payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1655609894633476096
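      As a side note, the width of that not-ready window can also be read directly from each node's Ready condition: LastTransitionTime records when the condition last flipped. Below is a minimal client-go sketch for doing that; it is an illustration only and assumes live cluster access through a standard kubeconfig, which is not something the CI artifacts provide.

      package main

      import (
          "context"
          "fmt"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Resolve a kubeconfig the same way kubectl/oc does (KUBECONFIG, ~/.kube/config).
          config, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
              clientcmd.NewDefaultClientConfigLoadingRules(),
              &clientcmd.ConfigOverrides{},
          ).ClientConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(config)

          nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, node := range nodes.Items {
              for _, cond := range node.Status.Conditions {
                  if cond.Type == corev1.NodeReady {
                      // LastTransitionTime marks when Ready last flipped; the gap between
                      // the flip to NotReady and the flip back to Ready is the width of
                      // the yellow "node not ready" bar in the interval chart.
                      fmt.Printf("%s Ready=%s lastTransition=%v\n",
                          node.Name, cond.Status, cond.LastTransitionTime)
                  }
              }
          }
      }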

      The CCM (4.14) or the in-tree kube-controller-manager (4.13) removes the instance from the load balancer based on this status.

      For example, from this log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-aws-ovn-upgrade/1655609894633476096/artifacts/e2e-aws-ovn-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ip-10-0-132-208.us-west-2.compute.internal_kube-controller-manager.log

      I0508 18:31:52.138738 1 aws_loadbalancer.go:1483] Instances removed from load-balancer a24bcbae1386d4d6ca8c710cc358322d
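      To make the mechanism concrete, here is a simplified sketch of the node-exclusion pattern behind that log line. It is illustrative only, not the actual aws cloud-provider code, and the function names are hypothetical: nodes whose Ready condition is not True are dropped from the set of load-balancer backends.

      package main

      import (
          "fmt"

          corev1 "k8s.io/api/core/v1"
      )

      // isNodeReady reports whether a node's Ready condition is True.
      func isNodeReady(node *corev1.Node) bool {
          for _, cond := range node.Status.Conditions {
              if cond.Type == corev1.NodeReady {
                  return cond.Status == corev1.ConditionTrue
              }
          }
          return false
      }

      // lbCandidates keeps only the nodes that should remain behind the service
      // load balancer; NotReady nodes are filtered out, which is what produces the
      // "Instances removed from load-balancer" message quoted above.
      func lbCandidates(nodes []*corev1.Node) []*corev1.Node {
          var out []*corev1.Node
          for _, n := range nodes {
              if isNodeReady(n) {
                  out = append(out, n)
              }
          }
          return out
      }

      func main() {
          // Hypothetical example: one Ready worker and one NotReady worker.
          ready := &corev1.Node{}
          ready.Name = "worker-ready"
          ready.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady, Status: corev1.ConditionTrue}}

          notReady := &corev1.Node{}
          notReady.Name = "worker-not-ready"
          notReady.Status.Conditions = []corev1.NodeCondition{{Type: corev1.NodeReady, Status: corev1.ConditionFalse}}

          for _, n := range lbCandidates([]*corev1.Node{ready, notReady}) {
              fmt.Println("keeping behind the load balancer:", n.Name)
          }
      }

      The ordering is the key point: the instance is only pulled from the load balancer once Ready flips to False, so a late flip leaves the instance in rotation while the node goes down.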

      Also, in the bad payload run, based on the journal logs, the short ~20s "node not ready" window seems to start only after the node is rebooted. In the good payload run, the event starts before the reboot begins (typically more than 1 minute earlier).

      We suspect that because node readiness does not change sooner, the instance is not removed from the load balancer in time, which causes the disruption we are seeing.

      What is odd is that this problem seemed to exist before and was somehow fixed on 4/20/2023, but it then broke again between the example payloads above.

      Please ping TRT if any more information is needed.

      Attachments

        1. image-2023-05-11-15-10-13-466.png
          120 kB
          Forrest Babcock
        2. image-2023-05-11-15-12-02-986.png
          73 kB
          Forrest Babcock
        3. image-2023-05-11-15-18-48-807.png
          68 kB
          Forrest Babcock
        4. screenshot-1.png
          91 kB
          Forrest Babcock


          People

            rh-ee-fbabcock Forrest Babcock
            kenzhang@redhat.com Ken Zhang
            Sunil Choudhary Sunil Choudhary
            Votes: 0
            Watchers: 7
