  1. OpenShift Bugs
  2. OCPBUGS-42664

[4.15] etcd vertical scaling test should not rely on CPMS status.readyReplicas

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • 4.14, 4.15, 4.16, 4.17
    • Etcd
    • Important
    • None
    • 1
    • ETCD Sprint 260, ETCD Sprint 261, ETCD Sprint 262
    • 3
    • False
    • Release Note Not Required
    • In Progress

      This is a clone of issue OCPBUGS-38086. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-38015. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-37837. The following is the description of the original issue:

      In our vertical scaling test, after we delete a machine, we rely on the `status.readyReplicas` field of the ControlPlaneMachineSet (CPMS) to indicate that it has successfully created a new machine, which lets us scale up before we scale down.
      https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L76-L87

      As we've seen in the past, that status field isn't a reliable indicator of machine scale-up: `status.readyReplicas` might stay at 3 because the soon-to-be-removed node that is pending deletion can go Ready=Unknown, as in runs such as the following: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1286/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-etcd-scaling/1808186565449486336

      The test then ends up timing out while waiting for status.readyReplicas=4, even though the scale-up and scale-down may already have happened.
      This shows up across scaling tests on all platforms as:

      fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:81]: Unexpected error:
          <*errors.withStack | 0xc002182a50>: 
          scale-up: timed out waiting for CPMS to show 4 ready replicas: timed out waiting for the condition
          {
              error: <*errors.withMessage | 0xc00304c3a0>{
                  cause: <wait.errInterrupted>{
                      cause: <*errors.errorString | 0xc0003ca800>{
                          s: "timed out waiting for the condition",
                      },
                  },
                  msg: "scale-up: timed out waiting for CPMS to show 4 ready replicas",
              }, 

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-azure-ovn-etcd-scaling/1811686448848441344

      https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522etcd-scaling%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=net_improvement

      In hindsight, all we care about is whether the deleted machine's member is replaced by another machine's member; we can ignore the flapping of node and machine statuses while we wait for the scale-up and then scale-down of members to happen. So we can relax or replace the check on status.readyReplicas with one that only looks at the membership change.
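
A minimal sketch of what a membership-based check could look like, assuming the test can obtain the current etcd member names through some helper. The names `memberLister`, `memberReplaced`, and `waitForReplacement` are hypothetical and are not taken from openshift/origin; the lister here is a fake that replays a scale-up followed by a scale-down. The sketch only checks the end state (deleted member gone, a new member present, member count back to its initial size) and deliberately ignores node and machine readiness.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// memberLister is a hypothetical abstraction over "list the current etcd
// member names"; a real test would wrap an etcd client call here.
type memberLister func(ctx context.Context) ([]string, error)

// memberReplaced reports whether the deleted member is gone, a member that
// was not in the initial set has joined, and the member count is back to
// its initial size.
func memberReplaced(initial []string, deleted string, current []string) bool {
	known := make(map[string]bool, len(initial))
	for _, name := range initial {
		known[name] = true
	}
	sawNew := false
	for _, name := range current {
		if name == deleted {
			return false // the old member has not been removed yet
		}
		if !known[name] {
			sawNew = true
		}
	}
	return sawNew && len(current) == len(initial)
}

// waitForReplacement polls list until memberReplaced holds or ctx expires.
// Errors from list are treated as transient and simply retried.
func waitForReplacement(ctx context.Context, list memberLister, initial []string, deleted string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		current, err := list(ctx)
		if err == nil && memberReplaced(initial, deleted, current) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for member %q to be replaced: %w", deleted, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	initial := []string{"master-0", "master-1", "master-2"}
	// Fake membership snapshots: steady state, scale-up, then scale-down.
	snapshots := [][]string{
		{"master-0", "master-1", "master-2"},
		{"master-0", "master-1", "master-2", "master-3"},
		{"master-1", "master-2", "master-3"},
	}
	i := 0
	list := func(ctx context.Context) ([]string, error) {
		s := snapshots[i]
		if i < len(snapshots)-1 {
			i++
		}
		return s, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForReplacement(ctx, list, initial, "master-0", 10*time.Millisecond)
	fmt.Println("replaced:", err == nil)
}
```

Because the condition depends only on member names, a node flapping to Ready=Unknown while it is pending deletion would not affect it. Whether the real test should additionally assert the intermediate four-member state is a design choice this sketch leaves open.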

      PS: We can also update the outdated Godoc comments for the test to mention that it relies on the CPMSO to create a machine for us: https://github.com/openshift/origin/blob/3deedee4ae147a03afdc3d4ba86bc175bc6fc5a8/test/extended/etcd/vertical_scaling.go#L34-L38

            rhn-coreos-htariq Haseeb Tariq
            openshift-crt-jira-prow OpenShift Prow Bot
            Ge Liu Ge Liu