OpenShift Bugs / OCPBUGS-43379

etcd-scaling jobs failing ~60% of the time

Type: Bug
Resolution: Unresolved
Affects Version: 4.18.0
Component: Etcd
Priority: Critical
Sprint: ETCD Sprint 263
      While designing a solution to have these rarely run jobs included in component readiness, I discovered the etcd-scaling job has been quite broken for some time. The problem appears to be invariant tests flagging "unexpected" things happening in the cluster.

      It's possible some or all of these boil down to "this is expected during an etcd scaling operation" if a strong case can be made.
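If the eventual answer is "this is expected during scaling," the exception logic might look roughly like the sketch below: only flag Available=False transitions that fall outside the scaling window. All names, timestamps, and the window itself are hypothetical; the real origin invariant tests operate on monitor intervals, not this simplified model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ConditionChange:
    operator: str
    condition: str
    status: str
    at: datetime

def unexpected_changes(changes, scaling_start, scaling_end):
    """Keep Available=False transitions that fall outside the etcd scaling window."""
    return [
        c for c in changes
        if c.condition == "Available" and c.status == "False"
        and not (scaling_start <= c.at <= scaling_end)
    ]

# Hypothetical data: one flap during scaling, one well after it.
start = datetime(2024, 10, 9, 12, 0)
end = start + timedelta(minutes=15)
events = [
    ConditionChange("kube-storage-version-migrator", "Available", "False",
                    start + timedelta(minutes=5)),
    ConditionChange("operator-lifecycle-manager-packageserver", "Available", "False",
                    end + timedelta(minutes=30)),
]
flagged = unexpected_changes(events, start, end)
print([c.operator for c in flagged])  # only the out-of-window flap is reported
```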

      [bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
      

      This one seems very common, examples:
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling/1844042416286339072
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling/1841505588492636160
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling/1831358169155112960

      [bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
      

      Examples:

      [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info
      

      Examples:

      [sig-node] node-lifecycle detects unexpected not ready node
      [sig-node] node-lifecycle detects unreachable state on node
      

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling/1828821429399851008

      It's likely more examples could be found here.

      There is a lot to unravel here, but is it acceptable for operators (seemingly several of them) to go Available=False (a serious condition that would often result in someone getting alerted) during an etcd scaling operation?
      The same question applies to unreachable nodes and etcd member-down alerts.
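One way to frame "acceptable" is by duration: a brief Available=False blip during a scaling operation is different from a sustained outage. A minimal sketch, with an assumed grace period and made-up intervals (the tolerance is not an origin constant):

```python
from datetime import datetime, timedelta

# Hypothetical Available=False intervals (start, end) observed for one operator.
intervals = [
    (datetime(2024, 10, 9, 12, 4), datetime(2024, 10, 9, 12, 6)),   # 2m blip during scaling
    (datetime(2024, 10, 9, 13, 0), datetime(2024, 10, 9, 13, 20)),  # 20m sustained outage
]

TOLERANCE = timedelta(minutes=5)  # assumed grace period, chosen for illustration

def serious(ivs, tolerance=TOLERANCE):
    """Keep only Available=False intervals that outlast the grace period."""
    return [(s, e) for s, e in ivs if e - s > tolerance]

print(serious(intervals))  # only the 20-minute outage survives the filter
```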

              Jubitta John (rh-ee-jujohn)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Ge Liu