Type: Bug
Resolution: Duplicate
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14, 4.19
Component/s: Etcd
Labels: pmr-ai
Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Moderate
Regression: None
Target Backport Versions: None
Target Version: None
Release Blocker: None
Sprint: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
PX Impact Score:
Release Note Status: None
Release Note Type: None
Release Note Text: None
Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem:

In a live 4.14.35 production cluster undergoing a control-plane replacement, administrators deleted Machines, and the ControlPlaneMachineSet controller worked with the etcd operator to bring in replacements. During this process, etcdMembersDown fired for one of the outgoing instances. That member's exit was expected, but a critical alert firing during a healthy removal caused concern and prompted manual checks to confirm that the removal was in fact proceeding without issue. We should adjust the alert so that it does not fire during these smooth control-plane replacements, to avoid distracting future administrators.
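To make the requested behavior concrete, here is a minimal sketch of the intended semantics. It is an illustration only, not the existing alert rule or the etcd operator's logic. The `Member` type, the `machine_deleting` flag, and `members_to_alert_on` are hypothetical names introduced for this example; how the alert would actually learn that a Machine is being deleted is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical view of one etcd member, for illustration only."""
    name: str
    up: bool
    # Assumption: some signal exists that the backing Machine is being
    # deleted as part of a planned control-plane replacement.
    machine_deleting: bool


def members_to_alert_on(members):
    """Return names of members that are down and NOT part of a planned removal."""
    return [m.name for m in members if not m.up and not m.machine_deleting]


# A planned replacement: master-1 is down, but its Machine is being deleted.
planned = [
    Member("master-0", up=True, machine_deleting=False),
    Member("master-1", up=False, machine_deleting=True),
    Member("master-2", up=True, machine_deleting=False),
]

# An unplanned outage: master-1 is down and nobody asked for its removal.
unplanned = [
    Member("master-0", up=True, machine_deleting=False),
    Member("master-1", up=False, machine_deleting=False),
    Member("master-2", up=True, machine_deleting=False),
]
```

Under this sketch, the planned case yields no members to alert on, while the unplanned case still reports master-1.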

Version-Release number of selected component

Seen in a 4.14.35 cluster, but also turns up in dev/4.19 CI:

$ w3m -dump -cols 200 'https://search.dptools.openshift.org/?search=etcdMembersDown.*firing+for+%5B%5E0%5D&maxAge=24h&type=junit' | grep 'failures match'
pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-gcp-ovn-etcd-scaling (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-vsphere-ovn-etcd-scaling (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
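The percentages in these summary lines combine multiplicatively: impact is the failed fraction times the fraction of failures that match. The following sketch parses the three lines above and estimates how many runs hit the alert. The helper names (`parse_summary`, `matching_runs`, `total_matching`) are made up for this example, and the run counts are estimates recovered from rounded percentages.

```python
import re

# One CI Search summary line looks like:
#   <job> (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
SUMMARY = re.compile(
    r"^(?P<job>\S+) \(all\) - (?P<runs>\d+) runs, "
    r"(?P<failed>\d+)% failed, (?P<match>\d+)% of failures match "
    r"= (?P<impact>\d+)% impact$"
)


def parse_summary(line):
    """Return (job, runs, failed %, match %, reported impact %) for one line."""
    m = SUMMARY.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    return (m["job"], int(m["runs"]), int(m["failed"]),
            int(m["match"]), int(m["impact"]))


def matching_runs(runs, failed_pct, match_pct):
    """Estimate how many runs hit the alert, from the rounded percentages."""
    return round(runs * failed_pct / 100 * match_pct / 100)


# The three summary lines quoted in this report.
REPORTED = [
    "pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-gcp-ovn-etcd-scaling (all) - 3 runs, 67% failed, 100% of failures match = 67% impact",
    "pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-vsphere-ovn-etcd-scaling (all) - 3 runs, 67% failed, 50% of failures match = 33% impact",
    "periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact",
]


def total_matching(lines):
    """Return (estimated runs that hit the alert, total runs) across all lines."""
    hit = total = 0
    for line in lines:
        _, runs, failed, match, _ = parse_summary(line)
        hit += matching_runs(runs, failed, match)
        total += runs
    return hit, total
```

Applied to `REPORTED`, `total_matching` gives an estimated 4 matching runs out of 7 over the 24-hour search window.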

How reproducible

According to my CI Search results, roughly half of all etcd-scaling CI runs reproduce this alert issue.

Steps to Reproduce

1. Run an etcd-scaling CI job.
2. Check the result of the "alert/etcdMembersDown should not be at or above info" test-case.

Actual results

Sometimes the test-case passes, but about half the time it fails because the alert fires. For example, in this run:

: [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info	0s
{  etcdMembersDown was at or above info for at least 1m58s on platformidentification.JobType{Release:"4.19", FromRelease:"", Platform:"gcp", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 14m34s, firing for 1m58s:

Dec 20 17:05:27.811 - 118s  E namespace/openshift-etcd pod/etcd-ci-op-4cyd0vfv-c7885-s78z5-master-1 alert/etcdMembersDown alertstate/firing severity/critical ALERTS{alertname="etcdMembersDown", alertstate="firing", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-4cyd0vfv-c7885-s78z5-master-1", prometheus="openshift-monitoring/k8s", service="etcd", severity="critical"}}
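The test summary reports durations in Go style (`1m58s`, `14m34s`), while the interval line reports plain seconds (`118s`). As a quick consistency check, this small sketch converts both forms to seconds; `go_duration_seconds` is a helper written for this example and handles only the `h`, `m`, and `s` units seen here.

```python
import re

_UNITS = {"h": 3600, "m": 60, "s": 1}


def go_duration_seconds(text):
    """Convert a duration such as "1m58s", "14m34s", or "118s" to whole seconds."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject anything the pattern did not fully consume (e.g. "1.5s", "2d").
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(num) * _UNITS[unit] for num, unit in parts)


firing = go_duration_seconds("1m58s")     # from "firing for 1m58s"
interval = go_duration_seconds("118s")    # from the interval line
pending = go_duration_seconds("14m34s")   # from "pending for 14m34s"
```

Here `firing` and `interval` both come to 118 seconds, so the single firing interval shown accounts for the whole reported firing time, and `pending` comes to 874 seconds.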

Expected results

Test-case passes reliably, because the alert doesn't fire.

screenshot-3.png (179 kB, 2024/12/20 11:19 PM)
screenshot-2.png (124 kB, 2024/12/20 11:11 PM)
screenshot-1.png (232 kB, 2024/12/20 11:10 PM)

is duplicated by

OCPBUGS-58386 etcdMembersDown firing when CPMS is replacing machines

Assignee: Dean West
Reporter: W. Trevor King
Need Info From: None
Contributors: None
QA Contact: Ge Liu
Doc Contact: None
Votes: 0
Watchers: 2
Created: 2024/12/20 10:57 PM
Updated: 2025/07/21 6:01 AM
Resolved: 2025/07/21 6:01 AM
