[OCPBUGS-27959] Panic: send on closed channel

Type: Bug
Resolution: Done-Errata
Priority: Minor
Fix Version/s: 4.16.0
Affects Version/s: 4.13.z, 4.12.z, 4.14.z, 4.15.z, 4.16.0
Component/s: Etcd
Labels:
None

Regression:
No
Blocked:
False
Blocked Reason:

None
Release Note Text:

* Previously, the etcd Cluster Operator panicked during pod health checks, which caused requests to an `etcd` cluster to fail. With this release, the issue is resolved so that these panics no longer occur. (link:https://issues.redhat.com/browse/OCPBUGS-27959[*OCPBUGS-27959*])
Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.16.0
Target Backport Versions:

4.13.z, 4.12.z, 4.14.z, 4.15.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:

E0125 11:04:58.158222       1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded)
panic: send on closed channel

goroutine 15608 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5

Unfortunately, this log file is incomplete. The operator recovered by restarting, but we should fix the panic nonetheless.

Job run for reference:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288

openshift-etcd-operator_etcd-operator-6b96c5f978-dsktr_etcd-operator_previous.log
418 kB
2024/01/25 1:01 PM

blocks

OCPBUGS-28628 [4.15] Panic: send on closed channel

Closed

is cloned by

OCPBUGS-28628 [4.15] Panic: send on closed channel

Closed

is related to

OCPBUGS-36301 [4.17] Should run health checks in parallel to avoid spurious Available=False EtcdMembers_NoQuorum claims

Closed

links to

openshift/cluster-etcd-operator#1190: OCPBUGS-27959: fix panic in health check timeouts

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update

Errata Tool added a comment - 2024/06/27 11:36 AM

Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2024:0041

Thomas Jungblut added a comment - 2024/01/30 2:22 PM

geliu, this is a technicality that's difficult for you to reproduce, so I've set it to verified and will monitor the CI logs going forward.

OpenShift Jira Bot added a comment - 2024/01/30 8:48 AM

Looks like this bug is far enough along in the workflow that a code fix is ready. Customers and support need to know the backport plan. Please complete the "Target Backport Versions" field to indicate which version(s) will receive the fix.

Thomas Jungblut added a comment - 2024/01/26 4:10 PM

https://github.com/openshift/cluster-etcd-operator/blob/1e1970f23b7a588953ad47617108a683ecc4f30f/pkg/etcdcli/health.go#L61-L72

When we hit the timeout, we close the channel but the goroutine might still continue and then attempt to send to a closed resChan.
