• Bug
    • Resolution: Done-Errata
    • Minor
    • 4.16.0
    • 4.13.z, 4.12.z, 4.14.z, 4.15.z, 4.16.0
    • Etcd
    • None
    • No
    • False
    • None
    • Previously, the etcd Cluster Operator panicked during pod health checks, which caused requests to an `etcd` cluster to fail. With this release, the issue is resolved so that these panics no longer occur. (link:https://issues.redhat.com/browse/OCPBUGS-27959[*OCPBUGS-27959*])
    • Bug Fix
    • Done

      In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:

      E0125 11:04:58.158222       1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded)
      panic: send on closed channel
      
      goroutine 15608 [running]:
      github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2
      created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
      	github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5
      
      

      Unfortunately, the log file is incomplete. The operator recovered by restarting, but we should fix the panic nonetheless.

      Job run for reference:
      https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288

            [OCPBUGS-27959] Panic: send on closed channel

            Errata Tool added a comment - Since the problem described in this issue should be resolved in a recent advisory, it has been closed. For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:0041

            Thomas Jungblut added a comment - geliu, this is a technicality that's difficult for you to reproduce, so I've set it to verified and will monitor the CI logs going forward.

            OpenShift Jira Bot added a comment - Looks like this bug is far enough along in the workflow that a code fix is ready. Customers and support need to know the backport plan. Please complete the "Target Backport Versions" field to indicate which version(s) will receive the fix.

            Thomas Jungblut added a comment - https://github.com/openshift/cluster-etcd-operator/blob/1e1970f23b7a588953ad47617108a683ecc4f30f/pkg/etcdcli/health.go#L61-L72

            When we hit the timeout, we close the channel, but the goroutine might still be running and then attempt to send on the closed resChan, which causes the panic.

              tjungblu@redhat.com Thomas Jungblut
              tjungblu@redhat.com Thomas Jungblut
              Ge Liu