OpenShift Logging / LOG-6345

Elasticsearch Operator upgrade is stuck with `timed out waiting for node to rollout` when upgrading to the Red Hat Elasticsearch Operator 5.8.13


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: Logging 5.8.13
    • Component: Log Storage

      Description of problem:

      The Elasticsearch Operator upgrade is stuck with the error `timed out waiting for node to rollout` when upgrading to the Red Hat Elasticsearch Operator 5.8.13.

      The Elasticsearch CR status shows the following node upgrade state:

        nodes:
        - deploymentName: elasticsearch-cdm-apmko5qx-1
          upgradeStatus:
            scheduledUpgrade: "True"
            underUpgrade: "True"
            upgradePhase: preparationComplete
        - deploymentName: elasticsearch-cdm-apmko5qx-2
          upgradeStatus:
            scheduledUpgrade: "True"
            upgradePhase: controllerUpdated
        - deploymentName: elasticsearch-cdm-apmko5qx-3
          upgradeStatus:
            scheduledUpgrade: "True"
            upgradePhase: controllerUpdated
            
      

      While the Elasticsearch Operator is trying to roll out the deployments so that they use the new images shipped with the Elasticsearch Operator 5.8.13, the deployments are paused and shard allocation is set to `primaries`:

      // shard allocation
        shardAllocationEnabled: primaries
      
      // deployments in Paused
      $ oc get deployments -l component=elasticsearch  -o yaml |grep -i pause
          paused: true
            message: Deployment is paused
            reason: DeploymentPaused
          paused: true
            message: Deployment is paused
            reason: DeploymentPaused
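
      Both conditions can be confirmed directly (a minimal sketch, assuming the Elasticsearch CR is named `elasticsearch` in the `openshift-logging` namespace, which is the default created by the Cluster Logging Operator):

      // shard allocation state recorded in the Elasticsearch CR status
      $ oc get elasticsearch elasticsearch -n openshift-logging \
        -o jsonpath='{.status.shardAllocationEnabled}{"\n"}'

      // which Elasticsearch deployments are currently paused
      $ oc get deployments -n openshift-logging -l component=elasticsearch \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.paused}{"\n"}{end}'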
      

      The Elasticsearch Operator pod keeps logging the error `timed out waiting for node to rollout`:

      $ oc logs elasticsearch-operator-694b9889ff-f79j5 -n openshift-operators-redhat
      ...
      2024-10-27T12:31:25.500535646Z {"_ts":"2024-10-27T12:31:25.50046439Z","_level":"0","_component":"elasticsearch-operator_controllers_Elasticsearch","_message":"unable to update node","_error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-apmko5qx-1"},"cluster":"elasticsearch","namespace":"openshift-logging"}
      2024-10-27T12:31:56.116689158Z {"_ts":"2024-10-27T12:31:56.116614221Z","_level":"0","_component":"elasticsearch-operator_controllers_Elasticsearch","_message":"unable to update node","_error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-apmko5qx-1"},"cluster":"elasticsearch","namespace":"openshift-logging"}
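
      To keep watching for this error during the rollout, the operator logs can be followed and filtered (a sketch; the operator deployment name below is the one from this cluster):

      $ oc logs -f deployment/elasticsearch-operator -n openshift-operators-redhat \
        | grep "timed out waiting for node to rollout"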
      

      However, the Elasticsearch pod is up and has joined the Elasticsearch cluster:

      $ oc get pods -l component=elasticsearch
      NAME                                            READY   STATUS    RESTARTS   AGE
      elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj   2/2     Running   0          1d
      elasticsearch-cdm-apmko5qx-2-854c6bd8c-rp8gv    2/2     Running   0          29d
      elasticsearch-cdm-apmko5qx-3-59c99c9d66-rtlcw   2/2     Running   0          29d
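
      To double-check that the restarted node really joined the cluster, membership and health can be queried from one of the pods (a sketch reusing the `es_util` helper shipped in the Elasticsearch image; the pod name is the one from this cluster):

      $ oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj \
        -- es_util --query=_cat/nodes?v
      $ oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj \
        -- es_util --query=_cluster/health?pretty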
      

      Currently, some Elasticsearch pods are running the old images and some are already running the new images shipped with the Elasticsearch Operator 5.8.13:

      $ oc get pods -l component=elasticsearch -o yaml |grep image:
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:89629964e44058bafc90393a235b08c7c974f05513aecbfe7229134da732f2b5
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:a8f53dec42a46c5bf8ac7f2888848c01e70f405d27211e2a269730c888929faf
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:89629964e44058bafc90393a235b08c7c974f05513aecbfe7229134da732f2b5
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:a8f53dec42a46c5bf8ac7f2888848c01e70f405d27211e2a269730c888929faf
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
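
      The same information can be listed per pod to make it easier to see which nodes are still on the old digests (a sketch using a plain jsonpath expression):

      $ oc get pods -n openshift-logging -l component=elasticsearch \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'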
      

      VERIFICATIONS DONE

      • The connectivity between the Elasticsearch Operator pod and the Elasticsearch service is working and Elasticsearch queries return results
      • The connectivity between the Elasticsearch Operator pod and the Elasticsearch pods is working and Elasticsearch queries return results
      • No constraints or issues at the node, network, or storage level
      • If the Elasticsearch Operator and the Cluster Logging Operator are set to Unmanaged, the Elasticsearch cluster recovers and becomes green and healthy (see the sketch after this list)
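
      For reference, this is how the operators can be set to Unmanaged (a sketch, assuming the default resource names: `instance` for the ClusterLogging CR and `elasticsearch` for the Elasticsearch CR, both in `openshift-logging`):

      $ oc patch clusterlogging instance -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'
      $ oc patch elasticsearch elasticsearch -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'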

      Version-Release number of selected component (if applicable):

      Elasticsearch Operator 5.8.13

      How reproducible:

      Not able to reproduce

      Actual results:

      Currently, the Elasticsearch Operator is in an inconsistent state: it is not able to finish the upgrade, and some Elasticsearch pods are using the new image while others are still using the old one.

      When the Cluster Logging Operator and the Elasticsearch Operator are moved to `Unmanaged` and shard allocation is re-enabled to `all` with the command below, the Elasticsearch cluster recovers and becomes healthy and green, showing that the problem is not at the Elasticsearch pod level:

      $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
        -- es_util --query=_cluster/settings?pretty \
        -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
      

      If the Cluster Logging Operator and the Elasticsearch Operator are moved back to `Managed`, the Operator will try the rollout again, as it still considers the Elasticsearch node `elasticsearch-cdm-apmko5qx-1` to be in an inconsistent state.
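
      This can be observed by switching back to Managed and watching the per-node upgrade status in the Elasticsearch CR (a sketch, with the same resource name assumptions as above):

      $ oc patch clusterlogging instance -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Managed"}}'
      $ oc patch elasticsearch elasticsearch -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Managed"}}'
      $ oc get elasticsearch elasticsearch -n openshift-logging \
        -o jsonpath='{range .status.nodes[*]}{.deploymentName}{"\t"}{.upgradeStatus}{"\n"}{end}'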

      Expected results:

      It should be possible to move the Cluster Logging Operator and the Elasticsearch Operator back to `Managed` and have the rollout of the Elasticsearch pods finish with the images provided with the Elasticsearch Operator 5.8.13, with the least disruption possible, that is, without service disruption.
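
      A possible way to verify that end state (a sketch): the Elasticsearch CR should report `shardAllocationEnabled: all`, no node should remain with `underUpgrade: "True"`, and the cluster health should stay green throughout the rollout:

      $ oc get elasticsearch elasticsearch -n openshift-logging \
        -o jsonpath='{.status.shardAllocationEnabled}{"\n"}'
      $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
        -- es_util --query=_cluster/health?pretty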

              Assignee: Unassigned
              Reporter: Oscar Casal Sanchez (rhn-support-ocasalsa)