Type: Bug
Resolution: Unresolved
Priority: Critical
Version: Logging 5.8.13
Status: NEW
Doc Type: Bug Fix
Severity: Critical
Description of problem:
The Elasticsearch Operator upgrade is stuck with the error `timed out waiting for node to rollout` while upgrading to the Red Hat Elasticsearch Operator 5.8.13.
The Elasticsearch CR shows the following node status:
nodes:
- deploymentName: elasticsearch-cdm-apmko5qx-1
  upgradeStatus:
    scheduledUpgrade: "True"
    underUpgrade: "True"
    upgradePhase: preparationComplete
- deploymentName: elasticsearch-cdm-apmko5qx-2
  upgradeStatus:
    scheduledUpgrade: "True"
    upgradePhase: controllerUpdated
- deploymentName: elasticsearch-cdm-apmko5qx-3
  upgradeStatus:
    scheduledUpgrade: "True"
    upgradePhase: controllerUpdated
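For reference, the per-node upgrade status above can be read directly from the Elasticsearch CR; this assumes the default CR name `elasticsearch` in the `openshift-logging` namespace as created by the Cluster Logging Operator (adjust if the deployment differs):
$ oc get elasticsearch elasticsearch -n openshift-logging -o yaml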
While the Elasticsearch Operator is trying to roll out the deployments to use the new images shipped with the Elasticsearch Operator 5.8.13, the deployments are paused and shard allocation is set to `primaries`:
// shard allocation
shardAllocationEnabled: primaries

// deployments in Paused
$ oc get deployments -l component=elasticsearch -o yaml | grep -i pause
  paused: true
      message: Deployment is paused
      reason: DeploymentPaused
  paused: true
      message: Deployment is paused
      reason: DeploymentPaused
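As a cross-check, the live allocation setting can also be queried from inside one of the Elasticsearch pods with `es_util` (this is a sketch; `$ES_POD_NAME` is assumed to hold the name of any running Elasticsearch pod, as in the command further below):
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cluster/settings?pretty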
The Elasticsearch Operator pod keeps logging the error `timed out waiting for node to rollout`:
$ oc logs elasticsearch-operator-694b9889ff-f79j5 -n openshift-operators-redhat
...
2024-10-27T12:31:25.500535646Z {"_ts":"2024-10-27T12:31:25.50046439Z","_level":"0","_component":"elasticsearch-operator_controllers_Elasticsearch","_message":"unable to update node","_error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-apmko5qx-1"},"cluster":"elasticsearch","namespace":"openshift-logging"}
2024-10-27T12:31:56.116689158Z {"_ts":"2024-10-27T12:31:56.116614221Z","_level":"0","_component":"elasticsearch-operator_controllers_Elasticsearch","_message":"unable to update node","_error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-apmko5qx-1"},"cluster":"elasticsearch","namespace":"openshift-logging"}
However, the Elasticsearch pod is up and has joined the Elasticsearch cluster:
$ oc get pods -l component=elasticsearch
NAME                                            READY   STATUS    RESTARTS   AGE
elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj   2/2     Running   0          1d
elasticsearch-cdm-apmko5qx-2-854c6bd8c-rp8gv    2/2     Running   0          29d
elasticsearch-cdm-apmko5qx-3-59c99c9d66-rtlcw   2/2     Running   0          29d
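Cluster membership and health can be double-checked from inside one of the pods, for example (again assuming `$ES_POD_NAME` points to a running Elasticsearch pod):
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/nodes?v
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
  -- es_util --query=_cat/health?v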
Currently, some Elasticsearch pods are running the old images and some the new images delivered with the Elasticsearch Operator 5.8.13:
$ oc get pods -l component=elasticsearch -o yaml | grep image:
    image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:89629964e44058bafc90393a235b08c7c974f05513aecbfe7229134da732f2b5
    image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:a8f53dec42a46c5bf8ac7f2888848c01e70f405d27211e2a269730c888929faf
    image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:89629964e44058bafc90393a235b08c7c974f05513aecbfe7229134da732f2b5
    image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:a8f53dec42a46c5bf8ac7f2888848c01e70f405d27211e2a269730c888929faf
    image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
    image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
    image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
    image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
    image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
    image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
    image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
    image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
VERIFICATIONS DONE
- The connectivity between the Elasticsearch Operator pod and the Elasticsearch service works, and Elasticsearch queries return results
- The connectivity between the Elasticsearch Operator pod and the Elasticsearch pods works, and Elasticsearch queries return results
- No constraints or issues at the node, network, or storage level
- If the Elasticsearch Operator and the Cluster Logging Operator are set to Unmanaged, the Elasticsearch cluster recovers and becomes green and healthy (see the sketch after this list)
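A minimal sketch of how the operators were set to `Unmanaged` for that last verification, assuming the default CR names `instance` (ClusterLogging) and `elasticsearch` (Elasticsearch) in `openshift-logging`:
$ oc patch clusterlogging instance -n openshift-logging \
  --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
$ oc patch elasticsearch elasticsearch -n openshift-logging \
  --type merge -p '{"spec":{"managementState":"Unmanaged"}}'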
Version-Release number of selected component (if applicable):
Elasticsearch Operator 5.8.13
How reproducible:
Not able to reproduce
Actual results:
Currently, the Elasticsearch Operator is in an inconsistent state: it is unable to finish the upgrade, and some Elasticsearch pods are using the new image while others are still using the old one.
When the Cluster Logging Operator and the Elasticsearch Operator are moved to `Unmanaged` and shard allocation is re-enabled to `all` with the command below, the Elasticsearch cluster recovers and becomes healthy and green, showing that the problem is not at the Elasticsearch pod level:
$ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
-- es_util --query=_cluster/settings?pretty \
-X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
If the Cluster Logging Operator and the Elasticsearch Operator are moved back to `Managed`, the Operator attempts the rollout again, since it still considers the Elasticsearch pod `elasticsearch-cdm-apmko5qx-1` to be in an inconsistent state.
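For reference, moving both operators back to `Managed` is the reverse of the `Unmanaged` patch sketched above (same assumed CR names):
$ oc patch clusterlogging instance -n openshift-logging \
  --type merge -p '{"spec":{"managementState":"Managed"}}'
$ oc patch elasticsearch elasticsearch -n openshift-logging \
  --type merge -p '{"spec":{"managementState":"Managed"}}'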
Expected results:
It should be possible to move the Cluster Logging Operator and the Elasticsearch Operator back to `Managed`, with the Operator finishing the rollout of the Elasticsearch pods to the images provided with Elasticsearch Operator 5.8.13 with as little disruption as possible, that is, without disruption of the service.