OpenShift Logging / LOG-6345

Elasticsearch Operator upgrade is stuck with `timed out waiting for node to rollout` when upgrading to the Red Hat Elasticsearch Operator 5.8.13


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: Logging 5.8.13
    • Component: Log Storage

      Description of problem:

      The Elasticsearch Operator upgrade is stuck with the error `timed out waiting for node to rollout` when upgrading to the Red Hat Elasticsearch Operator 5.8.13.

      The Elasticsearch CR status shows the following node upgrade state:

        nodes:
        - deploymentName: elasticsearch-cdm-apmko5qx-1
          upgradeStatus:
            scheduledUpgrade: "True"
            underUpgrade: "True"
            upgradePhase: preparationComplete
        - deploymentName: elasticsearch-cdm-apmko5qx-2
          upgradeStatus:
            scheduledUpgrade: "True"
            upgradePhase: controllerUpdated
        - deploymentName: elasticsearch-cdm-apmko5qx-3
          upgradeStatus:
            scheduledUpgrade: "True"
            upgradePhase: controllerUpdated
            
      

      While the Elasticsearch Operator is trying to roll out the deployments so that they use the new images shipped with the Elasticsearch Operator 5.8.13, the deployments are paused and shard allocation is set to `primaries`:

      // shard allocation
        shardAllocationEnabled: primaries
      
      // deployments in Paused
      $ oc get deployments -l component=elasticsearch  -o yaml |grep -i pause
          paused: true
            message: Deployment is paused
            reason: DeploymentPaused
          paused: true
            message: Deployment is paused
            reason: DeploymentPaused
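
      Both conditions can be confirmed directly (a minimal sketch, assuming the Elasticsearch CR is named `elasticsearch` in the `openshift-logging` namespace, which is the default created by the Cluster Logging Operator):

      // shard allocation state recorded in the Elasticsearch CR status
      $ oc get elasticsearch elasticsearch -n openshift-logging \
        -o jsonpath='{.status.shardAllocationEnabled}{"\n"}'

      // which Elasticsearch deployments are currently paused
      $ oc get deployments -n openshift-logging -l component=elasticsearch \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.paused}{"\n"}{end}'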
      

      The Elasticsearch Operator pod keeps logging the error `timed out waiting for node to rollout`:

      $ oc logs elasticsearch-operator-694b9889ff-f79j5 -n openshift-operators-redhat
      ...
      2024-10-27T12:31:25.500535646Z {"_ts":"2024-10-27T12:31:25.50046439Z","_level":"0","_component":"elasticsearch-operator_controllers_Elasticsearch","_message":"unable to update node","_error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-apmko5qx-1"},"cluster":"elasticsearch","namespace":"openshift-logging"}
      2024-10-27T12:31:56.116689158Z {"_ts":"2024-10-27T12:31:56.116614221Z","_level":"0","_component":"elasticsearch-operator_controllers_Elasticsearch","_message":"unable to update node","_error":{"msg":"timed out waiting for node to rollout","node":"elasticsearch-cdm-apmko5qx-1"},"cluster":"elasticsearch","namespace":"openshift-logging"}
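
      To keep watching for this error during the rollout, the operator logs can be followed and filtered (a sketch; the operator deployment name below is the one from this cluster):

      $ oc logs -f deployment/elasticsearch-operator -n openshift-operators-redhat \
        | grep "timed out waiting for node to rollout"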
      

      However, the Elasticsearch pod is up and has joined the Elasticsearch cluster:

      $ oc get pods -l component=elasticsearch
      NAME                                            READY   STATUS    RESTARTS   AGE
      elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj   2/2     Running   0          1d
      elasticsearch-cdm-apmko5qx-2-854c6bd8c-rp8gv    2/2     Running   0          29d
      elasticsearch-cdm-apmko5qx-3-59c99c9d66-rtlcw   2/2     Running   0          29d
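
      To double-check that the restarted node really joined the cluster, membership and health can be queried from one of the pods (a sketch reusing the `es_util` helper shipped in the Elasticsearch image; the pod name is the one from this cluster):

      $ oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj \
        -- es_util --query=_cat/nodes?v
      $ oc exec -n openshift-logging -c elasticsearch elasticsearch-cdm-apmko5qx-1-6f54cdcc65-rv5xj \
        -- es_util --query=_cluster/health?pretty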
      

      Currently, some Elasticsearch pods are running the old images and some are already running the new images shipped with the Elasticsearch Operator 5.8.13:

      $ oc get pods -l component=elasticsearch -o yaml |grep image:
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:89629964e44058bafc90393a235b08c7c974f05513aecbfe7229134da732f2b5
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:a8f53dec42a46c5bf8ac7f2888848c01e70f405d27211e2a269730c888929faf
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:89629964e44058bafc90393a235b08c7c974f05513aecbfe7229134da732f2b5
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:a8f53dec42a46c5bf8ac7f2888848c01e70f405d27211e2a269730c888929faf
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
            image: registry.redhat.io/openshift-logging/elasticsearch6-rhel9@sha256:bc97b8e13087050dfb1a75b02c5b780cbb8fb12a849a655169d072ba8bbf42b4
            image: registry.redhat.io/openshift-logging/elasticsearch-proxy-rhel9@sha256:1d1d7d47b616995d18692f3a2d6232d7e1a0f41bc1503d87100f31939e080a78
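
      The same information can be listed per pod to make it easier to see which nodes are still on the old digests (a sketch using a plain jsonpath expression):

      $ oc get pods -n openshift-logging -l component=elasticsearch \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'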
      

      VERIFICATIONS DONE

      • The connectivity between the Elasticsearch Operator pod and the Elasticsearch service is working and Elasticsearch queries return results
      • The connectivity between the Elasticsearch Operator pod and the Elasticsearch pods is working and Elasticsearch queries return results
      • No constraints or issues at the node, network, or storage level
      • If the Elasticsearch Operator and the Cluster Logging Operator are set to Unmanaged, the Elasticsearch cluster recovers and becomes green and healthy (see the sketch after this list)
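
      For reference, this is how the operators can be set to Unmanaged (a sketch, assuming the default resource names: `instance` for the ClusterLogging CR and `elasticsearch` for the Elasticsearch CR, both in `openshift-logging`):

      $ oc patch clusterlogging instance -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'
      $ oc patch elasticsearch elasticsearch -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Unmanaged"}}'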

      Version-Release number of selected component (if applicable):

      Elasticsearch Operator 5.8.13

      How reproducible:

      Not able to reproduce

      Actual results:

      Currently, the Elasticsearch Operator is in an inconsistent state: it is not able to finish the upgrade, and some Elasticsearch pods are using the new image while others are still using the old one.

      When the Cluster Logging Operator and the Elasticsearch Operator are moved to `Unmanaged` and shard allocation is re-enabled to `all` with the command below, the Elasticsearch cluster recovers and becomes healthy and green, showing that the problem is not at the Elasticsearch pod level:

      $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
        -- es_util --query=_cluster/settings?pretty \
        -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
      

      If the Cluster Logging Operator and the Elasticsearch Operator are moved back to `Managed`, the Operator will try the rollout again, as it still considers the Elasticsearch node `elasticsearch-cdm-apmko5qx-1` to be in an inconsistent state.
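
      This can be observed by switching back to Managed and watching the per-node upgrade status in the Elasticsearch CR (a sketch, with the same resource name assumptions as above):

      $ oc patch clusterlogging instance -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Managed"}}'
      $ oc patch elasticsearch elasticsearch -n openshift-logging --type merge \
        -p '{"spec":{"managementState":"Managed"}}'
      $ oc get elasticsearch elasticsearch -n openshift-logging \
        -o jsonpath='{range .status.nodes[*]}{.deploymentName}{"\t"}{.upgradeStatus}{"\n"}{end}'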

      Expected results:

      It should be possible to move the Cluster Logging Operator and the Elasticsearch Operator back to `Managed` and have the rollout of the Elasticsearch pods finish with the images provided with the Elasticsearch Operator 5.8.13, with the least disruption possible, that is, without service disruption.
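
      A possible way to verify that end state (a sketch): the Elasticsearch CR should report `shardAllocationEnabled: all`, no node should remain with `underUpgrade: "True"`, and the cluster health should stay green throughout the rollout:

      $ oc get elasticsearch elasticsearch -n openshift-logging \
        -o jsonpath='{.status.shardAllocationEnabled}{"\n"}'
      $ oc exec -n openshift-logging -c elasticsearch $ES_POD_NAME \
        -- es_util --query=_cluster/health?pretty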

              Assignee: Unassigned
              Reporter: Oscar Casal Sanchez (rhn-support-ocasalsa)