OCPBUGS-37833

HCP Machinepool Upgrade - Stuck for more than 3 hours: etcd out of DB Space


    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.15.z
    • HyperShift
    • Quality / Stability / Reliability
    • Hypershift Sprint 258, Hypershift Sprint 259

      Description of problem:

      Performing a machinepool upgrade on a 503-node ROSA HCP cluster from OCP 4.15.17 to OCP 4.15.22, with maxUnavailable set to 50%,
      while the cluster is pre-loaded with the cluster-density-v2 workload
      (https://kube-burner.github.io/kube-burner-ocp/latest/#cluster-density-v2),
      the upgrade stalls because the hosted cluster's etcd runs out of DB space.

      Version-Release number of selected component (if applicable):

      Control-Plane: 4.15.22
      Machinepool: Upgrading from 4.15.17 --> 4.15.22

      Steps to Reproduce:

          1. kube-burner-ocp cluster-density-v2 --iterations=4509 --churn=false --gc=false
          2. rosa edit machinepool --max-surge=0% --max-unavailable=50% --cluster=2cs9mdk9eeopmhqf5f69n48ojo8qofc0 <worker-0|worker-1|worker-2>
          3. rosa upgrade machinepool <worker-0|worker-1|worker-2> -y -c 2cs9mdk9eeopmhqf5f69n48ojo8qofc0 --version 4.15.22     
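
      Progress of the rolling node replacement can be watched while the upgrade runs. A minimal sketch, using the machinepool name and cluster ID from the steps above (the exact "rosa describe machinepool" invocation is an assumption about the installed rosa CLI, and cordoned nodes are only a rough proxy for in-flight replacements):

      $ rosa describe machinepool worker-0 -c 2cs9mdk9eeopmhqf5f69n48ojo8qofc0
      $ oc get nodes | grep -c SchedulingDisabled    # nodes currently cordoned/drained for replacement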

      Actual results:

      The upgrade is not progressing even after 3 hours.

      Expected results:

      The upgrade completes successfully.

      Additional info:

      =======================================================
      $ oc get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      console                                    4.15.22   True        False         False      14h     
      csi-snapshot-controller                    4.15.22   True        False         False      13h     
      dns                                        4.15.22   True        True          False      14h     DNS "default" reports Progressing=True: "Have 488 available node-resolver pods, want 493."
      image-registry                             4.15.22   True        True          False      14h     Progressing: The deployment has not completed...
      ingress                                    4.15.22   True        True          False      4h22m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 of 2 updated replica(s) are available......
      insights                                   4.15.22   True        False         False      14h     
      kube-apiserver                             4.15.22   True        False         False      14h     
      kube-controller-manager                    4.15.22   True        False         False      14h     
      kube-scheduler                             4.15.22   True        False         False      14h     
      kube-storage-version-migrator              4.15.22   True        False         False      6h24m   
      monitoring                                 4.15.22   Unknown     True          Unknown    4h2m    Rolling out the stack.
      network                                    4.15.22   True        True          True       14h     DaemonSet "/openshift-multus/multus" rollout is not making progress - pod multus-2hxt5 is in CrashLoopBackOff State...
      node-tuning                                4.15.22   True        True          False      3h52m   Waiting for 93/493 Profiles to be applied
      openshift-apiserver                        4.15.22   True        False         False      14h     
      openshift-controller-manager               4.15.22   True        False         False      14h     
      openshift-samples                          4.15.22   True        False         False      13h     
      operator-lifecycle-manager                 4.15.22   True        False         False      14h     
      operator-lifecycle-manager-catalog         4.15.22   True        False         False      14h     
      operator-lifecycle-manager-packageserver   4.15.22   True        False         False      14h     
      service-ca                                 4.15.22   True        False         False      14h     
      storage                                    4.15.22   True        True          False      13h     AWSEBSCSIDriverOperatorCRProgressing: AWSEBSDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
      $
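
      The network operator message above points at a crash-looping multus pod; a first triage step might be the following (pod name taken from the operator message; the pod may already have been recreated under a different name):

      $ oc -n openshift-multus describe pod multus-2hxt5
      $ oc -n openshift-multus logs multus-2hxt5 --previous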
      =======================================================
      $ oc logs etcd-2 -c etcd
      {"level":"warn","ts":"2024-08-01T12:49:27.855883Z","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"47.807µs","request":"header:<ID:2186938574361351123 username:\"etcd-client\" auth_revision:1 > txn:<compare:<target:MOD key:\"/kubernetes.io/pods/cluster-density-v2-3069/client-2-84b8b6777-mt8c2\" mod_revision:9443487 > success:<request_put:<key:\"/kubernetes.io/pods/cluster-density-v2-3069/client-2-84b8b6777-mt8c2\" value_size:8947 >> failure:<request_range:<key:\"/kubernetes.io/pods/cluster-density-v2-3069/client-2-84b8b6777-mt8c2\" > >>","response":"size:20","error":"etcdserver: no space"}
      $
      =======================================================
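
      The "etcdserver: no space" error above means the etcd backend quota alarm (NOSPACE) has been raised on the hosted cluster's etcd, which blocks further writes until space is reclaimed and the alarm is disarmed. A minimal sketch for confirming this from the management cluster; the hosted-control-plane namespace is a placeholder, and it assumes the etcd container's etcdctl environment is pre-configured with client certs, as in standard OpenShift etcd pods:

      $ HCP_NS=<hosted-control-plane-namespace>    # placeholder, not taken from this bug
      $ oc exec -n $HCP_NS etcd-2 -c etcd -- etcdctl endpoint status --cluster -w table
      $ oc exec -n $HCP_NS etcd-2 -c etcd -- etcdctl alarm list
      # after space is reclaimed (compaction/defragmentation or a larger quota), the alarm must be disarmed explicitly
      $ oc exec -n $HCP_NS etcd-2 -c etcd -- etcdctl alarm disarm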

              Unassigned
              krvoora-ocm Harsha Voora (Inactive)
              Ge Liu