  1. OpenShift Bugs
  2. OCPBUGS-39190

Rollback etcd-to-ephemeral procedure fails

    • -
    • Important
    • None
    • False

      Description of problem:

      The etcd-to-ephemeral procedure was applied successfully using the following steps:

      • Deploy the masters without the ephemeral attribute in the master flavors.
      • Apply the MachineConfig 98-var-lib-etcd.
      • Change the CPMS to use a flavor that includes the ephemeral attribute set to 10G.

      The rollback then fails because an etcd pod enters CrashLoopBackOff.

      After the steps above, the directory /sysroot/ostree/deploy/rhcos/var/lib/etcd/ is empty, while the etcd data lives in /var/lib/etcd on the vdb disk. As a result, the rollback apparently cannot be performed:

      1. Rollback:

      [stack@undercloud-0 ~]$ oc delete -f 98-var-lib-etcd.yaml 
      machineconfig.machineconfiguration.openshift.io "98-var-lib-etcd" deleted

      2. The removal of the MachineConfig started on master-1, where the volume is no longer mounted:

      [stack@undercloud-0 ~]$ oc debug node/ostest-jnkbp-master-7cbwv-1 -- chroot /host lsblk
      Starting pod/ostest-jnkbp-master-7cbwv-1-debug-p8jw4 ...
      To use host binaries, run `chroot /host`
      NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
      vda    252:0    0   40G  0 disk 
      |-vda1 252:1    0    1M  0 part 
      |-vda2 252:2    0  127M  0 part 
      |-vda3 252:3    0  384M  0 part /boot
      `-vda4 252:4    0 39.5G  0 part /var
                                      /sysroot/ostree/deploy/rhcos/var
                                      /sysroot
                                      /usr
                                      /etc
                                      /
      vdb    252:16   0   10G  0 disk 
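
The listing above shows vdb with no mountpoint after the rollback. As a rough sketch (not part of the documented procedure), the same check can be scripted by filtering the default tree-style lsblk output. The function name `unmounted_disks` is hypothetical, and the parsing assumes the column layout shown above (NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS).

```shell
# Hypothetical helper: print top-level disks that have no mountpoint
# themselves and no mounted partitions. Reads default `lsblk` output on stdin.
unmounted_disks() {
  awk '
    $1 == "NAME" { next }                 # skip the header line
    /^[[:alnum:]]/ {                      # top-level device (no tree prefix)
      if (disk != "" && !mounted) print disk
      disk = ($6 == "disk") ? $1 : ""     # only track TYPE == disk
      mounted = (NF >= 7)                 # 7th column is a mountpoint
      next
    }
    NF > 0 {                              # partition or extra mountpoint line
      if (NF >= 7 || $1 ~ /^\//) mounted = 1
    }
    END { if (disk != "" && !mounted) print disk }
  '
}
```

For example, `oc debug node/ostest-jnkbp-master-7cbwv-1 -- chroot /host lsblk | unmounted_disks` would be expected to print `vdb` for the node above.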
       

      However, the etcd pod on that node is in CrashLoopBackOff:

      [stack@undercloud-0 ~]$ oc get pods -n openshift-etcd -l app=etcd
      NAME                               READY   STATUS             RESTARTS         AGE
      etcd-ostest-jnkbp-master-7cbwv-1   3/4     CrashLoopBackOff   22 (2m32s ago)   24h
      etcd-ostest-jnkbp-master-gqbz7-2   4/4     Running            0                24h
      etcd-ostest-jnkbp-master-vp9mr-0   4/4     Running            0                24h 
      [stack@undercloud-0 ~]$ oc logs -n openshift-etcd etcd-ostest-jnkbp-master-7cbwv-1
      82789fe40c55eb75, started, ostest-jnkbp-master-gqbz7-2, https://10.196.0.107:2380, https://10.196.0.107:2379, false
      8e8f484b15ae158f, started, ostest-jnkbp-master-vp9mr-0, https://10.196.0.182:2380, https://10.196.0.182:2379, false
      b4d7fd333dda7cb3, started, ostest-jnkbp-master-7cbwv-1, https://10.196.1.177:2380, https://10.196.1.177:2379, false
      #### attempt 0
            member={name="ostest-jnkbp-master-gqbz7-2", peerURLs=[https://10.196.0.107:2380}, clientURLs=[https://10.196.0.107:2379]
            member={name="ostest-jnkbp-master-vp9mr-0", peerURLs=[https://10.196.0.182:2380}, clientURLs=[https://10.196.0.182:2379]
            member={name="ostest-jnkbp-master-7cbwv-1", peerURLs=[https://10.196.1.177:2380}, clientURLs=[https://10.196.1.177:2379]
            target={name="ostest-jnkbp-master-7cbwv-1", peerURLs=[https://10.196.1.177:2380}, clientURLs=[https://10.196.1.177:2379]

      For reference, the nodes and their internal IPs are:

       [stack@undercloud-0 ~]$ oc get nodes -o wide
      NAME                          STATUS                     ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
      ostest-jnkbp-master-7cbwv-1   Ready                      control-plane,master   25h     v1.30.3   10.196.1.177   <none>        Red Hat Enterprise Linux CoreOS 417.94.202408170011-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.30.4-5.rhaos4.17.git95e494c.el9
      ostest-jnkbp-master-gqbz7-2   Ready,SchedulingDisabled   control-plane,master   25h     v1.30.3   10.196.0.107   <none>        Red Hat Enterprise Linux CoreOS 417.94.202408170011-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.30.4-5.rhaos4.17.git95e494c.el9
      ostest-jnkbp-master-vp9mr-0   Ready                      control-plane,master   26h     v1.30.3   10.196.0.182   <none>        Red Hat Enterprise Linux CoreOS 417.94.202408170011-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.30.4-5.rhaos4.17.git95e494c.el9
      ostest-jnkbp-worker-0-4hl2w   Ready                      worker                 2d18h   v1.30.3   10.196.2.70    <none>        Red Hat Enterprise Linux CoreOS 417.94.202408170011-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.30.4-5.rhaos4.17.git95e494c.el9
      ostest-jnkbp-worker-0-w7759   Ready                      worker                 2d18h   v1.30.3   10.196.3.100   <none>        Red Hat Enterprise Linux CoreOS 417.94.202408170011-0   5.14.0-427.33.1.el9_4.x86_64   cri-o://1.30.4-5.rhaos4.17.git95e494c.el9
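
To spot the failing member quickly in output like the `oc get pods` listing earlier, a small filter can flag any pod whose READY count is incomplete or whose STATUS is not Running. This is only a sketch: the function name `unready_pods` is hypothetical, and it assumes the default column order NAME, READY, STATUS.

```shell
# Hypothetical helper: print "<pod> <status>" for every pod that is not
# fully ready or not Running. Reads default `oc get pods` output on stdin.
unready_pods() {
  awk '
    $1 == "NAME" { next }                 # skip the header line
    NF >= 3 {
      split($2, ready, "/")               # READY column, e.g. "3/4"
      if (ready[1] != ready[2] || $3 != "Running")
        print $1, $3
    }
  '
}
```

Usage: `oc get pods -n openshift-etcd -l app=etcd | unready_pods`. Note that this simple rule would also report pods in states such as Completed, which is acceptable for the etcd label selector used here.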
       

      Version-Release number of selected component (if applicable):

      4.17.0-rc.0
      RHOS-17.1-RHEL-9-20240701.n.1

      How reproducible: Always

      Actual results: After the rollback, one etcd member is missing and the cluster shows warnings.

      Expected results: The procedure can be successfully rolled back.

      Additional info: A must-gather is attached in a private comment.

              rhn-gps-mbooth Matthew Booth
              rlobillo Ramón Lobillo
              Itshak Brown Itshak Brown