OpenShift Bugs / OCPBUGS-7821

RHEL CoreOS 9.2: Metal IPI Worker Nodes Rebooting During E2E Testing

      After clearing some past problems, we're now getting test results from metal-ipi on the payload testing that is attempting to land the critical RHEL CoreOS 9.2 bump.

      Apologies on the component here: I am not fully sure where to send this. There may be some overlap with storage, and we may be headed towards a kernel bump, but I do not know how to get enough data out of the metal runs to prove it decisively, and will need teams' help to route and investigate.

      The problem is visible in this run, but is most easily seen in this artifact.

      Looking at this chart, you will see a few interesting things:

      • several test failures in long-running storage tests (may be relevant?)
      • then a big batch of test failures all at the same time, spanning many components
      • several concurrent batches of pending TargetDown or daemonset-stuck alerts
      • several very short NotReady blips for worker nodes

      I believe these NotReady blips may be a key piece of this.
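
      For anyone digging through the artifacts, those blips should also show up as node events. A quick sketch of how I'd pull them (the "NodeNotReady" / "Rebooted" reason strings are the usual node-controller and kubelet ones, so treat the grep pattern as an assumption):

        # list node-scoped events in time order and pick out readiness flaps and reboots
        oc get events -A --field-selector involvedObject.kind=Node \
          --sort-by=.lastTimestamp | grep -Ei 'notready|rebooted'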

      I believe I see worker-0 spontaneously rebooting around the time of the 10:18pm batch of pending alerts and failing tests. Taken from its journal log:

      Feb 18 22:17:35.013156 worker-0 crio[1748]: time="2023-02-18 22:17:35.012955753Z" level=info msg="Stopped pod sandbox: 299d5073a2270301fc3a038e8de5e610908d3d7098295c285c9e15f0d315bb9c" id=0c317ddc-efe5-46c4-9765-e481c8bd5b32 name=/runtime.v1.RuntimeService/StopPodSandbox
      Feb 18 22:17:35.018421 worker-0 systemd[1]: var-lib-containers-storage-overlay-25c87b37df086dd308faa54eeb5afe11807e7f2e48749acef1aff6fddd5d7703-merged.mount: Deactivated successfully.
      Feb 18 22:17:35.018585 worker-0 systemd[1]: run-netns-d8c3584b\x2dc669\x2d435d\x2d97f3\x2d82548f50a00c.mount: Deactivated successfully.
      Feb 18 22:17:35.018661 worker-0 systemd[1]: run-ipcns-d8c3584b\x2dc669\x2d435d\x2d97f3\x2d82548f50a00c.mount: Deactivated successfully.
      Feb 18 22:17:35.018752 worker-0 systemd[1]: run-utsns-d8c3584b\x2dc669\x2d435d\x2d97f3\x2d82548f50a00c.mount: Deactivated successfully.
      Feb 18 22:17:35.018828 worker-0 systemd[1]: run-containers-storage-overlay\x2dcontainers-299d5073a2270301fc3a038e8de5e610908d3d7098295c285c9e15f0d315bb9c-userdata-shm.mount: Deactivated successfully.
      Feb 18 22:17:35.071975 worker-0 kubenswrapper[1888]: I0218 22:17:35.071898    1888 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"test-volume\" (UniqueName: \"kubernetes.io/local-volume/local-h2bmq\") pod \"b332da6f-8c0d-482c-ab59-9377eeebd490\" (UID: \"b332da6f-8c0d-482c-ab59-9377eeebd490\") "
      Feb 18 22:17:35.071975 worker-0 kubenswrapper[1888]: I0218 22:17:35.071981    1888 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"kube-api-access-pdrzv\" (UniqueName: \"kubernetes.io/projected/b332da6f-8c0d-482c-ab59-9377eeebd490-kube-api-access-pdrzv\") pod \"b332da6f-8c0d-482c-ab59-9377eeebd490\" (UID: \"b332da6f-8c0d-482c-ab59-9377eeebd490\") "
      Feb 18 22:17:35.072235 worker-0 kubenswrapper[1888]: I0218 22:17:35.072063    1888 reconciler_common.go:169] "operationExecutor.UnmountVolume started for volume \"liveness-probe-volume\" (UniqueName: \"kubernetes.io/empty-dir/b332da6f-8c0d-482c-ab59-9377eeebd490-liveness-probe-volume\") pod \"b332da6f-8c0d-482c-ab59-9377eeebd490\" (UID: \"b332da6f-8c0d-482c-ab59-9377eeebd490\") "
      Feb 18 22:17:35.072839 worker-0 kubenswrapper[1888]: W0218 22:17:35.072753    1888 empty_dir.go:525] Warning: Failed to clear quota on /var/lib/kubelet/pods/b332da6f-8c0d-482c-ab59-9377eeebd490/volumes/kubernetes.io~empty-dir/liveness-probe-volume: clearQuota called, but quotas disabled
      Feb 18 22:17:35.072839 worker-0 kubenswrapper[1888]: I0218 22:17:35.072793    1888 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/b332da6f-8c0d-482c-ab59-9377eeebd490-liveness-probe-volume" (OuterVolumeSpecName: "liveness-probe-volume") pod "b332da6f-8c0d-482c-ab59-9377eeebd490" (UID: "b332da6f-8c0d-482c-ab59-9377eeebd490"). InnerVolumeSpecName "liveness-probe-volume". PluginName "kubernetes.io/empty-dir", VolumeGidValue ""
      Feb 18 22:17:35.074901 worker-0 kubenswrapper[1888]: I0218 22:17:35.074862    1888 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/b332da6f-8c0d-482c-ab59-9377eeebd490-kube-api-access-pdrzv" (OuterVolumeSpecName: "kube-api-access-pdrzv") pod "b332da6f-8c0d-482c-ab59-9377eeebd490" (UID: "b332da6f-8c0d-482c-ab59-9377eeebd490"). InnerVolumeSpecName "kube-api-access-pdrzv". PluginName "kubernetes.io/projected", VolumeGidValue ""
      Feb 18 22:17:35.078874 worker-0 kubenswrapper[1888]: I0218 22:17:35.078833    1888 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/local-volume/local-h2bmq" (OuterVolumeSpecName: "test-volume") pod "b332da6f-8c0d-482c-ab59-9377eeebd490" (UID: "b332da6f-8c0d-482c-ab59-9377eeebd490"). InnerVolumeSpecName "local-h2bmq". PluginName "kubernetes.io/local-volume", VolumeGidValue "777"
      Feb 18 22:18:10.084843 localhost kernel: Linux version 5.14.0-267.el9.x86_64 (mockbuild@x86-05.stream.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #1 SMP PREEMPT_DYNAMIC Tue Feb 14 00:19:32 UTC 2023
      Feb 18 22:18:10.084865 localhost kernel: The list of certified hardware and cloud instances for Red Hat Enterprise Linux 9 can be viewed at the Red Hat Ecosystem Catalog, https://catalog.redhat.com.
      Feb 18 22:18:10.084878 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-8d4877976c6caf6686076f80473fe19984b1286b83c1390d37b53ae64537d859/vmlinuz-5.14.0-267.el9.x86_64 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/8d4877976c6caf6686076f80473fe19984b1286b83c1390d37b53ae64537d859/0 ip=dhcp root=UUID=103a79cd-4741-4e96-891d-50647c6034f5 rw rootflags=prjquota boot=UUID=ee597f4c-9b43-4d0b-b3f3-a62dbe383528 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1
      Feb 18 22:18:10.084886 localhost kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
      Feb 18 22:18:10.084893 localhost kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
      Feb 18 22:18:10.084900 localhost kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
      Feb 18 22:18:10.084909 localhost kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
      Feb 18 22:18:10.084916 localhost kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
      Feb 18 22:18:10.084923 localhost kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
      Feb 18 22:18:10.084931 localhost kernel: x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
      

      This looks to me like we're humming along running storage tests, and suddenly our machine is rebooting. AFAIK we do not have console logging anymore, as this was disabled for metal-ipi recently, so we may not be able to see the exact details of the crash.
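
      If anyone can still get at these nodes (or their full journals), a couple of standard journalctl checks would confirm whether this was a clean shutdown or a hard crash, nothing metal-ipi specific:

        # enumerate boots; an unexpected extra boot ID confirms the reboot
        journalctl --list-boots
        # tail of the previous boot: a clean reboot ends with systemd shutdown
        # messages, while a crash cuts off mid-stream like the excerpt above
        journalctl -b -1 -n 50 --no-pager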

      worker-2 shows a similar spontaneous reboot in its journal logs, again around the time of a batch of alerts and failed tests:

      Feb 18 22:36:53.720876 worker-2 kubenswrapper[2094]: E0218 22:36:53.720843    2094 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
      Feb 18 22:36:55.572897 worker-2 kubenswrapper[2094]: I0218 22:36:55.572833    2094 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/keepalived-worker-2" status=Running
      Feb 18 22:36:55.572897 worker-2 kubenswrapper[2094]: I0218 22:36:55.572886    2094 kubelet_getters.go:182] "Pod status updated" pod="openshift-kni-infra/coredns-worker-2" status=Running
      Feb 18 22:36:55.721335 worker-2 kubenswrapper[2094]: E0218 22:36:55.721303    2094 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
      Feb 18 22:36:57.722385 worker-2 kubenswrapper[2094]: E0218 22:36:57.722353    2094 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
      Feb 18 22:36:59.721248 worker-2 kubenswrapper[2094]: E0218 22:36:59.721210    2094 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
      Feb 18 22:37:00.533678 worker-2 kubenswrapper[2094]: I0218 22:37:00.533602    2094 logs.go:323] "Finished parsing log file" path="/var/log/pods/e2e-volume-7441_nfs-server_a610504b-0d32-4379-9da6-5a549275a81f/nfs-server/1.log"
      Feb 18 22:37:01.726164 worker-2 kubenswrapper[2094]: E0218 22:37:01.722391    2094 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
      Feb 18 22:37:03.720865 worker-2 kubenswrapper[2094]: E0218 22:37:03.720819    2094 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1
      Feb 18 22:37:25.108849 localhost kernel: Linux version 5.14.0-267.el9.x86_64 (mockbuild@x86-05.stream.rdu2.redhat.com) (gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), GNU ld version 2.35.2-37.el9) #1 SMP PREEMPT_DYNAMIC Tue Feb 14 00:19:32 UTC 2023
      Feb 18 22:37:25.108869 localhost kernel: The list of certified hardware and cloud instances for Red Hat Enterprise Linux 9 can be viewed at the Red Hat Ecosystem Catalog, https://catalog.redhat.com.
      Feb 18 22:37:25.108882 localhost kernel: Command line: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-8d4877976c6caf6686076f80473fe19984b1286b83c1390d37b53ae64537d859/vmlinuz-5.14.0-267.el9.x86_64 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/8d4877976c6caf6686076f80473fe19984b1286b83c1390d37b53ae64537d859/0 ip=dhcp root=UUID=1c376c2a-7da2-4206-90d7-3d745ce86ba9 rw rootflags=prjquota boot=UUID=a0d31a25-ff92-4955-af7c-d6c9fec7641a systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1
      Feb 18 22:37:25.108889 localhost kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
      Feb 18 22:37:25.108895 localhost kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
      Feb 18 22:37:25.108901 localhost kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
      Feb 18 22:37:25.108908 localhost kernel: x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
      Feb 18 22:37:25.108914 localhost kernel: x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
      Feb 18 22:37:25.108919 localhost kernel: x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
      
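      Side note on those repeating "orphaned pod ... not a directory" errors: they suggest the pod's volumes path survived the reboot in a broken state, i.e. something under /var/lib/kubelet/pods/<uid>/volumes is a plain file where the kubelet expects a directory. If anyone is on worker-2, something like this would show what is actually there (UID taken from the log above):

        # inspect the orphaned pod's volume dir the kubelet cannot clean up
        ls -la /var/lib/kubelet/pods/bbb3f8c5-441e-4e7c-aa61-23a0d3ae80a9/volumes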

      AFAIK we should not be rebooting nodes intentionally during non-upgrade e2e testing. These reboots feel like crashes, and are the likely reason we're getting test failures.
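
      One way to rule out an intentional reboot: the machine-config daemon logs it explicitly when it reboots a node, so grepping its logs over this window should settle it (sketch; assumes the usual daemonset and container names in openshift-machine-config-operator):

        # an MCO-initiated reboot would be logged by the machine-config-daemon pod
        # on the affected node (ds/ picks an arbitrary pod, so target worker-0's pod)
        oc -n openshift-machine-config-operator logs ds/machine-config-daemon \
          -c machine-config-daemon | grep -i reboot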

      Storage does feel involved in both cases.

      Metal folks, is there any way we could get console logs to see the details on this crash?
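
      In the meantime, if console logs stay unavailable, kdump might catch the panic. A minimal sketch of the documented MachineConfig approach, applied with oc apply -f (the object name and crashkernel size are placeholder choices, and the workers would need that memory to spare):

        # 99-worker-kdump.yaml -- reserves crash memory and enables kdump.service
        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfig
        metadata:
          name: 99-worker-kdump
          labels:
            machineconfiguration.openshift.io/role: worker
        spec:
          kernelArguments:
            - crashkernel=256M
          config:
            ignition:
              version: 3.2.0
            systemd:
              units:
                - name: kdump.service
                  enabled: true

      One caveat: rolling this out will itself reboot the workers via the MCO, which is worth remembering before blaming further reboots on this bug.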

      I have another test run underway to verify whether this is repeatable.

              Derek Higgins (dhiggins@redhat.com)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Pedro Jose Amoedo Martinez