OpenShift Bugs / OCPBUGS-35538

[AWS] Failed to deploy compute node in Local Zone us-east-1-iah-2a

    • Important
    • No
    • 3
    • OpenShift SPLAT - Sprint 256, OpenShift SPLAT - Sprint 257, OpenShift SPLAT - Sprint 258, OpenShift SPLAT - Sprint 259, OpenShift SPLAT - Sprint 260, OpenShift SPLAT - Sprint 261, OpenShift SPLAT - Sprint 262, OpenShift SPLAT - Sprint 263, OpenShift SPLAT - Sprint 265
    • 9
    • Rejected
    • False

      Description of problem:

us-east-1-iah-2a is a new Local Zone, released in February 2024. [1]

The EC2 instance was created successfully and is running in us-east-1-iah-2a, but the Machine is stuck in the Provisioned state:
      
      openshift-machine-api   yunjiang-lz1iah2-f9txv-edge-us-east-1-iah-2a-z6lfg   Provisioned   m6i.xlarge   us-east-1   us-east-1-iah-2a   176m
      
Checking services on the machine, the kubelet-dependencies.target unit has not started:
      
○ kubelet-dependencies.target - Dependencies necessary to run kubelet
       	Loaded: loaded (/etc/systemd/system/kubelet-dependencies.target; static)
       	Active: inactive (dead)
         	Docs: https://github.com/openshift/machine-config-operator/
      
      [1] https://aws.amazon.com/about-aws/whats-new/2024/02/aws-local-zone-houston/
      
       

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-2024-06-13-084629
       

      How reproducible:

      Always
       

      Steps to Reproduce:

1. Create a cluster with the following install-config:
      
 compute:
 - architecture: amd64
   hyperthreading: Enabled
   name: worker
   platform: {}
   replicas: 3
 - architecture: amd64
   hyperthreading: Enabled
   name: edge
   platform:
     aws:
       zones:
       - us-east-1-iah-2a
   replicas: 1
 metadata:
   name: yunjiang-lz1iah2
 platform:
   aws:
     region: us-east-1
      

      Actual results:

The machine created in us-east-1-iah-2a is stuck in the Provisioned state.
       

      Expected results:

No issues while deploying a node in us-east-1-iah-2a.
       

      Additional info:

No issues with a 4.15 (Terraform-based) install.
       

  Attachments:
  1. image-2024-09-10-18-22-38-643.png (14 kB)
  2. image-2024-09-10-18-26-11-059.png (67 kB)
  3. image-2024-09-26-12-29-59-344.png (63 kB)
  4. journalctl.log.txt (2.66 MB)
  5. rdsosreport.txt (84 kB)
  6. rdsosreport-1.txt (88 kB)
  7. rdsosreport-2.txt (88 kB)
  8. Screenshot from 2024-09-26 12-22-39.png (66 kB)
  9. Screenshot from 2024-09-26 12-22-39-1.png (66 kB)
  10. Screenshot from 2024-09-26 12-22-53.png (63 kB)

            Comments

            Julio Faerman added a comment

            It seems that this bug is no longer reproducible; I just tested a few clusters.

            As it doesn't seem to be active, I believe it's safe to close; let me know otherwise.

            Julio Faerman added a comment (edited)

            Also, I verified that using another instance type works, as in other Local Zones.
            I will propose a PR changing the default instance type for this Local Zone as a workaround while we troubleshoot the underlying issue.

            Julio Faerman added a comment

            This issue is still active. I'll proceed with scripting a UPI install and try to collect the debug information from Ignition.

            Julio Faerman added a comment

            I'll continue working towards a reproducer, as suggested in the other comments.

            However, we have enough evidence to indicate something might be wrong on the EBS/AWS side in that region, so I'll look into opening a ticket with them as well.

            Jonathan Lebon added a comment

            This is a tough one to debug because it involves (virtualized) hardware, udev, and early-boot behavior. It might be helpful to have udev print more debug information. Thankfully, since it happens in the real root, one thing you could do is add a day-1 MC (or add it to the Ignition config yourself) that writes `/etc/udev/udev.conf` with "udev_log=debug" to increase the verbosity of udev.
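            A minimal sketch of such a day-1 MachineConfig, assuming the worker role and Ignition spec 3.4.0 (the manifest name and role label are illustrative; edge-pool nodes may use a different role label):

            ```yaml
            # Hypothetical day-1 MachineConfig: raises udev logging to debug by
            # writing /etc/udev/udev.conf on nodes in the targeted pool.
            apiVersion: machineconfiguration.openshift.io/v1
            kind: MachineConfig
            metadata:
              name: 99-worker-udev-debug          # illustrative name
              labels:
                machineconfiguration.openshift.io/role: worker   # assumed role
            spec:
              config:
                ignition:
                  version: 3.4.0
                storage:
                  files:
                  - path: /etc/udev/udev.conf
                    mode: 0644
                    overwrite: true
                    contents:
                      # URL-encoded "udev_log=debug\n"
                      source: data:,udev_log%3Ddebug%0A
            ```

            Placed in the install manifests (or merged into the Ignition config directly), this would take effect on first boot, before the failure window.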

            Julio Faerman added a comment (edited)

            Here are the updated results with newer releases (counts out of 10 machines):

            Version | Zone             | Provisioned | Running
            4.16.13 | us-east-1-iah-2a |           3 |       7
            4.16.13 | us-east-1-iah-1a |           0 |      10
            4.17.0  | us-east-1-iah-2a |           1 |       9
            4.17.0  | us-east-1-iah-1a |           0 |      10

            From that, we can see that it fails less often on the newer release, and not at all in other Local Zones.

            So it's probably an infrastructure issue, but one that we could probably work around.

            However, I'm not sure how to go about finding the root cause. Clusters are created with mostly default configurations, and the issue is easy to reproduce inside OCP, but I'm not sure how we could reproduce it outside OCP if that became necessary.

            Any suggestions for next steps?

            Julio Faerman added a comment

            jlebon1@redhat.com Totally odd... I'll run another test with new releases and update the numbers.

            What I can see is that it fails a lot in that zone with 4.16, a lot less with 4.17, and not at all in other zones.

            I also tried launching the instances outside OCP, with the cluster provisioned and up, same user-data and all, but that did not work well.

             

            Jonathan Lebon added a comment

            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device: Job dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device/start timed out.
            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: Timed out waiting for device /dev/disk/by-uuid/29248b10-c5ef-4755-b887-40a4064730ba.

            Hmm, very odd. I'm not sure what's going on here. I tried reproducing this using the same AMI and instance type with --count 10, but couldn't. This might be some kind of udev race somewhere, but it's hard to tell. The device shows up fine in the initramfs and we can e.g. mount it, but then it seems to be absent in the real root.

            And you're saying that this doesn't reproduce in other availability zones?

            Julio Faerman added a comment

            I connected to the serial console on first boot and collected the attached screenshots of the issue.

            I would need some help to determine why we get "[ TIME ] Timed out waiting for device 4-2c91-4a9b-ae48-0d6fd3d045cd." even though the EBS disk is provisioned and attached correctly.

            I'll try opening a separate ticket, but I wasn't able to reproduce this issue outside OCP, so I will probably need some direct collaboration to make progress on this issue.

            Julio Faerman added a comment (edited)

            Here's the journalctl output: journalctl.log.txt

            Line 1648:

            Sep 25 13:03:15 ip-10-0-112-55 systemd-udevd[1531]: Using default interface naming scheme 'rhel-9.0'.
            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device: Job dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device/start timed out.
            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: Timed out waiting for device /dev/disk/by-uuid/29248b10-c5ef-4755-b887-40a4064730ba.
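            As an aside, the `\x2d` runs in these unit names come from systemd's unit-name escaping: `/` in the device path becomes `-`, and other non-alphanumeric bytes (including the literal `-` characters of the UUID) become `\xXX`. A small sketch of that mapping (my own approximation of the rule, not systemd's actual implementation) reproduces the unit name seen in the journal:

            ```python
            # Approximate systemd path-to-unit-name escaping: strip leading/trailing
            # slashes, map '/' to '-', keep [A-Za-z0-9:_] and non-leading '.',
            # and hex-escape everything else as \xXX.
            def systemd_escape_path(path: str) -> str:
                path = path.strip("/")
                out = []
                for i, ch in enumerate(path):
                    if ch == "/":
                        out.append("-")
                    elif ch.isalnum() or ch in "_:" or (ch == "." and i > 0):
                        out.append(ch)
                    else:
                        out.append("\\x%02x" % ord(ch))
                return "".join(out)

            print(systemd_escape_path("/dev/disk/by-uuid/29248b10-c5ef-4755-b887-40a4064730ba"))
            # prints dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba
            # i.e. the unit name from the log, minus the ".device" suffix systemd appends
            ```
            
            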

              faermanj Julio Faerman
              yunjiang-1 Yunfei Jiang