OpenShift Bugs / OCPBUGS-35538

[AWS] Failed to deploy compute node in Local Zone us-east-1-iah-2a

    • Important
    • No
    • 3
    • OpenShift SPLAT - Sprint 256, OpenShift SPLAT - Sprint 257, OpenShift SPLAT - Sprint 258, OpenShift SPLAT - Sprint 259, OpenShift SPLAT - Sprint 260, OpenShift SPLAT - Sprint 261, OpenShift SPLAT - Sprint 262, OpenShift SPLAT - Sprint 263, OpenShift SPLAT - Sprint 265
    • 9
    • Rejected
    • False

      Description of problem:

us-east-1-iah-2a is a new Local Zone, released in February 2024. [1]

The EC2 instance was created successfully and is running in us-east-1-iah-2a, but the Machine is stuck in the Provisioned state:
      
      openshift-machine-api   yunjiang-lz1iah2-f9txv-edge-us-east-1-iah-2a-z6lfg   Provisioned   m6i.xlarge   us-east-1   us-east-1-iah-2a   176m
      
Checking services on the machine, the kubelet-dependencies.target unit has not started:
      
○ kubelet-dependencies.target - Dependencies necessary to run kubelet
       	Loaded: loaded (/etc/systemd/system/kubelet-dependencies.target; static)
       	Active: inactive (dead)
         	Docs: https://github.com/openshift/machine-config-operator/
      
      [1] https://aws.amazon.com/about-aws/whats-new/2024/02/aws-local-zone-houston/
      
       

      Version-Release number of selected component (if applicable):

      4.16.0-0.nightly-2024-06-13-084629
       

      How reproducible:

      Always
       

      Steps to Reproduce:

1. Create a cluster with the following install-config:
      
 compute:
 - architecture: amd64
   hyperthreading: Enabled
   name: worker
   platform: {}
   replicas: 3
 - architecture: amd64
   hyperthreading: Enabled
   name: edge
   platform:
     aws:
       zones:
       - us-east-1-iah-2a
   replicas: 1
 metadata:
   name: yunjiang-lz1iah2
 platform:
   aws:
     region: us-east-1
      

      Actual results:

The machine created in us-east-1-iah-2a is stuck in the Provisioned state.
       

      Expected results:

No issues while deploying a node in us-east-1-iah-2a.
       

      Additional info:

No issues with a 4.15 (Terraform-based) install.
       

  Attachments:
  1. image-2024-09-10-18-22-38-643.png (14 kB)
  2. image-2024-09-10-18-26-11-059.png (67 kB)
  3. image-2024-09-26-12-29-59-344.png (63 kB)
  4. journalctl.log.txt (2.66 MB)
  5. rdsosreport.txt (84 kB)
  6. rdsosreport-1.txt (88 kB)
  7. rdsosreport-2.txt (88 kB)
  8. Screenshot from 2024-09-26 12-22-39.png (66 kB)
  9. Screenshot from 2024-09-26 12-22-39-1.png (66 kB)
  10. Screenshot from 2024-09-26 12-22-53.png (63 kB)

            Comments

            Julio Faerman added a comment

            It seems that this bug is no longer reproducible; I just tested a few clusters.

            As it doesn't seem to be active, I believe it's safe to close; let me know otherwise.

            Julio Faerman added a comment (edited)

            Also, I verified that using another instance type works, as in other Local Zones.
            I will propose a PR changing the default instance type for this Local Zone as a workaround while we troubleshoot the underlying issue.

            Julio Faerman added a comment

            This issue is still active. I'll proceed with scripting a UPI install and try to collect the debug information from Ignition.

            Julio Faerman added a comment

            I'll continue working towards a reproducer, as suggested in the other comments.

            However, we have enough evidence to indicate something might be wrong on the EBS/AWS side in that region, so I'll look into opening a ticket with them as well.

            Jonathan Lebon added a comment

            This is a tough one to debug because it involves (virtualized) hardware, udev, and early-boot behavior. It might be helpful to have udev print more debug information. Thankfully, since it happens in the real root, one thing you could do is add a day-1 MC (or add it to the Ignition config yourself) that writes `/etc/udev/udev.conf` with "udev_log=debug" to increase the verbosity of udev.
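            A minimal sketch of such a day-1 MachineConfig, assuming the worker role and Ignition spec 3.4.0 (the manifest name and role label are illustrative; edge-pool nodes may use a different role label):

            ```yaml
            # Hypothetical day-1 MachineConfig: raises udev logging to debug by
            # writing /etc/udev/udev.conf on nodes in the targeted pool.
            apiVersion: machineconfiguration.openshift.io/v1
            kind: MachineConfig
            metadata:
              name: 99-worker-udev-debug          # illustrative name
              labels:
                machineconfiguration.openshift.io/role: worker   # assumed role
            spec:
              config:
                ignition:
                  version: 3.4.0
                storage:
                  files:
                  - path: /etc/udev/udev.conf
                    mode: 0644
                    overwrite: true
                    contents:
                      # URL-encoded "udev_log=debug\n"
                      source: data:,udev_log%3Ddebug%0A
            ```

            Placed in the install manifests (or merged into the Ignition config directly), this would take effect on first boot, before the failure window.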

            Julio Faerman added a comment (edited)

            Here are the updated results with newer releases (counts out of 10 machines):

            Version | Zone             | Provisioned | Running
            4.16.13 | us-east-1-iah-2a |           3 |       7
            4.16.13 | us-east-1-iah-1a |           0 |      10
            4.17.0  | us-east-1-iah-2a |           1 |       9
            4.17.0  | us-east-1-iah-1a |           0 |      10

            From that, we can see that it fails less often on the newer release, and not at all in other Local Zones.

            So it's probably an infrastructure issue, but one that we could probably work around.

            However, I'm not sure how to go about finding the root cause. Clusters are created with mostly default configurations, and the issue is easy to reproduce inside OCP, but I'm not sure how we could reproduce it outside OCP if that became necessary.

            Any suggestions for next steps?

            Julio Faerman added a comment

            jlebon1@redhat.com Totally odd... I'll run another test with new releases and update the numbers.

            What I can see is that it fails a lot in that zone with 4.16, a lot less with 4.17, and not at all in other zones.

            I also tried launching the instances outside OCP, with the cluster provisioned and up, same user-data and all, but that did not work well.

             

            Jonathan Lebon added a comment

            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device: Job dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device/start timed out.
            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: Timed out waiting for device /dev/disk/by-uuid/29248b10-c5ef-4755-b887-40a4064730ba.

            Hmm, very odd. I'm not sure what's going on here. I tried reproducing this using the same AMI and instance type with --count 10, but couldn't. This might be some kind of udev race somewhere, but it's hard to tell. The device shows up fine in the initramfs and we can e.g. mount it, but then it seems to be absent in the real root.

            And you're saying that this doesn't reproduce in other availability zones?

            Julio Faerman added a comment

            I connected to the serial console on first boot and collected the attached screenshots of the issue.

            I would need some help to determine why we get "[ TIME ] Timed out waiting for device 4-2c91-4a9b-ae48-0d6fd3d045cd." even though the EBS disk is provisioned and attached correctly.

            I'll try opening a separate ticket, but I wasn't able to reproduce this issue outside OCP, so I will probably need some direct collaboration to make progress on this issue.

            Julio Faerman added a comment (edited)

            Here's the journalctl output: journalctl.log.txt

            Line 1648:

            Sep 25 13:03:15 ip-10-0-112-55 systemd-udevd[1531]: Using default interface naming scheme 'rhel-9.0'.
            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device: Job dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba.device/start timed out.
            Sep 25 13:03:43 ip-10-0-112-55 systemd[1]: Timed out waiting for device /dev/disk/by-uuid/29248b10-c5ef-4755-b887-40a4064730ba.
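            As an aside, the `\x2d` runs in these unit names come from systemd's unit-name escaping: `/` in the device path becomes `-`, and other non-alphanumeric bytes (including the literal `-` characters of the UUID) become `\xXX`. A small sketch of that mapping (my own approximation of the rule, not systemd's actual implementation) reproduces the unit name seen in the journal:

            ```python
            # Approximate systemd path-to-unit-name escaping: strip leading/trailing
            # slashes, map '/' to '-', keep [A-Za-z0-9:_] and non-leading '.',
            # and hex-escape everything else as \xXX.
            def systemd_escape_path(path: str) -> str:
                path = path.strip("/")
                out = []
                for i, ch in enumerate(path):
                    if ch == "/":
                        out.append("-")
                    elif ch.isalnum() or ch in "_:" or (ch == "." and i > 0):
                        out.append(ch)
                    else:
                        out.append("\\x%02x" % ord(ch))
                return "".join(out)

            print(systemd_escape_path("/dev/disk/by-uuid/29248b10-c5ef-4755-b887-40a4064730ba"))
            # prints dev-disk-by\x2duuid-29248b10\x2dc5ef\x2d4755\x2db887\x2d40a4064730ba
            # i.e. the unit name from the log, minus the ".device" suffix systemd appends
            ```
            
            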

              faermanj Julio Faerman
              yunjiang-1 Yunfei Jiang