Type: Bug
Resolution: Duplicate
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.13
Component/s: Machine Config Operator / platform-baremetal
Labels: None
Severity: Critical
Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
Target Version: 4.14.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of the issue

The OCP Bare Metal Networking Team has discovered that, under specific circumstances, it is not possible to SSH to an RHCOS node as the `core` user because `/run/nologin` is not removed. This happens, for example, when a service such as `nodeip-configuration.service` or `ovs-configure.service` hangs; in that scenario the network is up and running, but the systemd dependency chain is not satisfied.

Namely, `systemd-user-sessions.service` is never reached, so `/run/nologin` is not removed.
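
The condition described above can be checked from a console session with a small script. The following is a minimal sketch, not an official diagnostic. The helper names (`nologin_present`, `unit_is_active`, `describe`) are hypothetical, and it assumes that a Python 3 interpreter and `systemctl` are available where it runs, which may not be the case on a stock RHCOS host.

```python
import os
import subprocess


def nologin_present(path="/run/nologin"):
    """Return True if the nologin flag file exists at the given path."""
    return os.path.exists(path)


def unit_is_active(unit="systemd-user-sessions.service"):
    """Return True if `systemctl is-active` reports the unit as active.

    Returns False when systemctl is missing, so the sketch degrades gracefully.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


def describe(nologin, sessions_active):
    """Summarize the login state from the two observations (illustrative wording)."""
    if nologin and not sessions_active:
        return "blocked: systemd-user-sessions.service has not started, /run/nologin still present"
    if nologin:
        return "blocked: /run/nologin present although systemd-user-sessions.service is active"
    if not sessions_active:
        return "degraded: /run/nologin removed, but systemd-user-sessions.service is not active"
    return "ok: logins permitted"
```

On an affected node, `describe(nologin_present(), unit_is_active())` would be expected to report the first "blocked" case. With the simplified reproducer below, which removes `/run/nologin` by hand, it would report the "degraded" case instead.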

This became 100% reproducible in OCP 4.13. Our internal investigation (~~OCPBUGS-11124~~) points at `systemd-pcrphase.service` via `remote-fs.target` (or, less likely, at an internal systemd issue). Our expertise ends at this point, however, so I am providing what we discovered and how to reproduce the issue easily.

Simplified reproducer

0) Start with vanilla RHCOS coming from OCP 4.13

1) Create the following systemd unit (named `ocpbugs-11124-debugger.service`, as enabled in step 2)

[Unit]
Description=Removes nologin and pauses the startup for debugging of OCPBUGS-11124
Before=systemd-pcrphase.service

[Service]
Type=oneshot
ExecStart=/bin/bash -c " \
  rm /run/nologin; \
  echo Now sleeping for 1 hour; \
  sleep 3600"

[Install]
WantedBy=multi-user.target

2) Enable the created unit

systemctl enable ocpbugs-11124-debugger.service

3) Reboot

4) SSH to the node. Note that this may take up to 2 minutes, so be aware of your local client timeout

5) Confirm that the node is broken, e.g. by looking at the pam_systemd message in the sshd log

$ systemctl status sshd.service
[...]
Apr 05 10:01:20 worker-0 sshd[1801]: Accepted publickey for core from 192.168.111.1 port 47642 ssh2: RSA SHA256:rXsegwlyTMAN4UfInanm336lxrh+23J4iPyjiuXt4/g
Apr 05 10:03:20 worker-0 sshd[1801]: pam_systemd(sshd:session): Failed to create session: Connection timed out
Apr 05 10:03:20 worker-0 sshd[1801]: pam_unix(sshd:session): session opened for user core(uid=1000) by (uid=0)
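
The gap between the "Accepted publickey" line and the pam_systemd failure in the log above is what makes the login take so long in step 4. As a rough illustration, the delay can be computed from the log text. This is a sketch only: the function names are hypothetical, and it assumes the short journal timestamp format shown above (no year, fixed 15-character prefix).

```python
from datetime import datetime

# Log lines copied from the sshd status output above.
SAMPLE_LOG = """\
Apr 05 10:01:20 worker-0 sshd[1801]: Accepted publickey for core from 192.168.111.1 port 47642 ssh2: RSA SHA256:rXsegwlyTMAN4UfInanm336lxrh+23J4iPyjiuXt4/g
Apr 05 10:03:20 worker-0 sshd[1801]: pam_systemd(sshd:session): Failed to create session: Connection timed out
Apr 05 10:03:20 worker-0 sshd[1801]: pam_unix(sshd:session): session opened for user core(uid=1000) by (uid=0)
"""


def parse_timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp; the year is absent in this format."""
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def session_delay_seconds(log_text):
    """Return seconds between key acceptance and the pam_systemd failure, or None."""
    accepted = None
    for line in log_text.splitlines():
        if "Accepted publickey" in line:
            accepted = parse_timestamp(line)
        elif "pam_systemd(sshd:session): Failed to create session" in line:
            if accepted is not None:
                return (parse_timestamp(line) - accepted).total_seconds()
    return None


print(session_delay_seconds(SAMPLE_LOG))  # 120.0
```

For the sample above the result is 120 seconds, consistent with the "up to 2 minutes" wait mentioned in step 4.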

More real-life reproducer

Please note that the reproducer above is a significant simplification, because it explicitly blocks `systemd-pcrphase.service`. In a real OCP deployment in the field, what we want is to plug a sleep into `nodeip-configuration.service` so that the unit behaves as if it runs for a long time instead of exiting immediately.

In real life, /etc/systemd/system/nodeip-configuration.service should look roughly like this

[...]
ExecStart=/bin/bash -c " \
  rm /run/nologin; \
  sleep 3600; \
  until \
  /usr/bin/podman run --rm \
[...]

and then do everything as usual. With this modification (instead of creating a new unit) we do not change any systemd dependency or ordering chain, so the investigation matches what would happen in the field.

Systemd analysis

Discussing with the systemd folks, we discovered the following chain of dependencies: nodeip-configuration -> ovs-configure -> network-online.target -> remote-fs.target -> systemd-pcrphase.service -> systemd-user-sessions.service
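
To make the consequence of this chain explicit, it can be modelled as a small ordering graph. The sketch below is a deliberate simplification: it encodes only the edges listed above (real systemd ordering involves many more units), assumes the first two entries are the `.service` units named in the description, and uses a hypothetical helper name, `blocked_units`.

```python
from collections import deque

# Each unit maps to the units ordered after it, per the chain described above.
ORDERED_BEFORE = {
    "nodeip-configuration.service": ["ovs-configure.service"],
    "ovs-configure.service": ["network-online.target"],
    "network-online.target": ["remote-fs.target"],
    "remote-fs.target": ["systemd-pcrphase.service"],
    "systemd-pcrphase.service": ["systemd-user-sessions.service"],
}


def blocked_units(hung_unit, edges=ORDERED_BEFORE):
    """Return every unit that cannot start while hung_unit has not finished.

    Breadth-first walk over the ordering edges, starting from the hung unit.
    """
    blocked = []
    seen = {hung_unit}
    queue = deque([hung_unit])
    while queue:
        unit = queue.popleft()
        for successor in edges.get(unit, []):
            if successor not in seen:
                seen.add(successor)
                blocked.append(successor)
                queue.append(successor)
    return blocked


for unit in blocked_units("nodeip-configuration.service"):
    print(unit)
```

In this model, a hang in either `nodeip-configuration.service` or `ovs-configure.service` leaves `systemd-user-sessions.service` in the blocked set, which is why `/run/nologin` is never removed.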

We still do not fully understand what changed between RHEL8 and RHEL9, or why.

Severity assessment

OCP nodes have only the `core` user available; root is disabled by design. The consequence of the issue outlined here is that if something goes wrong with the network configuration (due to user error or a bug in one of the OCP Networking components), we can no longer SSH to the faulty node. The only remaining way to recover access is single-user mode via the physical console, which is a significant limitation and often not possible.

Ongoing discussions

#systemd-rhel – https://redhat-internal.slack.com/archives/C04NX2E8CDD/p1680690257320539
#forum-rhel-coreos – https://redhat-internal.slack.com/archives/C999USB0D/p1680181657578689

duplicates

OCPBUGS-11124 configure-ovs blocks ssh access to the node when unhealthy

Closed

relates to

OCPBUGS-11124 configure-ovs blocks ssh access to the node when unhealthy

Closed

Assignee: Mat Kowalski
Reporter: Mat Kowalski
QA Contact: Michael Nguyen
Votes: 0
Watchers: 6
Created: 2023/04/05 10:18 AM
Updated: 2023/05/30 8:14 AM
Resolved: 2023/05/30 8:14 AM
