We have discovered that in RHEL9 the ordering of systemd dependencies for configure-ovs, nodeip-configuration, or a similar service has changed. These services now appear to run before systemd-user-sessions.service; as a consequence, if one of them fails when the machine starts, we cannot log in to the RHCOS node at all.
The outline of what happens is roughly:
- early in the boot process, systemd-tmp* creates a lock file saying "only the root user can log in"
- late in the boot process, systemd-user-sessions.service runs and is responsible for allowing everyone else to log in (namely, the "core" user)
- if anything goes wrong and systemd-user-sessions.service doesn't start, then only "root" can get SSH access to the machine
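The sequence above can be modelled with a small sketch. It assumes the lock file is the usual `/run/nologin` flag that a stock systemd setup creates at boot and that systemd-user-sessions.service removes, and that the login stack refuses non-root users while the file exists; this report does not confirm either detail for RHCOS. The function names and the temporary directory are hypothetical, used only so the example runs without touching the real system.

```python
import os
import tempfile


def login_allowed(uid: int, nologin_path: str = "/run/nologin") -> bool:
    """Model of the login gate: root always passes, everyone else
    is refused while the lock file exists (assumed behaviour)."""
    if uid == 0:
        return True
    return not os.path.exists(nologin_path)


def simulate_boot(user_sessions_started: bool, workdir: str) -> str:
    """Hypothetical boot model: the lock file is created early, and
    removed only if systemd-user-sessions.service gets to run."""
    path = os.path.join(workdir, "nologin")
    # Early boot: the lock file is created.
    with open(path, "w") as f:
        f.write("System is booting up.\n")
    # Late boot: removed only when the user-sessions unit starts.
    if user_sessions_started:
        os.remove(path)
    return path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    # A unit ordered before user-sessions failed, so it never started.
    lock = simulate_boot(user_sessions_started=False, workdir=workdir)
    print("root allowed:", login_allowed(0, lock))
    print("core allowed:", login_allowed(1000, lock))
```

In this model, a failure in any unit ordered before systemd-user-sessions.service leaves the lock file in place, which matches the symptom described: the network is reachable, but `ssh core@<node>` is refused.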
Until now we had never observed the issue, even when configure-ovs was unhealthy. In 4.13, however, something changed: as long as configure-ovs has not finished successfully, we cannot do `ssh core@<node>`. Given that we don't allow root access, in those scenarios we are locked out of performing any investigation.
In the particular scenario I was debugging, nodeip-configuration.service was failing because it could not correctly detect the Node IP from the VIPs. It was trying to select an empty IP as the Node IP and therefore returned a non-zero exit code. The network was up, as I could ping the machine and reach its SSH port (the machine had multiple NICs, making it effectively impossible to lose the network), but because I could never SSH in as core and the root user is locked, I was not able to collect any logs.
- clones: OCPBUGS-11124 configure-ovs blocks ssh access to the node when unhealthy (Closed)
- depends on: OCPBUGS-11124 configure-ovs blocks ssh access to the node when unhealthy (Closed)
- links to