-
Bug
-
Resolution: Done-Errata
-
Minor
-
None
-
4.13.0
-
+
-
No
-
MCO Sprint 233, MCO Sprint 234, MCO Sprint 235, MCO Sprint 236, MCO Sprint 237, MCO Sprint 238
-
6
-
False
-
Description of problem:
The MCD interprets the deletion of the only MachineConfig that contains a PasswdUser as the cluster admin requesting that the PasswdUser be deleted, which is incorrect. If an OpenShift cluster is installed without an SSH key in the install-config.yaml file and an SSH key is later added via a MachineConfig, deletion of that MachineConfig will cause the MachineConfigPool / node to degrade. Alternatively, if the SSH keys that an OpenShift cluster was installed with are deleted by removing the auto-generated MachineConfigs (e.g., 99-master-ssh-key / 99-worker-ssh-key), this behavior would be reproduced.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster without adding an SSH key to the install-config.yaml file. 2. Create a new MachineConfig containing only the necessary configuration to add an SSH key. It could look like: --- apiVersion: machineconfiguration.openshift.io/v1 kind: MachineConfig metadata: labels: machineconfiguration.openshift.io/role: worker name: worker-ssh-key spec: config: ignition: config: {} security: tls: {} timeouts: {} version: 3.2.0 passwd: users: - name: core sshAuthorizedKeys: - "ssh-rsa ..." 3. Apply the MachineConfig and wait for it to roll out to the MachineConfigPool. 4. Delete the MachineConfig.
An alternate manifestation of this is as follows: 1. Install a new OpenShift cluster which includes an SSH key in the install-config.yaml file. 2. Delete the 99-master-ssh-key / 99-worker-ssh-key MachineConfigs.
Actual results:
The affected MachineConfigPool will degrade thusly: Node ip-10-0-134-200.ec2.internal is reporting: "can''t reconcile config rendered-infra-4092d44329512507f9e54bec303543bd with rendered-infra-d4a0c9ad0d556a53fca42ba8be46d40e: ignition passwd user section contains unsupported changes: user core may not be deleted"'
Expected results:
The MachineConfig containing the SSH key should have been deleted without degradation.
Additional info:
This is a manifestation of https://bugzilla.redhat.com/show_bug.cgi?id=1932638. It is worth mentioning that a workaround exists where one edits the MachineConfig to remove the SSH keys while keeping the PasswdUser object around. This will allow the MachineConfig to pass validation while also removing the SSH keys from the authorized_keys file. However, I believe this isn't the most user-friendly workaround.
Two things are setting up this particular case:
- If the cluster is brought up without an SSH key configured, no MachineConfig is created that creates the PasswdUser object as shown above. This is supported by the fact that the rendered MachineConfig does not contain a PasswdUser object.
- If the SSH key is later added, the PasswdUser is created by the cluster admin to contain the SSH keys. This object then gets added to the rendered MachineConfig.
The reason why the MCD is degrading is because it thinks that because the PasswdUser is no longer present, we're trying to delete it. However, this is incorrect behavior with respect to the Ignition specification: To delete a user, one must use the shouldExist: false field. Additionally, it is worth mentioning that the MCO does not support the addition or deletion of users.
Fixing this seems straightforward by adjusting the MCD's reconcilable code to ignore an empty PasswdUser field thusly:
diff --git a/pkg/daemon/update.go b/pkg/daemon/update.go index f41efe4a0..da0b8e51b 100644 --- a/pkg/daemon/update.go +++ b/pkg/daemon/update.go @@ -824,9 +824,7 @@ func reconcilable(oldConfig, newConfig *mcfgv1.MachineConfig) (*machineConfigDif return nil, fmt.Errorf("ignition Passwd Groups section contains changes") } if !reflect.DeepEqual(oldIgn.Passwd.Users, newIgn.Passwd.Users) { - if len(oldIgn.Passwd.Users) > 0 && len(newIgn.Passwd.Users) == 0 { - return nil, fmt.Errorf("ignition passwd user section contains unsupported changes: user core may not be deleted") - } + newEmpty := len(newIgn.Passwd.Users) == 0 // there is an update to Users, we must verify that it is ONLY making an acceptable // change to the SSHAuthorizedKeys for the user "core" for _, user := range newIgn.Passwd.Users { @@ -835,9 +833,11 @@ func reconcilable(oldConfig, newConfig *mcfgv1.MachineConfig) (*machineConfigDif } } - glog.Infof("user data to be verified before ssh update: %v", newIgn.Passwd.Users[len(newIgn.Passwd.Users)-1]) - if err := verifyUserFields(newIgn.Passwd.Users[len(newIgn.Passwd.Users)-1]); err != nil { - return nil, err + if !newEmpty { + glog.Infof("user data to be verified before ssh update: %v", newIgn.Passwd.Users[len(newIgn.Passwd.Users)-1]) + if err := verifyUserFields(newIgn.Passwd.Users[len(newIgn.Passwd.Users)-1]); err != nil { + return nil, err + } } } }
In this scenario, if the PasswdUser object is deleted, any SSH keys that were associated with it should also be deleted. If that singular PasswdUser object contains all of the clusters' SSH keys, that should leave the /home/core/.ssh/authorized_keys }}(or) {{/home/core/.ssh/authorized_keys.d/ignition file present on the node, but it should be empty. If a PasswordHash was set, this operation should clear the password hash from the /etc/shadow file (though any unrelated content in that file should be untouched).
Additional conversation / context may be found here: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1677847272957729.