Details
Bug
Resolution: Unresolved
Major
Low
Description
As recently discovered by a customer (see: case # 03661266), if the cluster's security policy enforces the "baseline" pod security admission level on the redhat-ods-applications namespace, the ODH Notebook Controller will be unable to start.
This is because the Pod Security Admission mutating webhook changes the deployment to set ".spec.template.spec.securityContext.runAsNonRoot" to "true", and CRI-O refuses to mount the image to verify that a named user is not UID 0. The image for this pod sets its user to the name "rhods" rather than a numeric UID such as "2000", which other images use. You can run the following commands to verify this:
$ # Select the 1.33.0 release of RHODS and output the image reference for the ODH Notebook Controller
$ podman run --rm --entrypoint sh registry.redhat.io/redhat/redhat-operator-index:v4.13 -c 'cat /configs/rhods-operator/catalog.json' | jq -r 'select(.name == "rhods-operator.1.33.0") | .relatedImages[] | select(.name == "odh_notebook_controller_image") | .image'
registry.redhat.io/rhods/odh-notebook-controller-rhel8@sha256:c1f27c9275e718375fc53573f1d99544b8e05b95fc1ab0643c53425feeeac76d
$ # Look at the user specified in the container image manifest
$ skopeo inspect --config docker://registry.redhat.io/rhods/odh-notebook-controller-rhel8@sha256:c1f27c9275e718375fc53573f1d99544b8e05b95fc1ab0643c53425feeeac76d | jq '.config.User'
"rhods"
$ # Compare that user to another image in the RHODS operator, such as the ModelMesh controller image (picked at random because it's the next image in the list)
$ podman run --rm --entrypoint sh registry.redhat.io/redhat/redhat-operator-index:v4.13 -c 'cat /configs/rhods-operator/catalog.json' | jq -r 'select(.name == "rhods-operator.1.33.0") | .relatedImages[] | select(.name == "odh_modelmesh_controller_image") | .image'
registry.redhat.io/rhods/odh-modelmesh-serving-controller-rhel8@sha256:dc25fb462552040d43159642c1220808d76cc60f604db5ae30fc6727128ab44b
$ skopeo inspect --config docker://registry.redhat.io/rhods/odh-modelmesh-serving-controller-rhel8@sha256:dc25fb462552040d43159642c1220808d76cc60f604db5ae30fc6727128ab44b | jq '.config.User'
"2000"
Although this image does, in fact, use a non-root user, it fails when the namespace is configured in this way. This isn't a normal or default configuration. An attempt to reproduce the customer's environment on a relatively bone-stock cluster (including through an OpenShift upgrade, as the customer indicated) did not exactly reproduce the issue. Even so, it should be trivial to fix, which would prevent the failure for any customer whose environment leads the SCC/PSA sync controller to label this namespace in this way.
As a workaround for anyone experiencing this issue, we were able to follow the documentation to disable SCC/PSA sync, manually set the namespace to enforce the "privileged" PSA level, and then remove the mutation from the deployment, which gave a successful rollout of the ODH Notebook Controller pod. See here for reference:
https://kubernetes.io/docs/concepts/security/pod-security-admission/
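The workaround steps might look roughly like the sketch below. It is not the exact sequence that was run for the customer. The deployment name odh-notebook-controller-manager is an assumption and should be checked on the cluster, and the helper name apply_workaround is hypothetical. By default the script only echoes the commands; set OC=oc to run them.

```shell
#!/bin/sh
# Sketch of the workaround; defaults to a dry run that prints each command.
OC="${OC:-echo oc}"
NS="${NS:-redhat-ods-applications}"
# Assumed deployment name; confirm it with "oc get deployments -n $NS".
DEPLOY="${DEPLOY:-odh-notebook-controller-manager}"

apply_workaround() {
    # Stop the SCC/PSA sync controller from managing this namespace's labels.
    $OC label namespace "$NS" security.openshift.io/scc.podSecurityLabelSync=false --overwrite
    # Enforce the "privileged" pod security level on the namespace.
    $OC label namespace "$NS" pod-security.kubernetes.io/enforce=privileged --overwrite
    # Remove the runAsNonRoot field that was added to the pod template.
    $OC patch deployment "$DEPLOY" -n "$NS" --type=json \
        -p '[{"op":"remove","path":"/spec/template/spec/securityContext/runAsNonRoot"}]'
    # Wait for the controller pod to roll out.
    $OC rollout status deployment "$DEPLOY" -n "$NS"
}
```

The JSON patch fails if the field is not present, so it is worth inspecting the deployment first. The operator may also reconcile the deployment afterwards, so the result should be verified rather than assumed.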