  1. Red Hat OpenShift AI Engineering
  2. RHOAIENG-72

[case 03661266] Under some circumstances, the ODH Notebook Controller pod is unable to start



    Description

      As recently discovered by a customer (see case 03661266), if the cluster's security policy enforces the "baseline" pod security admission level on the redhat-ods-applications namespace, the ODH Notebook Controller is unable to start.

      This is because the Pod Security Admission mutating webhook changes the deployment to set ".spec.template.spec.securityContext.runAsNonRoot" to "true", and CRI-O refuses to mount the image to check whether a named user is UID 0. The image for this pod sets its user to "rhods" instead of a numeric UID such as "2000", which other images use. You can run the following commands to verify this:

      $ # Select the 1.33.0 release of RHODS and output the image reference for the ODH Notebook Controller
      $ podman run --rm --entrypoint sh registry.redhat.io/redhat/redhat-operator-index:v4.13 -c 'cat /configs/rhods-operator/catalog.json' | jq -r 'select(.name == "rhods-operator.1.33.0") | .relatedImages[] | select(.name == "odh_notebook_controller_image") | .image'
      registry.redhat.io/rhods/odh-notebook-controller-rhel8@sha256:c1f27c9275e718375fc53573f1d99544b8e05b95fc1ab0643c53425feeeac76d
      $ # Look at the user specified in the container image manifest
      $ skopeo inspect --config docker://registry.redhat.io/rhods/odh-notebook-controller-rhel8@sha256:c1f27c9275e718375fc53573f1d99544b8e05b95fc1ab0643c53425feeeac76d | jq '.config.User'
      "rhods"
      $ # Compare that user to another image in the RHODS operator, such as the ModelMesh controller image (picked at random because it's the next image in the list)
      $ podman run --rm --entrypoint sh registry.redhat.io/redhat/redhat-operator-index:v4.13 -c 'cat /configs/rhods-operator/catalog.json' | jq -r 'select(.name == "rhods-operator.1.33.0") | .relatedImages[] | select(.name == "odh_modelmesh_controller_image") | .image'
      registry.redhat.io/rhods/odh-modelmesh-serving-controller-rhel8@sha256:dc25fb462552040d43159642c1220808d76cc60f604db5ae30fc6727128ab44b
      $ skopeo inspect --config docker://registry.redhat.io/rhods/odh-modelmesh-serving-controller-rhel8@sha256:dc25fb462552040d43159642c1220808d76cc60f604db5ae30fc6727128ab44b | jq '.config.User'
      "2000"

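The distinction the commands above surface (a named user versus a numeric UID) can be sketched as a small shell helper. This is only an illustration of why "rhods" and "2000" are treated differently under runAsNonRoot, not the runtime's actual implementation; the function name classify_image_user is hypothetical.

```shell
#!/bin/sh
# Classify the User field of an image config for a runAsNonRoot-style check.
# Prints one of: unset, named, numeric-root, numeric-nonroot.
# NOTE: illustrative sketch only; classify_image_user is a hypothetical helper.
classify_image_user() {
    _user="${1%%:*}"            # drop an optional ":group" suffix, e.g. "1001:0"
    case "$_user" in
        "")       echo "unset";  return 0 ;;   # no USER set in the image
        *[!0-9]*) echo "named";  return 0 ;;   # non-numeric, e.g. "rhods"
    esac
    # Only digits remain, so a numeric comparison is safe here.
    if [ "$_user" -eq 0 ]; then
        echo "numeric-root"
    else
        echo "numeric-nonroot"
    fi
}
```

It could be fed from the same skopeo call used above, for example: classify_image_user "$(skopeo inspect --config docker://IMAGE | jq -r '.config.User // ""')". A result of "named" marks an image whose user cannot be confirmed as non-root without resolving the name inside the image.
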
      Although this image does in fact use a non-root user, the pod fails when the namespace is configured in this way. This isn't a normal or default configuration. An attempt to reproduce the customer's environment on a relatively bone-stock cluster (including through an OpenShift upgrade, as the customer indicated) did not exactly reproduce the issue. Even so, it should be trivial to fix, and fixing it would prevent the failure whenever a customer's environment leads the SCC/PSA sync controller to label this namespace in this way.

      As a workaround for anyone experiencing this issue, we followed the documentation to disable SCC/PSA sync, manually set the namespace to enforce the "privileged" PSA level, and then removed the mutation from the deployment, which gave a successful rollout of the ODH Notebook Controller pod. See here for reference:

      https://docs.openshift.com/container-platform/4.13/authentication/understanding-and-managing-pod-security-admission.html#security-context-constraints-psa-opting_understanding-and-managing-pod-security-admission

      https://kubernetes.io/docs/concepts/security/pod-security-admission/
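
The workaround steps described above might look roughly like the following. This is a sketch under assumptions, not the exact commands that were run: the namespace and the securityContext path come from this issue, the two labels are the ones described in the linked documentation, and the deployment name odh-notebook-controller-manager is an assumption that should be checked with "oc get deployments -n redhat-ods-applications". The helper only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Print (do not run) the oc commands for the workaround, for review.
# $1: namespace (default taken from this issue)
# $2: deployment name (ASSUMPTION; verify it on the cluster first)
# print_workaround is a hypothetical helper name.
print_workaround() {
    _ns="${1:-redhat-ods-applications}"
    _deploy="${2:-odh-notebook-controller-manager}"
    cat <<EOF
oc label namespace $_ns security.openshift.io/scc.podSecurityLabelSync=false --overwrite
oc label namespace $_ns pod-security.kubernetes.io/enforce=privileged --overwrite
oc patch deployment $_deploy -n $_ns --type=json -p '[{"op":"remove","path":"/spec/template/spec/securityContext/runAsNonRoot"}]'
EOF
}
```

Running print_workaround with no arguments prints three lines: the first disables SCC/PSA label sync for the namespace, the second sets the enforced PSA level to "privileged", and the third removes the runAsNonRoot field from the deployment's pod template. The JSON patch "remove" operation fails if the field is not present, which is a useful signal that the deployment was not mutated in the way this issue describes.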

       


          People

            Unassigned Unassigned
            rhn-support-jharmiso James Harmison
            RHOAI IDE