OpenShift Bugs / OCPBUGS-43730

NROP: RTE pods fail when installing 4.17.z operator on 4.18 OCP


    • Sprint: CNF Compute Sprint 261, CNF Compute Sprint 262
    • Release Note Text:
      NUMA Resources Operator now uses default SELinux policy

      With this release, the NUMA Resources Operator no longer creates a custom SELinux policy to enable the installation of Operator components on a target node. Instead, the Operator uses a built-in container SELinux policy. This change removes the additional node reboot that was previously required when applying a custom SELinux policy during an installation.

      IMPORTANT: During an upgrade to OCP 4.18, the NUMA Resources Operator removes the `MachineConfig` resource that previously applied the custom SELinux policy. This action triggers one additional reboot for each configured node. Also, resource topology exporter pods might temporarily enter a `CrashLoopBackOff` state during this process. This is expected behavior as the Operator transitions to the built-in SELinux policy. If you need to defer the additional reboot, you can apply an annotation before upgrading by using the following patch command:

      oc patch numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
        --type='json' -p '[{"op": "add", "path": "/spec/nodeGroups/0/annotations", "value": {"config.node.openshift-kni.io/selinux-policy":"custom"}}]'

      This example patches the `numaresourcesoperators` object named `numaresourcesoperator`, adding the `config.node.openshift-kni.io/selinux-policy:"custom"` annotation to the `nodeGroups` entry at index `0`.

      You can use this annotation to manage the timing of the transition. However, when the annotation is removed, the target nodes will reboot. This annotation is not part of the `NUMAResourcesOperator` API and will be deprecated in a future release.
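
      To confirm that the annotation was applied before you start the upgrade, you can query the node group annotations directly. The following read-only command is an illustrative sketch, not part of the documented procedure:

      oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
        -o jsonpath='{.spec.nodeGroups[0].annotations}'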
    • Enhancement
    • Done
    • Install, Release Notes
    • Customer Facing

      Description of problem:

      When installing NROP 4.17.z on OCP 4.18, the RTE (resource-topology-exporter) pods get stuck in CrashLoopBackOff.

       

      Version-Release number of selected component (if applicable):

      NROP 4.17.z installed on OCP 4.18.

      How reproducible:

      Every time.

      Steps to Reproduce:

      1. Install any NROP 4.17.z build on an OCP 4.18 cluster; the issue reproduces on installation. (Verification commands are sketched below.)
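
      After the install, the following read-only commands can confirm which Operator version and channel the cluster is actually running. The namespace is the one used elsewhere in this report; the expectation of a 4.17 channel is an assumption about the reproducer setup:

      # Confirm the installed NROP version via its ClusterServiceVersion
      oc get csv -n openshift-numaresources
      # Inspect the Subscription; for this reproducer the channel should point at the 4.17.z stream
      oc get subscription -n openshift-numaresources -o yaml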

      Extra details:

      [root@helix36 ~]# oc get pods -n openshift-numaresources
      NAME                                                READY   STATUS             RESTARTS        AGE
      numaresources-controller-manager-666bd7f95d-j7596   1/1     Running            0               8m54s
      numaresourcesoperator-worker-cnf-z7f4m              1/2     CrashLoopBackOff   5 (2m27s ago)   5m19s
      

      Running oc describe on the affected pod shows the following events:

      Events:
        Type     Reason          Age                    From               Message
        ----     ------          ----                   ----               -------
        Normal   Scheduled       48m                    default-scheduler  Successfully assigned openshift-numaresources/numaresourcesoperator-worker-cnf-df5cw to ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  48m                    multus             Add eth0 [10.135.1.148/23] from ovn-kubernetes
        Normal   Pulled          48m                    kubelet            Container image "registry.redhat.io/openshift4/numaresources-rhel9-operator@sha256:e0d4722e0501ab8b3aad81e33539af17d65b949bf54579b272ee98fae58b8fbb" already present on machine
        Normal   Created         48m                    kubelet            Created container shared-pool-container
        Normal   Started         48m                    kubelet            Started container shared-pool-container
        Normal   Pulled          46m (x5 over 48m)      kubelet            Container image "registry.redhat.io/openshift4/numaresources-rhel9-operator@sha256:e0d4722e0501ab8b3aad81e33539af17d65b949bf54579b272ee98fae58b8fbb" already present on machine
        Normal   Created         46m (x5 over 48m)      kubelet            Created container resource-topology-exporter
        Normal   Started         46m (x5 over 48m)      kubelet            Started container resource-topology-exporter
        Warning  BackOff         3m17s (x208 over 48m)  kubelet            Back-off restarting failed container resource-topology-exporter in pod numaresourcesoperator-worker-cnf-df5cw_openshift-numaresources(3c2d9745-7514-4d2e-85f7-5dbb8a4e3df5)

      Getting the logs of the pod:

      [root@helix36 ~]# oc logs pod/numaresourcesoperator-worker-cnf-df5cw
      Defaulted container "resource-topology-exporter" out of: resource-topology-exporter, shared-pool-container
      I1023 09:23:33.824820       1 main.go:66] starting resource-topology-exporter 4.17.1 44f70579fcd67c1ebbd2aa338cebfc4712283874 go1.22.7 (Red Hat 1.22.7-1.el9_5) X:strictfipsruntime
      I1023 09:23:33.825128       1 main.go:307] using Topology Manager scope "container" from "conf" (conf=container) policy "single-numa-node" from "conf" (conf=single-numa-node)
      I1023 09:23:33.825566       1 client.go:43] creating a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I1023 09:23:33.825581       1 client.go:104] endpoint "unix:///host-podresources/kubelet.sock" -> protocol="unix" path="/host-podresources/kubelet.sock"
      I1023 09:23:33.825923       1 client.go:48] created a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I1023 09:23:33.825940       1 setup.go:90] metrics endpoint disabled
      I1023 09:23:33.825946       1 podexclude.go:99] > POD excludes:
      I1023 09:23:33.825954       1 resourcetopologyexporter.go:127] using given Topology Manager policy "single-numa-node" scope "container"
      I1023 09:23:33.825981       1 notification.go:123] added interval every 10s
      I1023 09:23:33.825997       1 resourcemonitor.go:153] resource monitor for "ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com" starting
      I1023 09:23:33.847823       1 resourcemonitor.go:175] tracking node resources
      F1023 09:23:33.848264       1 main.go:118] failed to execute: failed to initialize ResourceMonitor: error while updating node allocatable: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /host-podresources/kubelet.sock: connect: permission denied"
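
      The fatal error points to an SELinux denial on the kubelet podresources socket rather than a crash inside the exporter itself. As a quick cluster-side check, you can look for the MachineConfig that the 4.17.z Operator previously used to apply its custom SELinux policy; the exact resource name varies, so this sketch uses a broad grep:

      oc get machineconfigs | grep -i -e selinux -e numaresources -e rte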

      On the affected node, the relevant SELinux denials show up in /var/log/audit/audit.log:

      [root@helix36 ~]# oc debug node/ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com
      Starting pod/ocp4183523817-worker-0libvirtlabengtlv2redhatcom-debug-4dtzj ...
      To use host binaries, run `chroot /host`
      Pod IP: 192.168.122.79
      If you don't see a command prompt, try pressing enter.
      sh-5.1# chroot /host
      sh-5.1# tail -n 500 /var/log/audit/audit.log  | grep -i denied
      type=AVC msg=audit(1729673274.837:5117): avc:  denied  { write } for  pid=390531 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729673575.850:5150): avc:  denied  { write } for  pid=393240 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729673879.850:5177): avc:  denied  { write } for  pid=395969 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674188.863:5218): avc:  denied  { write } for  pid=399077 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674491.841:5251): avc:  denied  { write } for  pid=401753 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674797.836:5276): avc:  denied  { write } for  pid=404337 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729675102.850:5317): avc:  denied  { write } for  pid=407229 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729675413.825:5344): avc:  denied  { write } for  pid=409881 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
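
      As an additional check from the same oc debug shell (after chroot /host), the following commands can show whether the custom RTE SELinux policy module is loaded and how the podresources socket is labeled on the host. The module name pattern and the host socket path are assumptions based on the denial messages above:

      # List loaded SELinux modules and look for the custom RTE policy
      semodule -l | grep -i rte
      # Show the SELinux label on the kubelet podresources socket (assumed host path)
      ls -Z /var/lib/kubelet/pod-resources/kubelet.sock
      # Summarize recent AVC denials instead of grepping audit.log by hand
      ausearch -m avc -ts recent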
      

      Actual results:

      The resource-topology-exporter container cannot connect to the kubelet podresources socket (SELinux AVC denial, permission denied), so the RTE worker pods remain in CrashLoopBackOff.

      Expected results:

      The RTE worker pods start successfully and stay in the Running state.

      Additional info:

          
