OCPBUGS-43730

NROP: RTE pods fail when installing 4.17.z operator on 4.18 OCP

Release Note Text (Known Issue):
      Starting with OCP 4.18, OpenShift ships an updated default SELinux policy that allows installing the numaresources-operator and running the RTE pods without the custom SELinux policy that was previously required and delivered through MachineConfigs. That MachineConfig-based delivery required rebooting the configured worker nodes.
      Upgrading the operator from 4.17 to 4.18 is expected to take time while it reconciles to the new settings, because the MachineConfig previously created by the operator is removed. During that time, RTE pods may enter a CrashLoopBackOff state. This is expected and self-heals within a few minutes, once the reconciliation loop completes. Users can review the operator status by inspecting the conditions and the extra information on the created CR for details about what is pending.
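      For example, the conditions can be inspected with something along these lines (a minimal sketch; the default CR name numaresourcesoperator and the nodetopology.openshift.io API group are assumptions, not taken from this report):

      # List the operator's status conditions to see what is still reconciling.
      oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
          -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\t"}{.message}{"\n"}{end}'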
      Users who want to keep using the custom SELinux policy, or who want to avoid a longer upgrade, can set the `"config.node.openshift-kni.io/selinux-policy":"custom"` annotation on the NUMAResourcesOperator object before upgrading. Please note: once the annotation is removed, the operator proceeds to delete the MachineConfig and changes the SELinux policy, causing worker node reboots. This lets users control when to perform this possibly lengthy operation.
      Please note that the annotation is not considered part of the API; it will be removed and no longer supported as of the 4.19 release.
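      For illustration, setting and later removing the annotation could look like this (a sketch, again assuming the default CR name numaresourcesoperator):

      # Keep the custom SELinux policy across the upgrade.
      oc annotate numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
          config.node.openshift-kni.io/selinux-policy=custom

      # Removing the annotation later triggers the MachineConfig deletion,
      # the policy switch, and the resulting worker node reboots.
      oc annotate numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
          config.node.openshift-kni.io/selinux-policy-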

      Description of problem:

      When installing NROP 4.17.z on OCP 4.18, the RTE pods get stuck in CrashLoopBackOff.

       

      Version-Release number of selected component (if applicable):

      NROP 4.17.z installed on OCP 4.18

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Install any NROP 4.17.z build on OCP 4.18; the issue reproduces (see the install sketch below)
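      For reference, a minimal OLM install sketch. The openshift-numaresources namespace matches the output below; the package name numaresources-operator, the redhat-operators catalog source, and the 4.17 channel are assumptions about the usual NROP delivery, not taken from this report:

      # Create the namespace, OperatorGroup, and a Subscription pinned to the 4.17 channel.
      cat <<EOF | oc create -f -
      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-numaresources
      ---
      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        targetNamespaces:
          - openshift-numaresources
      ---
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        channel: "4.17"
        name: numaresources-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      EOF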

      Extra details:

      [root@helix36 ~]# oc get pods -n openshift-numaresources
      NAME                                                READY   STATUS             RESTARTS        AGE
      numaresources-controller-manager-666bd7f95d-j7596   1/1     Running            0               8m54s
      numaresourcesoperator-worker-cnf-z7f4m              1/2     CrashLoopBackOff   5 (2m27s ago)   5m19s
      

      Running oc describe on the affected pod:

      Events:
        Type     Reason          Age                    From               Message
        ----     ------          ----                   ----               -------
        Normal   Scheduled       48m                    default-scheduler  Successfully assigned openshift-numaresources/numaresourcesoperator-worker-cnf-df5cw to ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  48m                    multus             Add eth0 [10.135.1.148/23] from ovn-kubernetes
        Normal   Pulled          48m                    kubelet            Container image "registry.redhat.io/openshift4/numaresources-rhel9-operator@sha256:e0d4722e0501ab8b3aad81e33539af17d65b949bf54579b272ee98fae58b8fbb" already present on machine
        Normal   Created         48m                    kubelet            Created container shared-pool-container
        Normal   Started         48m                    kubelet            Started container shared-pool-container
        Normal   Pulled          46m (x5 over 48m)      kubelet            Container image "registry.redhat.io/openshift4/numaresources-rhel9-operator@sha256:e0d4722e0501ab8b3aad81e33539af17d65b949bf54579b272ee98fae58b8fbb" already present on machine
        Normal   Created         46m (x5 over 48m)      kubelet            Created container resource-topology-exporter
        Normal   Started         46m (x5 over 48m)      kubelet            Started container resource-topology-exporter
        Warning  BackOff         3m17s (x208 over 48m)  kubelet            Back-off restarting failed container resource-topology-exporter in pod numaresourcesoperator-worker-cnf-df5cw_openshift-numaresources(3c2d9745-7514-4d2e-85f7-5dbb8a4e3df5)

      Getting the logs of the pod:

      [root@helix36 ~]# oc logs pod/numaresourcesoperator-worker-cnf-df5cw
      Defaulted container "resource-topology-exporter" out of: resource-topology-exporter, shared-pool-container
      I1023 09:23:33.824820       1 main.go:66] starting resource-topology-exporter 4.17.1 44f70579fcd67c1ebbd2aa338cebfc4712283874 go1.22.7 (Red Hat 1.22.7-1.el9_5) X:strictfipsruntime
      I1023 09:23:33.825128       1 main.go:307] using Topology Manager scope "container" from "conf" (conf=container) policy "single-numa-node" from "conf" (conf=single-numa-node)
      I1023 09:23:33.825566       1 client.go:43] creating a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I1023 09:23:33.825581       1 client.go:104] endpoint "unix:///host-podresources/kubelet.sock" -> protocol="unix" path="/host-podresources/kubelet.sock"
      I1023 09:23:33.825923       1 client.go:48] created a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I1023 09:23:33.825940       1 setup.go:90] metrics endpoint disabled
      I1023 09:23:33.825946       1 podexclude.go:99] > POD excludes:
      I1023 09:23:33.825954       1 resourcetopologyexporter.go:127] using given Topology Manager policy "single-numa-node" scope "container"
      I1023 09:23:33.825981       1 notification.go:123] added interval every 10s
      I1023 09:23:33.825997       1 resourcemonitor.go:153] resource monitor for "ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com" starting
      I1023 09:23:33.847823       1 resourcemonitor.go:175] tracking node resources
      F1023 09:23:33.848264       1 main.go:118] failed to execute: failed to initialize ResourceMonitor: error while updating node allocatable: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /host-podresources/kubelet.sock: connect: permission denied"

      On the affected node, checking /var/log/audit/audit.log for SELinux denials:

      [root@helix36 ~]# oc debug node/ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com
      Starting pod/ocp4183523817-worker-0libvirtlabengtlv2redhatcom-debug-4dtzj ...
      To use host binaries, run `chroot /host`
      Pod IP: 192.168.122.79
      If you don't see a command prompt, try pressing enter.
      sh-5.1# chroot /host
      sh-5.1# tail -n 500 /var/log/audit/audit.log  | grep -i denied
      type=AVC msg=audit(1729673274.837:5117): avc:  denied  { write } for  pid=390531 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729673575.850:5150): avc:  denied  { write } for  pid=393240 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729673879.850:5177): avc:  denied  { write } for  pid=395969 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674188.863:5218): avc:  denied  { write } for  pid=399077 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674491.841:5251): avc:  denied  { write } for  pid=401753 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674797.836:5276): avc:  denied  { write } for  pid=404337 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729675102.850:5317): avc:  denied  { write } for  pid=407229 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729675413.825:5344): avc:  denied  { write } for  pid=409881 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
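      The denials show the RTE container domain (scontext system_u:system_r:rte.process:s0) blocked from writing to the kubelet pod-resources socket, which is labeled kubelet_var_lib_t. As a sanity check, the socket's label can be confirmed from the same debug shell; the host path /var/lib/kubelet/pod-resources/kubelet.sock is an assumption inferred from the pod's /host-podresources mount:

      # Show the SELinux context of the pod-resources socket on the host.
      sh-5.1# ls -Z /var/lib/kubelet/pod-resources/kubelet.sock

      A context of system_u:object_r:kubelet_var_lib_t:s0 would match the tcontext in the AVC records above, consistent with the 4.17.z rte.process policy not covering this access on 4.18 hosts.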
      

      Actual results:

      The resource-topology-exporter container repeatedly crashes with "dial unix /host-podresources/kubelet.sock: connect: permission denied", leaving the RTE pod in CrashLoopBackOff.

      Expected results:

      The RTE pods start and remain Running.

      Additional info:

          
