OCPBUGS-43730

NROP: RTE pods fail when installing 4.17.z operator on 4.18 OCP

Release Note Text (Known Issue):
      Starting with OCP 4.18, OpenShift ships an updated default SELinux policy that allows installing the numaresources-operator and running the RTE pods without the custom SELinux policy that was previously required and delivered through MachineConfigs. That MachineConfig-based delivery required rebooting the configured worker nodes.
      Upgrading the operator from 4.17 to 4.18 is expected to take time while it reconciles to the new settings, because the MachineConfig previously created by the operator is removed. During that time, RTE pods may enter a CrashLoopBackOff state. This is expected and self-heals within a few minutes, once the reconciliation loop completes. Users can review the operator status by inspecting the conditions and the extra information on the created CR for details about what is pending.
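      For example, the conditions can be inspected with something along these lines (a minimal sketch; the default CR name numaresourcesoperator and the nodetopology.openshift.io API group are assumptions, not taken from this report):

      # List the operator's status conditions to see what is still reconciling.
      oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
          -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\t"}{.message}{"\n"}{end}'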
      Users who want to keep using the custom SELinux policy, or who want to avoid a longer upgrade, can set the `"config.node.openshift-kni.io/selinux-policy":"custom"` annotation on the NUMAResourcesOperator object before upgrading. Please note: once the annotation is removed, the operator proceeds to delete the MachineConfig and changes the SELinux policy, causing worker node reboots. This lets users control when to perform this possibly lengthy operation.
      Please note that the annotation is not considered part of the API; it will be removed and no longer supported as of the 4.19 release.
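      For illustration, setting and later removing the annotation could look like this (a sketch, again assuming the default CR name numaresourcesoperator):

      # Keep the custom SELinux policy across the upgrade.
      oc annotate numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
          config.node.openshift-kni.io/selinux-policy=custom

      # Removing the annotation later triggers the MachineConfig deletion,
      # the policy switch, and the resulting worker node reboots.
      oc annotate numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator \
          config.node.openshift-kni.io/selinux-policy-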

      Description of problem:

      When installing NROP 4.17.z on OCP 4.18, the RTE pods get stuck in CrashLoopBackOff.

       

      Version-Release number of selected component (if applicable):

      NROP 4.17.z installed on OCP 4.18

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Install any NROP 4.17.z build on OCP 4.18; the issue reproduces (see the install sketch below)
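      For reference, a minimal OLM install sketch. The openshift-numaresources namespace matches the output below; the package name numaresources-operator, the redhat-operators catalog source, and the 4.17 channel are assumptions about the usual NROP delivery, not taken from this report:

      # Create the namespace, OperatorGroup, and a Subscription pinned to the 4.17 channel.
      cat <<EOF | oc create -f -
      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-numaresources
      ---
      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        targetNamespaces:
          - openshift-numaresources
      ---
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        channel: "4.17"
        name: numaresources-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      EOF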

      Extra details:

      [root@helix36 ~]# oc get pods -n openshift-numaresources
      NAME                                                READY   STATUS             RESTARTS        AGE
      numaresources-controller-manager-666bd7f95d-j7596   1/1     Running            0               8m54s
      numaresourcesoperator-worker-cnf-z7f4m              1/2     CrashLoopBackOff   5 (2m27s ago)   5m19s
      

      Running oc describe on the affected pod:

      Events:
        Type     Reason          Age                    From               Message
        ----     ------          ----                   ----               -------
        Normal   Scheduled       48m                    default-scheduler  Successfully assigned openshift-numaresources/numaresourcesoperator-worker-cnf-df5cw to ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com
        Normal   AddedInterface  48m                    multus             Add eth0 [10.135.1.148/23] from ovn-kubernetes
        Normal   Pulled          48m                    kubelet            Container image "registry.redhat.io/openshift4/numaresources-rhel9-operator@sha256:e0d4722e0501ab8b3aad81e33539af17d65b949bf54579b272ee98fae58b8fbb" already present on machine
        Normal   Created         48m                    kubelet            Created container shared-pool-container
        Normal   Started         48m                    kubelet            Started container shared-pool-container
        Normal   Pulled          46m (x5 over 48m)      kubelet            Container image "registry.redhat.io/openshift4/numaresources-rhel9-operator@sha256:e0d4722e0501ab8b3aad81e33539af17d65b949bf54579b272ee98fae58b8fbb" already present on machine
        Normal   Created         46m (x5 over 48m)      kubelet            Created container resource-topology-exporter
        Normal   Started         46m (x5 over 48m)      kubelet            Started container resource-topology-exporter
        Warning  BackOff         3m17s (x208 over 48m)  kubelet            Back-off restarting failed container resource-topology-exporter in pod numaresourcesoperator-worker-cnf-df5cw_openshift-numaresources(3c2d9745-7514-4d2e-85f7-5dbb8a4e3df5)

      Getting the logs of the pod:

      [root@helix36 ~]# oc logs pod/numaresourcesoperator-worker-cnf-df5cw
      Defaulted container "resource-topology-exporter" out of: resource-topology-exporter, shared-pool-container
      I1023 09:23:33.824820       1 main.go:66] starting resource-topology-exporter 4.17.1 44f70579fcd67c1ebbd2aa338cebfc4712283874 go1.22.7 (Red Hat 1.22.7-1.el9_5) X:strictfipsruntime
      I1023 09:23:33.825128       1 main.go:307] using Topology Manager scope "container" from "conf" (conf=container) policy "single-numa-node" from "conf" (conf=single-numa-node)
      I1023 09:23:33.825566       1 client.go:43] creating a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I1023 09:23:33.825581       1 client.go:104] endpoint "unix:///host-podresources/kubelet.sock" -> protocol="unix" path="/host-podresources/kubelet.sock"
      I1023 09:23:33.825923       1 client.go:48] created a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I1023 09:23:33.825940       1 setup.go:90] metrics endpoint disabled
      I1023 09:23:33.825946       1 podexclude.go:99] > POD excludes:
      I1023 09:23:33.825954       1 resourcetopologyexporter.go:127] using given Topology Manager policy "single-numa-node" scope "container"
      I1023 09:23:33.825981       1 notification.go:123] added interval every 10s
      I1023 09:23:33.825997       1 resourcemonitor.go:153] resource monitor for "ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com" starting
      I1023 09:23:33.847823       1 resourcemonitor.go:175] tracking node resources
      F1023 09:23:33.848264       1 main.go:118] failed to execute: failed to initialize ResourceMonitor: error while updating node allocatable: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /host-podresources/kubelet.sock: connect: permission denied"

      On the affected node, checking /var/log/audit/audit.log for SELinux denials:

      [root@helix36 ~]# oc debug node/ocp4183523817-worker-0.libvirt.lab.eng.tlv2.redhat.com
      Starting pod/ocp4183523817-worker-0libvirtlabengtlv2redhatcom-debug-4dtzj ...
      To use host binaries, run `chroot /host`
      Pod IP: 192.168.122.79
      If you don't see a command prompt, try pressing enter.
      sh-5.1# chroot /host
      sh-5.1# tail -n 500 /var/log/audit/audit.log  | grep -i denied
      type=AVC msg=audit(1729673274.837:5117): avc:  denied  { write } for  pid=390531 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729673575.850:5150): avc:  denied  { write } for  pid=393240 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729673879.850:5177): avc:  denied  { write } for  pid=395969 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674188.863:5218): avc:  denied  { write } for  pid=399077 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674491.841:5251): avc:  denied  { write } for  pid=401753 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729674797.836:5276): avc:  denied  { write } for  pid=404337 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729675102.850:5317): avc:  denied  { write } for  pid=407229 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
      type=AVC msg=audit(1729675413.825:5344): avc:  denied  { write } for  pid=409881 comm="resource-topolo" name="kubelet.sock" dev="vda4" ino=56629082 scontext=system_u:system_r:rte.process:s0 tcontext=system_u:object_r:kubelet_var_lib_t:s0 tclass=sock_file permissive=0
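      The denials show the RTE container domain (scontext system_u:system_r:rte.process:s0) blocked from writing to the kubelet pod-resources socket, which is labeled kubelet_var_lib_t. As a sanity check, the socket's label can be confirmed from the same debug shell; the host path /var/lib/kubelet/pod-resources/kubelet.sock is an assumption inferred from the pod's /host-podresources mount:

      # Show the SELinux context of the pod-resources socket on the host.
      sh-5.1# ls -Z /var/lib/kubelet/pod-resources/kubelet.sock

      A context of system_u:object_r:kubelet_var_lib_t:s0 would match the tcontext in the AVC records above, consistent with the 4.17.z rte.process policy not covering this access on 4.18 hosts.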
      

      Actual results:

      The resource-topology-exporter container repeatedly crashes with "dial unix /host-podresources/kubelet.sock: connect: permission denied", leaving the RTE pod in CrashLoopBackOff.

      Expected results:

      The RTE pods start and remain Running.

      Additional info:

          
