-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.14.z, 4.15.0, 4.16.0
-
Critical
-
No
-
Rejected
-
False
-
-
Fixes an SELinux policy issue which prevented the topology-updater agents from connecting to the kubelet to populate the per-NUMA resource allocation data, which in turn is necessary for the NUMA-aware scheduler to work.
-
Release Note Not Required
-
In Progress
-
Description of problem:
The cluster fails to install the TAS operator because of a permission-denied error when accessing the kubelet socket in the RTE pods, which prevents them from reaching the Running state.

[EDIT Feb 9, 2024] Note: this also occurs when installing production NROP builds (at this moment 4.14.1) on top of stable OCP > 4.14.6. However, Nokia hit this in their environment on 4.14.6, while internally we didn't manage to reproduce it there.

Upstream e2e test:
[Install] continuousIntegration with a running cluster with all the components [test_id:47574][tier1] should perform overall deployment and verify the condition is reported as available
/go/src/github.com/openshift-kni/numaresources-operator/test/e2e/install/install_test.go:66
...
I0116 09:26:29.871584 32860 install_test.go:393] NRO never reported available (1 DaemonSet)
I0116 09:26:29.871603 32860 install_test.go:396] daemonset openshift-numaresources/numaresourcesoperator-worker desired 3 scheduled 3 ready 0
I0116 09:26:30.087371 32860 install_test.go:420] DaemonSet openshift-numaresources/numaresourcesoperator-worker -> Pod openshift-numaresources/numaresourcesoperator-worker-5n5jf -> logs:
I0116 09:24:13.613642 1 main.go:65] starting resource-topology-exporter v0.4.14-rc1.dev70+g8efe4d6a 8efe4d6a go1.20.6
I0116 09:24:13.613792 1 main.go:294] using Topology Manager scope "pod" from "conf" (conf=pod) policy "single-numa-node" from "conf" (conf=single-numa-node)
I0116 09:24:13.614306 1 client.go:43] creating a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
I0116 09:24:13.614321 1 client.go:104] endpoint "unix:///host-podresources/kubelet.sock" -> protocol="unix" path="/host-podresources/kubelet.sock"
I0116 09:24:13.614590 1 client.go:48] created a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
I0116 09:24:13.614608 1 prometheus.go:113] prometheus endpoint disabled
I0116 09:24:13.614614 1 podexclude.go:99] > POD excludes:
I0116 09:24:13.614624 1 resourcetopologyexporter.go:127] using given Topology Manager policy "single-numa-node" scope "pod"
I0116 09:24:13.614653 1 notification.go:123] added interval every 10s
I0116 09:24:13.614678 1 resourcemonitor.go:153] resource monitor for "ip-10-0-13-251.us-west-2.compute.internal" starting
I0116 09:24:13.621700 1 resourcemonitor.go:165] machine topology: {"architecture":"smp","nodes":[{"id":0,"cores":[{"id":0,"index":0,"total_threads":2,"logical_processors":[0,2]},{"id":1,"index":1,"total_threads":2,"logical_processors":[1,3]}],"caches":[{"level":1,"type":"instruction","size_bytes":32768,"logical_processors":[0,2]},{"level":1,"type":"instruction","size_bytes":32768,"logical_processors":[1,3]},{"level":1,"type":"data","size_bytes":32768,"logical_processors":[0,2]},{"level":1,"type":"data","size_bytes":32768,"logical_processors":[1,3]},{"level":2,"type":"unified","size_bytes":524288,"logical_processors":[0,2]},{"level":2,"type":"unified","size_bytes":524288,"logical_processors":[1,3]},{"level":3,"type":"unified","size_bytes":8388608,"logical_processors":[0,1,2,3]}],"distances":[10],"memory":{"total_physical_bytes":16911433728,"total_usable_bytes":16389009408,"supported_page_sizes":[1073741824,2097152],"modules":null}}]}
I0116 09:24:13.621719 1 resourcemonitor.go:175] tracking node resources
F0116 09:24:13.621964 1 main.go:112] failed to execute: failed to initialize ResourceMonitor: error while updating node allocatable: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /host-podresources/kubelet.sock: connect: permission denied"
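For reference, the following is a minimal Go sketch of the connection flow visible in the log above: create a podresources client for unix:///host-podresources/kubelet.sock, then query the node allocatable data. This is an illustrative reproduction, not the operator's actual code; the standalone main and the grpc dial options are assumptions modeled on the client.go and resourcemonitor.go entries in the log. Since grpc dials lazily, the SELinux denial surfaces on the first RPC as the "rpc error: code = Unavailable ... connect: permission denied" seen above.

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	const socketPath = "/host-podresources/kubelet.sock"

	// Create the client. grpc dials lazily, so no connection is made yet
	// and the SELinux denial cannot show up at this point.
	conn, err := grpc.Dial("unix://"+socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
		}),
	)
	if err != nil {
		log.Fatalf("failed to create podresources client: %v", err)
	}
	defer conn.Close()

	cli := podresourcesapi.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// First actual RPC: on an affected node this is where
	// "connect: permission denied" is returned, matching the fatal
	// "error while updating node allocatable" in the pod log.
	resp, err := cli.GetAllocatableResources(ctx, &podresourcesapi.AllocatableResourcesRequest{})
	if err != nil {
		log.Fatalf("error while updating node allocatable: %v", err)
	}
	log.Printf("allocatable CPUs: %d, devices: %d", len(resp.CpuIds), len(resp.Devices))
}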
Version-Release number of selected component (if applicable):
4.14-4.16; 4.13 has not been verified yet.
How reproducible:
Always in upstream CI.
Steps to Reproduce:
1. Deploy the latest OCP nightly.
2. Install NROP and notice that the DaemonSet never becomes available because of permission denied when accessing the kubelet socket (according to the DaemonSet pod logs).
Actual results:
NROP RTE pods fail to run
Expected results:
The installation should complete successfully.
Additional info:
This is suspected to be caused by recent changes to the container-selinux package in RHCOS: https://github.com/containers/container-selinux/pull/291 and https://github.com/containers/container-selinux/pull/295. Affected package versions start at 2.227.
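Given the container-selinux suspicion, a small standalone diagnostic along the following lines can help confirm the denial from inside the RTE pod's SELinux context. This is a hypothetical sketch, not a supported tool: the socket path is the one from the logs above, and reading the label through the security.selinux xattr is only informative.

package main

import (
	"fmt"
	"net"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	const sock = "/host-podresources/kubelet.sock" // path mounted into the RTE pods

	// Report the SELinux label of the socket file; with the suspect
	// container-selinux versions (2.227 and later) the container domain
	// is no longer allowed to connect to it.
	buf := make([]byte, 256)
	if n, err := unix.Getxattr(sock, "security.selinux", buf); err == nil {
		fmt.Printf("socket SELinux label: %s\n", strings.TrimRight(string(buf[:n]), "\x00"))
	} else {
		fmt.Fprintf(os.Stderr, "getxattr failed: %v\n", err)
	}

	// Attempt the same unix connect the podresources client performs;
	// on an affected node this fails with EACCES (permission denied).
	conn, err := net.Dial("unix", sock)
	if err != nil {
		fmt.Fprintf(os.Stderr, "dial failed: %v\n", err)
		os.Exit(1)
	}
	conn.Close()
	fmt.Println("connect to kubelet.sock succeeded")
}

If run on an affected node, the dial should fail with "connect: permission denied", and the corresponding AVC denial should be visible in the node's audit log.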
- clones
-
OCPBUGS-27488 4.14 failed installing NROP: failed to connect to kubelet.sock
- Closed
- links to
-
RHBA-2023:124767 OpenShift Container Platform 4.13.37 low-latency extras update
-
RHEA-2023:125676 OpenShift Container Platform 4.14.12 low-latency extras update