-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.14.z, 4.15.0, 4.16.0
-
Critical
-
No
-
Rejected
-
False
-
-
Fixes an SELinux policy issue which prevented the topology-updater agents from connecting to the kubelet to populate the per-NUMA resource allocation data, which in turn is necessary for the NUMA-aware scheduler to work.
-
Release Note Not Required
-
In Progress
-
Description of problem:
The cluster fails to install the TAS operator because of a permission-denied error when accessing the kubelet socket in the RTE pods, which prevents them from reaching the Running state.

[EDIT Feb 9, 2024] Note: this also occurs when installing production NROP builds (at this moment 4.14.1) on top of stable OCP > 4.14.6. However, Nokia hit this in their environment on 4.14.6, while internally we didn't manage to reproduce it there.

Upstream e2e test:
[Install] continuousIntegration with a running cluster with all the components [test_id:47574][tier1] should perform overall deployment and verify the condition is reported as available
/go/src/github.com/openshift-kni/numaresources-operator/test/e2e/install/install_test.go:66
...
I0116 09:26:29.871584 32860 install_test.go:393] NRO never reported available (1 DaemonSet)
I0116 09:26:29.871603 32860 install_test.go:396] daemonset openshift-numaresources/numaresourcesoperator-worker desired 3 scheduled 3 ready 0
I0116 09:26:30.087371 32860 install_test.go:420] DaemonSet openshift-numaresources/numaresourcesoperator-worker -> Pod openshift-numaresources/numaresourcesoperator-worker-5n5jf -> logs:
I0116 09:24:13.613642 1 main.go:65] starting resource-topology-exporter v0.4.14-rc1.dev70+g8efe4d6a 8efe4d6a go1.20.6
I0116 09:24:13.613792 1 main.go:294] using Topology Manager scope "pod" from "conf" (conf=pod) policy "single-numa-node" from "conf" (conf=single-numa-node)
I0116 09:24:13.614306 1 client.go:43] creating a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
I0116 09:24:13.614321 1 client.go:104] endpoint "unix:///host-podresources/kubelet.sock" -> protocol="unix" path="/host-podresources/kubelet.sock"
I0116 09:24:13.614590 1 client.go:48] created a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
I0116 09:24:13.614608 1 prometheus.go:113] prometheus endpoint disabled
I0116 09:24:13.614614 1 podexclude.go:99] > POD excludes:
I0116 09:24:13.614624 1 resourcetopologyexporter.go:127] using given Topology Manager policy "single-numa-node" scope "pod"
I0116 09:24:13.614653 1 notification.go:123] added interval every 10s
I0116 09:24:13.614678 1 resourcemonitor.go:153] resource monitor for "ip-10-0-13-251.us-west-2.compute.internal" starting
I0116 09:24:13.621700 1 resourcemonitor.go:165] machine topology: {"architecture":"smp","nodes":[{"id":0,"cores":[{"id":0,"index":0,"total_threads":2,"logical_processors":[0,2]},{"id":1,"index":1,"total_threads":2,"logical_processors":[1,3]}],"caches":[{"level":1,"type":"instruction","size_bytes":32768,"logical_processors":[0,2]},{"level":1,"type":"instruction","size_bytes":32768,"logical_processors":[1,3]},{"level":1,"type":"data","size_bytes":32768,"logical_processors":[0,2]},{"level":1,"type":"data","size_bytes":32768,"logical_processors":[1,3]},{"level":2,"type":"unified","size_bytes":524288,"logical_processors":[0,2]},{"level":2,"type":"unified","size_bytes":524288,"logical_processors":[1,3]},{"level":3,"type":"unified","size_bytes":8388608,"logical_processors":[0,1,2,3]}],"distances":[10],"memory":{"total_physical_bytes":16911433728,"total_usable_bytes":16389009408,"supported_page_sizes":[1073741824,2097152],"modules":null}}]}
I0116 09:24:13.621719 1 resourcemonitor.go:175] tracking node resources
F0116 09:24:13.621964 1 main.go:112] failed to execute: failed to initialize ResourceMonitor: error while updating node allocatable: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /host-podresources/kubelet.sock: connect: permission denied"
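For reference, the following is a minimal Go sketch of the connection flow visible in the log above: create a podresources client for unix:///host-podresources/kubelet.sock, then query the node allocatable data. This is an illustrative reproduction, not the operator's actual code; the standalone main and the grpc dial options are assumptions modeled on the client.go and resourcemonitor.go entries in the log. Since grpc dials lazily, the SELinux denial surfaces on the first RPC as the "rpc error: code = Unavailable ... connect: permission denied" seen above.

package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	const socketPath = "/host-podresources/kubelet.sock"

	// Create the client. grpc dials lazily, so no connection is made yet
	// and the SELinux denial cannot show up at this point.
	conn, err := grpc.Dial("unix://"+socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", socketPath)
		}),
	)
	if err != nil {
		log.Fatalf("failed to create podresources client: %v", err)
	}
	defer conn.Close()

	cli := podresourcesapi.NewPodResourcesListerClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// First actual RPC: on an affected node this is where
	// "connect: permission denied" is returned, matching the fatal
	// "error while updating node allocatable" in the pod log.
	resp, err := cli.GetAllocatableResources(ctx, &podresourcesapi.AllocatableResourcesRequest{})
	if err != nil {
		log.Fatalf("error while updating node allocatable: %v", err)
	}
	log.Printf("allocatable CPUs: %d, devices: %d", len(resp.CpuIds), len(resp.Devices))
}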
Version-Release number of selected component (if applicable):
4.14-4.16; 4.13 has not been verified yet.
How reproducible:
Always in upstream CI.
Steps to Reproduce:
1. Deploy the latest OCP nightly.
2. Install NROP and notice that the DaemonSet never becomes available because of permission denied when accessing the kubelet socket (according to the DaemonSet pod logs).
Actual results:
NROP RTE pods fail to run
Expected results:
The installation should complete successfully.
Additional info:
This is suspected to be caused by recent changes to the container-selinux package in RHCOS: https://github.com/containers/container-selinux/pull/291 and https://github.com/containers/container-selinux/pull/295. Affected package versions start at 2.227.
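Given the container-selinux suspicion, a small standalone diagnostic along the following lines can help confirm the denial from inside the RTE pod's SELinux context. This is a hypothetical sketch, not a supported tool: the socket path is the one from the logs above, and reading the label through the security.selinux xattr is only informative.

package main

import (
	"fmt"
	"net"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	const sock = "/host-podresources/kubelet.sock" // path mounted into the RTE pods

	// Report the SELinux label of the socket file; with the suspect
	// container-selinux versions (2.227 and later) the container domain
	// is no longer allowed to connect to it.
	buf := make([]byte, 256)
	if n, err := unix.Getxattr(sock, "security.selinux", buf); err == nil {
		fmt.Printf("socket SELinux label: %s\n", strings.TrimRight(string(buf[:n]), "\x00"))
	} else {
		fmt.Fprintf(os.Stderr, "getxattr failed: %v\n", err)
	}

	// Attempt the same unix connect the podresources client performs;
	// on an affected node this fails with EACCES (permission denied).
	conn, err := net.Dial("unix", sock)
	if err != nil {
		fmt.Fprintf(os.Stderr, "dial failed: %v\n", err)
		os.Exit(1)
	}
	conn.Close()
	fmt.Println("connect to kubelet.sock succeeded")
}

If run on an affected node, the dial should fail with "connect: permission denied", and the corresponding AVC denial should be visible in the node's audit log.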
- clones
-
OCPBUGS-27488 4.14 failed installing NROP: failed to connect to kubelet.sock
- Closed
- links to
-
RHBA-2023:124767 OpenShift Container Platform 4.13.37 low-latency extras update
-
RHEA-2023:125676 OpenShift Container Platform 4.14.12 low-latency extras update