OpenShift Bugs / OCPBUGS-30624

4.13 NROP RTE: failed to connect to kubelet.sock


    • Critical
    • Release note text: Fixes a selinux policy issue which prevented the topology-updater agents from connecting to the kubelet to populate the per-NUMA resource allocation data, which in turn is necessary for the NUMA-aware scheduler to work.
    • Release Note Not Required
    • In Progress

      Description of problem:

          The cluster fails to install the TAS operator because the RTE pods are denied permission to access the kubelet socket, which prevents them from ever going Running.
      
      [EDIT Feb 9, 2024] Note: this also occurs when installing production NROP builds (at this moment 4.14.1) on top of stable OCP > 4.14.6. However, Nokia hit this in their environment on 4.14.6, while internally we did not manage to reproduce it there.
      
      u/s e2e test:
       
      [Install] continuousIntegration with a running cluster with all the components [test_id:47574][tier1] should perform overall deployment and verify the condition is reported as available
      /go/src/github.com/openshift-kni/numaresources-operator/test/e2e/install/install_test.go:66
      ...
      
      I0116 09:26:29.871584   32860 install_test.go:393] NRO never reported available (1 DaemonSet)
      I0116 09:26:29.871603   32860 install_test.go:396] daemonset openshift-numaresources/numaresourcesoperator-worker desired 3 scheduled 3 ready 0
      I0116 09:26:30.087371   32860 install_test.go:420] DaemonSet openshift-numaresources/numaresourcesoperator-worker -> Pod openshift-numaresources/numaresourcesoperator-worker-5n5jf -> logs:
      I0116 09:24:13.613642       1 main.go:65] starting resource-topology-exporter v0.4.14-rc1.dev70+g8efe4d6a 8efe4d6a go1.20.6
      I0116 09:24:13.613792       1 main.go:294] using Topology Manager scope "pod" from "conf" (conf=pod) policy "single-numa-node" from "conf" (conf=single-numa-node)
      I0116 09:24:13.614306       1 client.go:43] creating a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I0116 09:24:13.614321       1 client.go:104] endpoint "unix:///host-podresources/kubelet.sock" -> protocol="unix" path="/host-podresources/kubelet.sock"
      I0116 09:24:13.614590       1 client.go:48] created a podresources client for endpoint "unix:///host-podresources/kubelet.sock"
      I0116 09:24:13.614608       1 prometheus.go:113] prometheus endpoint disabled
      I0116 09:24:13.614614       1 podexclude.go:99] > POD excludes:
      I0116 09:24:13.614624       1 resourcetopologyexporter.go:127] using given Topology Manager policy "single-numa-node" scope "pod"
      I0116 09:24:13.614653       1 notification.go:123] added interval every 10s
      I0116 09:24:13.614678       1 resourcemonitor.go:153] resource monitor for "ip-10-0-13-251.us-west-2.compute.internal" starting
      I0116 09:24:13.621700       1 resourcemonitor.go:165] machine topology: {"architecture":"smp","nodes":[{"id":0,"cores":[{"id":0,"index":0,"total_threads":2,"logical_processors":[0,2]},{"id":1,"index":1,"total_threads":2,"logical_processors":[1,3]}],"caches":[{"level":1,"type":"instruction","size_bytes":32768,"logical_processors":[0,2]},{"level":1,"type":"instruction","size_bytes":32768,"logical_processors":[1,3]},{"level":1,"type":"data","size_bytes":32768,"logical_processors":[0,2]},{"level":1,"type":"data","size_bytes":32768,"logical_processors":[1,3]},{"level":2,"type":"unified","size_bytes":524288,"logical_processors":[0,2]},{"level":2,"type":"unified","size_bytes":524288,"logical_processors":[1,3]},{"level":3,"type":"unified","size_bytes":8388608,"logical_processors":[0,1,2,3]}],"distances":[10],"memory":{"total_physical_bytes":16911433728,"total_usable_bytes":16389009408,"supported_page_sizes":[1073741824,2097152],"modules":null}}]}
      I0116 09:24:13.621719       1 resourcemonitor.go:175] tracking node resources
      F0116 09:24:13.621964       1 main.go:112] failed to execute: failed to initialize ResourceMonitor: error while updating node allocatable: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /host-podresources/kubelet.sock: connect: permission denied"
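      For context, client.go and resourcemonitor.go in the log above are the RTE talking to the kubelet podresources API over the mounted socket. Below is a minimal, self-contained sketch of the same dial-and-query sequence; the socket path is taken from the log, the rest (including the assumption that the failing call is GetAllocatableResources) is illustrative and not the operator's actual source:

      // Minimal sketch (not the actual RTE code): dial the kubelet podresources
      // socket the way resource-topology-exporter does and query per-NUMA
      // allocatable resources. On an affected node the unix dial is what
      // returns "connect: permission denied".
      package main

      import (
          "context"
          "log"
          "net"
          "time"

          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
      )

      func main() {
          // Path as mounted inside the RTE pod (see the log above).
          socketPath := "/host-podresources/kubelet.sock"

          ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
          defer cancel()

          conn, err := grpc.DialContext(ctx, socketPath,
              grpc.WithTransportCredentials(insecure.NewCredentials()),
              grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
                  return (&net.Dialer{}).DialContext(ctx, "unix", addr)
              }),
          )
          if err != nil {
              log.Fatalf("cannot dial %q: %v", socketPath, err)
          }
          defer conn.Close()

          client := podresourcesv1.NewPodResourcesListerClient(conn)
          // The first RPC is where the SELinux denial surfaces; this mirrors the
          // "error while updating node allocatable ... permission denied" fatal line.
          resp, err := client.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
          if err != nil {
              log.Fatalf("GetAllocatableResources failed: %v", err)
          }
          log.Printf("allocatable CPUs: %d, devices: %d", len(resp.CpuIds), len(resp.Devices))
      }

      Note that the gRPC dial is lazy by default, so "created a podresources client" succeeds in the log and the denial only shows up on the first RPC, which is consistent with the fatal line above.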
      
      

      Version-Release number of selected component (if applicable):

   4.14 - 4.16; 4.13 has not been verified yet.

      How reproducible:

          Always in upstream (u/s) CI.

      Steps to Reproduce:

          1. Deploy the latest OCP nightly.
          2. Install NROP and notice that the DaemonSet never becomes available because of permission denied when accessing the kubelet socket (according to the DaemonSet pod logs); see the sketch after these steps.
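      A hedged sketch of the check in step 2, using client-go against the DaemonSet the operator creates. The namespace and DaemonSet name are taken from the log above; the kubeconfig handling and everything else is illustrative, not the e2e test's actual code:

      package main

      import (
          "context"
          "fmt"
          "log"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Assumes a kubeconfig in the default location (~/.kube/config).
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          cs, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              log.Fatal(err)
          }

          ds, err := cs.AppsV1().DaemonSets("openshift-numaresources").Get(
              context.Background(), "numaresourcesoperator-worker", metav1.GetOptions{})
          if err != nil {
              log.Fatal(err)
          }

          // On an affected cluster this prints "desired 3 scheduled 3 ready 0",
          // matching the failing e2e output above.
          fmt.Printf("desired %d scheduled %d ready %d\n",
              ds.Status.DesiredNumberScheduled,
              ds.Status.CurrentNumberScheduled,
              ds.Status.NumberReady)
      }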
          

      Actual results:

          NROP RTE pods fail to run

      Expected results:

          installation should complete successfully

      Additional info:

          This is suspected to be caused by the latest changes to the container-selinux package in RHCOS:
       https://github.com/containers/container-selinux/pull/291
       https://github.com/containers/container-selinux/pull/295
      The affected package versions start from 2.227.
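      To check on a node whether the relabeling from those container-selinux changes is in play, the SELinux context of the podresources socket can be read directly. A small diagnostic sketch follows; the host-side socket path is an assumption (the standard kubelet location), and this is not part of any fix:

      // Diagnostic sketch: read the SELinux context of the kubelet podresources
      // socket on the host, e.g. from a debug pod with the host filesystem mounted.
      package main

      import (
          "fmt"
          "log"
          "strings"

          "golang.org/x/sys/unix"
      )

      func main() {
          // Assumed host-side path of the socket that RTE pods mount as
          // /host-podresources/kubelet.sock.
          path := "/var/lib/kubelet/pod-resources/kubelet.sock"

          buf := make([]byte, 256)
          n, err := unix.Getxattr(path, "security.selinux", buf)
          if err != nil {
              log.Fatalf("cannot read SELinux label of %q: %v", path, err)
          }
          fmt.Printf("%s -> %s\n", path, strings.TrimRight(string(buf[:n]), "\x00"))
      }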
      
      

              Francesco Romani (fromani@redhat.com)
              Shereen Haj