OCPBUGS-30064

Fails to apply performanceprofile, node stuck on Ready/NotReady, SchedulingDisabled


Details

    • No
    • CNF Compute Sprint 250, CNF Compute Sprint 251
    • 2
    • False
    • 2024-03-11: The must-gather logs did not show an issue; asked for new logs if possible. A similar issue is linked to this bug. Will follow up with a live deploy session on the environment if no evidence comes up.

    Description

      Description of problem:

      Applying a performanceprofile fails: after the complete reboot of the cluster nodes that downgrades the cgroup version from v2 to v1, the first node is stuck on SchedulingDisabled.
      Two types of failure were observed:
        - first, and most common: the relevant mcp stays paused and the node is stuck on Ready,SchedulingDisabled. This happens on the first reboot, the one that should downgrade the cgroup version from v2 to v1.
        - second: the node rebooted (the additional reboot that applies the changes), the changes were applied, and the node is stuck on NotReady,SchedulingDisabled in the Updating state forever.

      Version-Release number of selected component (if applicable):

          4.15.0 (GA)

      How reproducible:

          always

      Steps to Reproduce:

          1. deploy a disconnected cluster
          2. apply or create the performanceprofile config
          3.1. wait for the relevant mcp node to change state to Ready,SchedulingDisabled
          3.2. wait for the complete cluster reboot
          4. wait for another reboot, only of the relevant mcp, to apply the pp config on the nodes
          

      Actual results:

          first case: relevant mcp node stuck on Ready,SchedulingDisabled with a paused mcp; no reboot.
          second case: node stuck on NotReady,SchedulingDisabled in the Updating state
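
The two cases can be told apart from the node status string quoted above. The following is a minimal Python sketch, assuming the status is a comma-separated string such as the STATUS column printed by `oc get nodes`. The function name `classify_node_status` and the labels it returns are hypothetical and are not part of any OpenShift tooling.

```python
def classify_node_status(status: str) -> str:
    """Map a node status string to the failure case described in this report."""
    # The status is a comma-separated list, e.g. "Ready,SchedulingDisabled".
    parts = {p.strip() for p in status.split(",") if p.strip()}
    if "SchedulingDisabled" not in parts:
        # The node is not cordoned, so neither failure case applies.
        return "schedulable"
    if "NotReady" in parts:
        return "second case"
    if "Ready" in parts:
        return "first case"
    return "unknown"


if __name__ == "__main__":
    for s in ("Ready,SchedulingDisabled", "NotReady,SchedulingDisabled", "Ready"):
        print(s, "->", classify_node_status(s))
```

A node also passes through a SchedulingDisabled status during a normal update, so the classification only points to a failure when the status persists.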

      Expected results:

          relevant mcp nodes reboot, the pp config is applied to the nodes, and cgroup is downgraded to v1
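
Whether a node was actually downgraded to cgroup v1 can be checked on the node itself. The following is a minimal Python sketch, assuming the cgroup hierarchy is mounted at `/sys/fs/cgroup` and relying on the fact that a cgroup v2 unified hierarchy exposes a `cgroup.controllers` file at its root. The function name `cgroup_version` and its `root` parameter are hypothetical, introduced only for illustration.

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2", "v1", or "unknown" for the cgroup hierarchy mounted at root."""
    base = Path(root)
    if not base.is_dir():
        # No cgroup mount at this path; nothing can be concluded.
        return "unknown"
    # The root of a cgroup v2 unified hierarchy contains cgroup.controllers;
    # a v1 mount has per-controller directories there instead.
    return "v2" if (base / "cgroup.controllers").is_file() else "v1"


if __name__ == "__main__":
    print(cgroup_version())
```

The script has to run on the node itself, for example from a debug shell on that node. In the expected outcome described above it would report `v1` on every node of the relevant mcp.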

      Additional info:

          For the first case: manually un-pausing the mcp has no effect; the node stays stuck on the Ready,SchedulingDisabled state and the mcp stays in the Updating state forever. None of the nodes belonging to this mcp changed cgroup to v1.
      
      
      Logs: must-gather archives for both cases can be found at https://file.emea.redhat.com/~elgerman/OCPBUGS-30064/
      (each log was collected on a clean, new OCP deployment)
      
      

       

          People

            yquinn@redhat.com Yanir Quinn
            elgerman Elena German
            Gowrishankar Rajaiyan Gowrishankar Rajaiyan