Data Foundation Bugs / DFBUGS-849

ROSA HCP fails to host OSD on a non-default machinepool node


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Versions: odf-4.17, odf-4.16
    • Components: ceph-csi-operator, rook
    • Severity: Important

      Description of problem - Provide a detailed description of the issue encountered, including logs/command-output snippets and screenshots if the issue is observed in the UI:

      Upon storage deployment, the StorageCluster typically utilizes nodes available in a single machine pool. During new deployments and day-two node operations, the StorageCluster can operate across multiple machine pools, but the OSD pod fails to relocate to a new node and gets stuck in Pending.
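      For reference, the OSD pods and the nodes they are scheduled on can be listed with a command along these lines (the openshift-storage namespace and the app=rook-ceph-osd label assume a default internal-mode ODF install):

        oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide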

      OSD pod description (from oc describe):

      Topology Spread Constraints: topology.rook.io/rack:DoNotSchedule when max skew 1 is exceeded for selector app in (rook-ceph-osd)
      Events:
        Type     Reason            Age  From               Message
        ----     ------            ---  ----               -------
        Warning  FailedScheduling  40s  default-scheduler  0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
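      For context, the scheduling behavior above comes from the rack-based topology spread constraint that Rook places on the OSD pods, combined with node placement derived from the rack labels. A minimal sketch of what that constraint looks like in a pod spec, reconstructed from the describe output above (the exact stanza generated by Rook may differ by version):

        topologySpreadConstraints:
          # "DoNotSchedule when max skew 1 is exceeded for selector app in (rook-ceph-osd)"
          - maxSkew: 1
            topologyKey: topology.rook.io/rack
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - rook-ceph-osd

      In this case the replacement node was labeled rack0 rather than the expected rack1 (see the investigation below), so the OSD that belongs to rack1 has no schedulable node matching its placement and stays Pending.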

       

       

      StorageCluster nodeTopology:

       status:
         ...
         nodeTopologies:
           labels:
             failure-domain.beta.kubernetes.io/region:
               - us-west-2
             failure-domain.beta.kubernetes.io/zone:
               - us-west-2a
             kubernetes.io/hostname:
               - ip-10-0-0-235.us-west-2.compute.internal
               - ip-10-0-0-60.us-west-2.compute.internal
               - ip-10-0-0-71.us-west-2.compute.internal
               - ip-10-0-0-94.us-west-2.compute.internal
             topology.rook.io/rack:
               - rack0
               - rack1
               - rack2
         phase: Ready
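      On a live cluster this field can be inspected with something like the following (the StorageCluster name ocs-storagecluster and the openshift-storage namespace are the internal-mode defaults and are assumptions here):

        oc get storagecluster ocs-storagecluster -n openshift-storage \
          -o jsonpath='{.status.nodeTopologies.labels}'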

      Investigation from kmajumder@redhat.com 

      The new node is ip-10-0-0-145.us-west-2.compute.internal, if I am not wrong. It should have been labeled as rack1. I will check in the code where the new nodes get labeled from.

      oc get nodes -o custom-columns='NAME:.metadata.name,STATUS:.status.conditions[-1].type,TAINTS:.spec.taints[*].key,RACK:.metadata.labels.topology\.rook\.io/rack,INSTANCE_TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type'
      NAME                                       STATUS   TAINTS                             RACK    INSTANCE_TYPE
      ip-10-0-0-109.us-west-2.compute.internal   Ready    <none>                             rack0   m5.12xlarge
      ip-10-0-0-145.us-west-2.compute.internal   Ready    <none>                             rack0   m5.xlarge
      ip-10-0-0-149.us-west-2.compute.internal   Ready    node.kubernetes.io/unschedulable   rack1   m5.12xlarge
      ip-10-0-0-40.us-west-2.compute.internal    Ready    <none>                             rack2   m5.12xlarge

       

      The OCP platform infrastructure and deployment type (AWS, Bare Metal, VMware, etc. Please clarify if it is platform agnostic deployment), (IPI/UPI):

      ROSA HCP

      The ODF deployment type (Internal, External, Internal-Attached (LSO), Multicluster, DR, Provider, etc):

      Internal

       

      The version of all relevant components (OCP, ODF, RHCS, ACM whichever is applicable):

       

       

      Does this issue impact your ability to continue to work with the product?

      Yes

       

      Is there any workaround available to the best of your knowledge?

      Label the node with the expected rack label manually (see the command sketch below).
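      A minimal sketch of that workaround, using the node and rack from the investigation above (adjust the node name and rack value to match the missing failure domain):

        oc label node ip-10-0-0-145.us-west-2.compute.internal topology.rook.io/rack=rack1 --overwrite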

       

      Can this issue be reproduced? If so, please provide the hit rate

      Yes, 5/5.

       

      Can this issue be reproduced from the UI?

      yes

      If this is a regression, please provide more details to justify this:

      new deployment type

      Steps to Reproduce (see the command sketch after this list):
      1. Create a machine pool with a node and label it with the "openshift-storage" tag
      2. Select any node hosting an OSD and cordon it
      3. Delete the OSD pod on the cordoned node
      4. Verify that all OSD pods are running
      5. Verify that rebalancing completes in a reasonable time
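      A rough command-level sketch of these steps on ROSA HCP. The cluster name, machine pool name, and node name are placeholders, and the storage label and namespace assume a default internal-mode ODF install:

        # 1. Add a machine pool whose node carries the ODF storage label
        rosa create machinepool --cluster <cluster-name> --name odf-extra --replicas 1 \
          --labels cluster.ocs.openshift.io/openshift-storage=""
        # 2. Cordon a node that currently hosts an OSD
        oc adm cordon <osd-node-name>
        # 3. Delete the OSD pod running on that node so it has to reschedule
        oc delete pod -n openshift-storage -l app=rook-ceph-osd \
          --field-selector spec.nodeName=<osd-node-name>
        # 4. Check that all OSD pods come back Running (with this bug, one stays Pending)
        oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide
        # 5. Watch Ceph health and rebalancing through the CephCluster status
        oc get cephcluster -n openshift-storage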

      The exact date and time when the issue was observed, including timezone details:

       

      Actual results:

      The OSD pod is stuck in Pending and the Ceph cluster is not healthy.
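      One way to confirm the degraded health from the CLI (the CephCluster name ocs-storagecluster-cephcluster is the internal-mode default and is an assumption here):

        oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage \
          -o jsonpath='{.status.ceph.health}'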

       

      Expected results:

      The StorageCluster can operate across multiple machine pools, and rebalancing completes in a reasonable time.

      Logs collected and log location:

      OCS and OCP must-gather: https://url.corp.redhat.com/e7a31de
      Logs were collected after the issue was fixed by @Kaustav Majumder.

      Additional info:

      Should be addressed on ROSA HCP 4.16 and 4.17.

      We need to ensure that nodes across different machine pools can be used in a new StorageSystem installation.

              Assignee: kmajumder@redhat.com Kaustav Majumder
              Reporter: rh-ee-dosypenk Daniel Osypenko