Data Foundation Bugs / DFBUGS-137

[2302235] [UI deployment][ODF on ROSA HCP] Incorrect labels on the worker nodes


    • Bug
    • Resolution: Unresolved
    • Critical
    • odf-4.17.1
    • odf-4.16
    • management-console
    • Committed
    • 4.17.0-77
      Cause: ODF is installed in a namespace other than "openshift-storage" (ROSA use case).

      Consequence:
      The UI labels the nodes during StorageSystem deployment, adding a dynamic label "cluster.ocs.openshift.io/<CLUSTER_NAMESPACE>: ''" (where "CLUSTER_NAMESPACE" is the namespace in which the StorageSystem is being created).

      The ODF/OCS operators, on the other hand, still expect the label to be static and always equal to "cluster.ocs.openshift.io/openshift-storage: ''", irrespective of where ODF is installed or where the StorageSystem is deployed.

      Fix:
      The UI will now always add the static label "cluster.ocs.openshift.io/openshift-storage: ''" to the nodes.

      Result:
      Install should proceed as expected now.

      Workaround:
      Manually label the nodes on which the StorageSystem-related workloads should be deployed.
      Example:
      To label all the worker nodes: `oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""`.

      To label specific node(s): `oc label node <NODE_NAME> cluster.ocs.openshift.io/openshift-storage=""`
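
      To confirm the labels afterwards, a minimal check (assuming the static label key above) is to list the nodes that the operators will select: `oc get nodes -l cluster.ocs.openshift.io/openshift-storage="" --show-labels`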
    • Bug Fix
    • Approved

      Description of problem:

      ----------

      On a fresh ODF deployment installed in the 'odf-storage' namespace, the nodes carry the following label:

      oc get nodes -l cluster.ocs.openshift.io/odf-storage=""
      NAME STATUS ROLES AGE VERSION
      ip-10-0-0-144.us-west-2.compute.internal Ready worker 18h v1.29.6+aba1e8d
      ip-10-0-0-181.us-west-2.compute.internal Ready worker 18h v1.29.6+aba1e8d
      ip-10-0-0-45.us-west-2.compute.internal Ready worker 21h v1.29.6+aba1e8d
      ip-10-0-0-70.us-west-2.compute.internal Ready worker 18h v1.29.6+aba1e8d
      ip-10-0-0-78.us-west-2.compute.internal Ready worker 18h v1.29.6+aba1e8d
      ip-10-0-0-95.us-west-2.compute.internal Ready worker 21h v1.29.6+aba1e8d

      That triggers a "Not enough nodes found" error on the StorageCluster (screenshot attached).
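
      For comparison, the static label that the operators select on matches no nodes here, which is consistent with the "Expected 3, found 0" condition below (the empty output shown is an assumption based on that condition):

      oc get nodes -l cluster.ocs.openshift.io/openshift-storage=""
      No resources found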

      StorageCluster in error state
      oc get storagecluster -A
      NAMESPACE NAME AGE PHASE EXTERNAL CREATED AT VERSION
      odf-storage ocs-storagecluster 7m25s Error 2024-07-31T15:53:39Z 4.16.0
      [jenkins@temp-jagent-dosypenk-r217 terraform-vpc-example]$ oc describe storagecluster ocs-storagecluster -nodf-storage
      Name: ocs-storagecluster
      Namespace: odf-storage
      Labels: <none>
      Annotations: uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
      API Version: ocs.openshift.io/v1
      Kind: StorageCluster
      Metadata:
      Creation Timestamp: 2024-07-31T15:53:39Z
      Finalizers:
      storagecluster.ocs.openshift.io
      Generation: 2
      Owner References:
      API Version: odf.openshift.io/v1alpha1
      Kind: StorageSystem
      Name: ocs-storagecluster-storagesystem
      UID: 2dee21a8-8039-4640-8fd1-9e7a669356b6
      Resource Version: 101564
      UID: 1d04f184-50c7-4f6f-9777-0f197a2fc1d1
      Spec:
      Arbiter:
      Encryption:
      Key Rotation:
      Schedule: @weekly
      Kms:
      External Storage:
      Managed Resources:
      Ceph Block Pools:
      Ceph Cluster:
      Ceph Config:
      Ceph Dashboard:
      Ceph Filesystems:
      Data Pool Spec:
      Application:
      Erasure Coded:
      Coding Chunks: 0
      Data Chunks: 0
      Mirroring:
      Quotas:
      Replicated:
      Size: 0
      Status Check:
      Mirror:
      Ceph Non Resilient Pools:
      Count: 1
      Resources:
      Volume Claim Template:
      Metadata:
      Spec:
      Resources:
      Status:
      Ceph Object Store Users:
      Ceph Object Stores:
      Ceph RBD Mirror:
      Daemon Count: 1
      Ceph Toolbox:
      Mirroring:
      Network:
      Connections:
      Encryption:
      Multi Cluster Service:
      Node Topologies:
      Resource Profile: lean
      Storage Device Sets:
      Config:
      Count: 1
      Data PVC Template:
      Metadata:
      Spec:
      Access Modes:
      ReadWriteOnce
      Resources:
      Requests:
      Storage: 2Ti
      Storage Class Name: gp3-csi
      Volume Mode: Block
      Status:
      Name: ocs-deviceset-gp3-csi
      Placement:
      Portable: true
      Prepare Placement:
      Replica: 3
      Resources:
      Status:
      Conditions:
      Last Heartbeat Time: 2024-07-31T15:53:40Z
      Last Transition Time: 2024-07-31T15:53:40Z
      Message: Version check successful
      Reason: VersionMatched
      Status: False
      Type: VersionMismatch
      Last Heartbeat Time: 2024-07-31T15:59:08Z
      Last Transition Time: 2024-07-31T15:53:40Z
      Message: Error while reconciling: Not enough nodes found: Expected 3, found 0
      Reason: ReconcileFailed
      Status: False
      Type: ReconcileComplete
      Last Heartbeat Time: 2024-07-31T15:53:40Z
      Last Transition Time: 2024-07-31T15:53:40Z
      Message: Initializing StorageCluster
      Reason: Init
      Status: False
      Type: Available
      Last Heartbeat Time: 2024-07-31T15:53:40Z
      Last Transition Time: 2024-07-31T15:53:40Z
      Message: Initializing StorageCluster
      Reason: Init
      Status: True
      Type: Progressing
      Last Heartbeat Time: 2024-07-31T15:53:40Z
      Last Transition Time: 2024-07-31T15:53:40Z
      Message: Initializing StorageCluster
      Reason: Init
      Status: False
      Type: Degraded
      Last Heartbeat Time: 2024-07-31T15:53:40Z
      Last Transition Time: 2024-07-31T15:53:40Z
      Message: Initializing StorageCluster
      Reason: Init
      Status: Unknown
      Type: Upgradeable
      Images:
      Ceph:
      Desired Image: registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:579e5358418e176194812eeab523289a0c65e366250688be3f465f1a633b026d
      Noobaa Core:
      Desired Image: registry.redhat.io/odf4/mcg-core-rhel9@sha256:5f56419be1582bf7a0ee0b9d99efae7523fbf781a88f8fe603182757a315e871
      Noobaa DB:
      Desired Image: registry.redhat.io/rhel9/postgresql-15@sha256:5c4cad6de1b8e2537c845ef43b588a11347a3297bfab5ea611c032f866a1cb4e
      Kms Server Connection:
      Phase: Error
      Version: 4.16.0
      Events: <none>
      [jenkins@temp-jagent-dosypenk-r217 terraform-vpc-example]$ oc get nodes -w
      NAME STATUS ROLES AGE VERSION
      ip-10-0-0-144.us-west-2.compute.internal Ready worker 39m v1.29.6+aba1e8d
      ip-10-0-0-181.us-west-2.compute.internal Ready worker 39m v1.29.6+aba1e8d
      ip-10-0-0-45.us-west-2.compute.internal Ready worker 3h30m v1.29.6+aba1e8d
      ip-10-0-0-70.us-west-2.compute.internal Ready worker 41m v1.29.6+aba1e8d
      ip-10-0-0-78.us-west-2.compute.internal Ready worker 43m v1.29.6+aba1e8d
      ip-10-0-0-95.us-west-2.compute.internal Ready worker 3h37m v1.29.6+aba1e8d

      ---------

      Workaround:
      oc label node -l node-role.kubernetes.io/worker cluster.ocs.openshift.io/openshift-storage=""
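
      After labelling, the StorageCluster should reconcile on its own; one way to watch it recover (assuming the 'odf-storage' namespace used above):

      oc get storagecluster -n odf-storage -w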

      ---------

      Version-Release number of selected component (if applicable):
      ODF full_version: 4.16.0-137

      ---------

      How reproducible:
      Install ODF on a ROSA HCP OCP 4.16 cluster

      Steps to Reproduce:
      1. Install ODF 4.16 on a ROSA HCP OCP 4.16 cluster

      ---------

      Actual results:
      The StorageCluster reports a "Not enough nodes found" error. ODF installation stalls; no CephFS or RBD storage classes are available.

      Expected results:
      No errors; ODF becomes available, the same as ODF on a regular AWS cluster.

      ---------

      Additional info:

      ODF installation screen recording - https://drive.google.com/file/d/1y84dNkaj68rov9nbJDAlhcnXwc3cJHs_/view?usp=drive_link

      Storage System installation screen recording - https://drive.google.com/file/d/12KUnujZmTAAC1H0YqnhXsWjD2PtjRblW/view?usp=sharing

              skatiyar@redhat.com Sanjal Katiyar
              rh-ee-dosypenk Daniel Osypenko