OpenShift Bugs / OCPBUGS-52468

[release-4.17] LVM Fails to add a new node in LVMVolumeGroupNodeStatus


    • Quality / Stability / Reliability
    • Important
    • OCPEDGE Sprint 269
    • Done
    • Bug Fix
    • Previously, LVMS failed to create the necessary resources on a newly added node in a long-running cluster, causing delays in volume creation. With this fix, LVMS creates all required resources as soon as the node becomes ready.

      Description of problem:

      When deploying a SNO + worker node in two separate steps, the vg-manager pod on the newly added worker node remains in a Running 0/1 state and does not create the expected LVMVolumeGroupNodeStatus resource for that node. This issue has been reproduced in both a customer environment and an internal lab.
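
      On a healthy node, vg-manager creates a per-node LVMVolumeGroupNodeStatus resource named after the node in the operator namespace. The sketch below is illustrative only (the apiVersion and spec layout are recalled from the LVMS CRDs and may differ between releases); the node name is the worker from this reproducer:

      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMVolumeGroupNodeStatus
      metadata:
        name: mno-worker-0.5g-deployment.lab   # named after the node
        namespace: openshift-storage
      spec:
        nodeStatus: []   # populated by vg-manager with per-VG status once the node is reconciled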
      

      Version-Release number of selected component (if applicable):

      lvms-operator.v4.18.0

      Steps to Reproduce:

      1. Deploy the initial SNO cluster
      2. Attach an extra disk to the master node
      3. Install the LVMS Operator
      4. Create the LVMCluster (a minimal sketch is included after the reproducer output below)
      5. Scale up the cluster and add an extra disk to the worker node
      6. Verify the pod status
      
      $ oc get pod -n openshift-storage
      NAME                             READY   STATUS    RESTARTS      AGE
      lvms-operator-85598566c6-tz9qw   1/1     Running   0             27m
      vg-manager-m25l6                 1/1     Running   1 (25m ago)   25m
      vg-manager-vsbvs                 0/1     Running   0             14m
      
      # oc logs vg-manager-8jxbr -f
      ...
      {"level":"error","ts":"2025-03-04T17:27:00Z","msg":"Reconciler error","controller":"lvmvolumegroup","controllerGroup":"lvm.topolvm.io","controllerKind":"LVMVolumeGroup","LVMVolumeGroup":{"name":"vg1","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"vg1","reconcileID":"aa2f3e23-4adc-46c7-87cc-bdb671fa9a42","error":"could not get LVMVolumeGroupNodeStatus: LVMVolumeGroupNodeStatus.lvm.topolvm.io \"mno-worker-0.5g-deployment.lab\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
      ...
      

      All YAML files for the reproducer can be found in https://gist.github.com/jclaret/306736615fddb6f4e06e8e5e5af3ec8c
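
      For orientation, the LVMCluster used in this kind of setup has roughly the following shape. This is a hedged sketch, not the exact manifest from the gist; the device class name matches the vg1 volume group referenced in the vg-manager logs, and the metadata name is a placeholder:

      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMCluster
      metadata:
        name: my-lvmcluster            # placeholder name
        namespace: openshift-storage
      spec:
        storage:
          deviceClasses:
            - name: vg1                # matches the volume group in the error above
              default: true
              thinPoolConfig:
                name: thin-pool-1
                sizePercent: 90
                overprovisionRatio: 10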

      Actual results:
      - vg-manager pod on the new worker node (vg-manager-xxxx) stays Running (0/1)
      - the LVMVolumeGroupNodeStatus for the new worker node is not created

      Expected results:

      - vg-manager pod on the new worker node (vg-manager-xxxx) reaches Running (1/1)
      - the LVMVolumeGroupNodeStatus for the new worker node is created

      Workarounds tested:

      - Restarting the lvms-operator-xxx pod forces it to reprocess and detect the new worker
      - Create the LVMCluster with LVMCluster.spec.tolerations
      
      Example:
      
      spec:
        tolerations:
          - key: node.kubernetes.io/not-ready
            effect: NoExecute
      
      
      - Create the LVMCluster with a custom node label in its nodeSelector (instead of "node-role.kubernetes.io/worker") and apply that label manually to the nodes (see the examples below)
      
      Example: 
      
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: LVMOperator
            operator: In
            values:
            - "true"
      

      MustGather collected:

      • Logs from customer cluster: Dell-SNOP1-logs.tar.xz
      • Logs from internal lab: must-gather_internal_lab.tar.gz

      Find both in https://drive.google.com/drive/folders/15WbB341JUMBaI8IZQUZ8xd3_gNyqT2vk?usp=drive_link

              Suleyman Akbas (sakbas@redhat.com)
              Jorge Claret Membrado (rhn-support-jclaretm)
              Minal Pradeep Makwana