OpenShift Bugs / OCPBUGS-52468

[release-4.17] LVM Fails to add a new node in LVMVolumeGroupNodeStatus


    • Quality / Stability / Reliability
    • Important
    • OCPEDGE Sprint 269
    • Done
    • Bug Fix
    • Previously, LVMS failed to create the necessary resources on a newly added node in a long-running cluster, causing delays in volume creation. With this fix, LVMS creates all required resources as soon as the node becomes ready.

      Description of problem:

      When deploying a SNO + worker node in two separate steps, the vg-manager pod on the newly added worker node remains in a Running 0/1 state and does not create the expected LVMVolumeGroupNodeStatus resource for that node. This issue has been reproduced in both a customer environment and an internal lab.
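
      On a healthy node, vg-manager creates a per-node LVMVolumeGroupNodeStatus resource named after the node in the operator namespace. The sketch below is illustrative only (the apiVersion and spec layout are recalled from the LVMS CRDs and may differ between releases); the node name is the worker from this reproducer:

      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMVolumeGroupNodeStatus
      metadata:
        name: mno-worker-0.5g-deployment.lab   # named after the node
        namespace: openshift-storage
      spec:
        nodeStatus: []   # populated by vg-manager with per-VG status once the node is reconciled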
      

      Version-Release number of selected component (if applicable):

      lvms-operator.v4.18.0

      Steps to Reproduce:

      1. Deploy the initial SNO cluster
      2. Attach an extra disk to the master node
      3. Install the LVMS Operator
      4. Create the LVMCluster (a minimal sketch is included after the reproducer output below)
      5. Scale up the cluster and add an extra disk to the worker node
      6. Verify the pod status
      
      $ oc get pod -n openshift-storage
      NAME                             READY   STATUS    RESTARTS      AGE
      lvms-operator-85598566c6-tz9qw   1/1     Running   0             27m
      vg-manager-m25l6                 1/1     Running   1 (25m ago)   25m
      vg-manager-vsbvs                 0/1     Running   0             14m
      
      # oc logs vg-manager-8jxbr -f
      ...
      {"level":"error","ts":"2025-03-04T17:27:00Z","msg":"Reconciler error","controller":"lvmvolumegroup","controllerGroup":"lvm.topolvm.io","controllerKind":"LVMVolumeGroup","LVMVolumeGroup":{"name":"vg1","namespace":"openshift-storage"},"namespace":"openshift-storage","name":"vg1","reconcileID":"aa2f3e23-4adc-46c7-87cc-bdb671fa9a42","error":"could not get LVMVolumeGroupNodeStatus: LVMVolumeGroupNodeStatus.lvm.topolvm.io \"mno-worker-0.5g-deployment.lab\" not found","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/remote-source/app/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224"}
      ...
      

      All YAML files for the reproducer can be found in https://gist.github.com/jclaret/306736615fddb6f4e06e8e5e5af3ec8c
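
      For orientation, the LVMCluster used in this kind of setup has roughly the following shape. This is a hedged sketch, not the exact manifest from the gist; the device class name matches the vg1 volume group referenced in the vg-manager logs, and the metadata name is a placeholder:

      apiVersion: lvm.topolvm.io/v1alpha1
      kind: LVMCluster
      metadata:
        name: my-lvmcluster            # placeholder name
        namespace: openshift-storage
      spec:
        storage:
          deviceClasses:
            - name: vg1                # matches the volume group in the error above
              default: true
              thinPoolConfig:
                name: thin-pool-1
                sizePercent: 90
                overprovisionRatio: 10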

      Actual results:
      - vg-manager pod on the new worker node (vg-manager-xxxx) stays Running (0/1)
      - the LVMVolumeGroupNodeStatus for the new worker node is not created

      Expected results:

      - vg-manager pod on the new worker node (vg-manager-xxxx) reaches Running (1/1)
      - the LVMVolumeGroupNodeStatus for the new worker node is created

      Workarounds tested:

      - Restarting the lvms-operator-xxx pod forces it to reprocess and detect the new worker
      - Create the LVMCluster with LVMCluster.spec.tolerations
      
      Example:
      
      spec:
        tolerations:
          - key: node.kubernetes.io/not-ready
            effect: NoExecute
      
      
      - Create the LVMCluster with a custom node label in its nodeSelector (instead of "node-role.kubernetes.io/worker") and apply that label manually to the nodes (see the examples below)
      
      Example: 
      
      nodeSelector:
        nodeSelectorTerms:
        - matchExpressions:
          - key: LVMOperator
            operator: In
            values:
            - "true"
      

      MustGather collected:

      • Logs from customer cluster: Dell-SNOP1-logs.tar.xz
      • Logs from internal lab: must-gather_internal_lab.tar.gz

      Find both in https://drive.google.com/drive/folders/15WbB341JUMBaI8IZQUZ8xd3_gNyqT2vk?usp=drive_link

              Suleyman Akbas (sakbas@redhat.com)
              Jorge Claret Membrado (rhn-support-jclaretm)
              Minal Pradeep Makwana