Loading...

XML

Word

Printable

Type: Bug
Resolution: Can't Do
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.12
Component/s: Storage
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Story Points:
None
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Rejected
Sprint:
Storage Sprint 228
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Install cluster against pre-merged payload with PR https://github.com/openshift/console/pull/12068, update vsphere platform parameters through static vsphere connection plugin from console dashboard, it will generate new mc and trigger node reboot. Then find that control node is stuck in status "Ready,SchedulingDisabled".

Checked from machine-config-contorller node, failed to drain node, because pod vmware-vsphere-csi-driver-controller could not be evicted as it would violate the pod's PDB.

$ oc logs machine-config-controller-bdcdfc88d-p5bbq  -n openshift-machine-config-operator -c machine-config-controller
...
I1014 08:02:33.137138       1 drain_controller.go:110] evicting pod openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-756554dfb6-kzvhh
E1014 08:02:33.147025       1 drain_controller.go:110] error when evicting pods/"vmware-vsphere-csi-driver-controller-756554dfb6-kzvhh" -n "openshift-cluster-csi-drivers" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1014 08:02:38.147737       1 drain_controller.go:110] evicting pod openshift-cluster-csi-drivers/vmware-vsphere-csi-driver-controller-756554dfb6-kzvhh
I1014 08:02:38.147903       1 drain_controller.go:139] node jimadummy01-bqv6h-control-plane-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"vmware-vsphere-csi-driver-controller-756554dfb6-kzvhh" -n "openshift-cluster-csi-drivers": global timeout reached: 1m30s

$ oc get nodes
NAME                                STATUS                     ROLES                  AGE     VERSION
jimadummy01-bqv6h-compute-0         Ready                      worker                 7h29m   v1.25.0+3ef6ef3
jimadummy01-bqv6h-compute-1         Ready                      worker                 7h29m   v1.25.0+3ef6ef3
jimadummy01-bqv6h-control-plane-0   Ready,SchedulingDisabled   control-plane,master   7h47m   v1.25.0+3ef6ef3
jimadummy01-bqv6h-control-plane-1   Ready                      control-plane,master   7h47m   v1.25.0+3ef6ef3
jimadummy01-bqv6h-control-plane-2   Ready                      control-plane,master   7h46m   v1.25.0+3ef6ef3 

$ oc get deployment.apps/vmware-vsphere-csi-driver-controller -n openshift-cluster-csi-drivers
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
vmware-vsphere-csi-driver-controller   0/2     2            0           95m

$ oc get pdb -n openshift-cluster-csi-drivers
NAME                                       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
vmware-vsphere-csi-driver-controller-pdb   N/A             1                 0                     94m
vmware-vsphere-csi-driver-webhook-pdb      N/A             1                 1                     94m

$ oc get pod -n openshift-cluster-csi-drivers | grep controller
vmware-vsphere-csi-driver-controller-756554dfb6-9nxnz   11/13   CrashLoopBackOff   27 (2m24s ago)   50m
vmware-vsphere-csi-driver-controller-756554dfb6-kzvhh   11/13   CrashLoopBackOff   27 (3m3s ago)    50m

In pod vmware-vsphere-csi-driver-controller log:

{"level":"error","time":"2022-10-14T08:22:15.11976544Z","caller":"service/driver.go:138","msg":"failed to init controller. Error: ServerFaultCode: Cannot complete login due to an incorrect user name or password.","TraceId":"fe46aef8-e7a4-4045-91b9-566a58e605ea","TraceId":"efd0a8fe-de31-4c01-aeec-b817bbb710b6","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/driver.go:138\nsigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/driver.go:151\nmain.main\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
{"level":"info","time":"2022-10-14T08:22:15.119814564Z","caller":"service/driver.go:104","msg":"Configured: \"csi.vsphere.vmware.com\" with clusterFlavor: \"VANILLA\" and mode: \"controller\"","TraceId":"fe46aef8-e7a4-4045-91b9-566a58e605ea","TraceId":"efd0a8fe-de31-4c01-aeec-b817bbb710b6"}
{"level":"error","time":"2022-10-14T08:22:15.119832373Z","caller":"service/driver.go:152","msg":"failed to run the driver. Err: +ServerFaultCode: Cannot complete login due to an incorrect user name or password.","TraceId":"fe46aef8-e7a4-4045-91b9-566a58e605ea","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v2/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/driver.go:152\nmain.main\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}

Once PR https://github.com/openshift/console/pull/12068 gets merged, user will be able to update vsphere platform parameters from console dashboard easily, and then updating with invalid credentials will break the cluster as this issue.

Version-Release number of selected component (if applicable):

4.12 pre-merged payload with PR console#12068

How reproducible:

Always if input invalid username and also update datacenter/datastore/folder... parameter.

Steps to Reproduce:

1. Install cluster against 4.12 pre-merged payload with PR console#12068
2. Update vsphere platform parameters through vsphere connection on console dashboard

Actual results:

one master node is stuck in status "Ready,SchedulingDisabled".

Expected results:

cluster is successful to apply new machine config.

Additional info:

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

image-2022-11-28-11-21-46-347.png
61 kB
2022/11/28 10:21 AM

is blocked by

RFE-1367 Forcefully remove unhealthy pods controlled by PDB during Machine Update

Waiting

Assignee:: Jan Safranek

Reporter:: Jinyun Ma

Need Info From:: None

Contributors:: None

QA Contact:: Wei Duan

Doc Contact:: None

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Created:: 2022/10/14 9:21 AM

Updated:: 2025/07/29 5:29 AM

Resolved:: 2022/11/28 12:12 PM

Details

Description

Attachments

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates