OpenShift Bugs / OCPBUGS-23099

CSI connection is closed after ~30 minutes



      Description of problem:

      This is a continuation of OCPBUGS-23062.

      With https://github.com/openshift/csi-external-provisioner/pull/77 merged, we observed that the gRPC connection between the external-provisioner and the CSI driver closes after ~30 minutes. On the next provisioning attempt the provisioner detects that the connection is closed and dies with "Lost connection to CSI driver, exiting". Volume provisioning then has to wait for a new leader to be elected, and our e2e tests time out.
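
      For context on why the whole process dies: the sidecar talks to the driver socket through the csi-lib-utils connection package, whose connection monitor by default calls ExitOnConnectionLoss() when the connection drops; that is the connection.go fatal message quoted further down. A rough sketch of that dial in Go (the exact Connect signature depends on the csi-lib-utils version):

      package main

      import (
          "k8s.io/klog/v2"

          "github.com/kubernetes-csi/csi-lib-utils/connection"
          "github.com/kubernetes-csi/csi-lib-utils/metrics"
      )

      func main() {
          // NewCSIMetricsManager takes the driver name; sidecars typically pass ""
          // and fill it in after the first GetPluginInfo call.
          metricsManager := metrics.NewCSIMetricsManager("")

          // Connect() starts a goroutine that watches the gRPC connection. Its
          // default on-connection-loss callback is ExitOnConnectionLoss(), which
          // logs "Lost connection to CSI driver, exiting" and terminates the
          // process -- the F1109 connection.go line in the logs below.
          conn, err := connection.Connect(
              "unix:///var/lib/csi/sockets/pluginproxy/csi.sock",
              metricsManager,
              connection.OnConnectionLoss(connection.ExitOnConnectionLoss()),
          )
          if err != nil {
              klog.Fatalf("failed to connect to the CSI driver: %v", err)
          }
          defer conn.Close()
      }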

      With #77 merged and GRPC_GO_LOG_VERBOSITY_LEVEL=99 and GRPC_GO_LOG_SEVERITY_LEVEL=info set in the external-provisioner sidecar, I can see the following after a successful Probe and ControllerGetCapabilities, followed by a ~30 minute wait:

      I1109 10:05:58.861712       1 leaderelection.go:260] successfully acquired lease openshift-cluster-csi-drivers/ebs-csi-aws-com
      I1109 10:05:58.861842       1 leader_election.go:178] became leader, starting
      I1109 10:05:58.962821       1 controller.go:811] Starting provisioner controller ebs.csi.aws.com_aws-ebs-csi-driver-controller-7689c994fb-d68d5_e450196b-2b73-4b20-bb50-f46ef48d1ab2!
      I1109 10:05:58.962896       1 volume_store.go:97] Starting save volume queue
      I1109 10:05:59.064012       1 controller.go:860] Started provisioner controller ebs.csi.aws.com_aws-ebs-csi-driver-controller-7689c994fb-d68d5_e450196b-2b73-4b20-bb50-f46ef48d1ab2!
      2023/11/09 10:33:41 INFO: [core] [Channel #4] Closing the name resolver
      2023/11/09 10:33:41 INFO: [core] [Channel #4] ccBalancerWrapper: entering idle mode
      2023/11/09 10:33:41 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
      2023/11/09 10:33:41 INFO: [core] [Channel #4] Channel entering idle mode
      2023/11/09 10:33:41 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to SHUTDOWN
      2023/11/09 10:33:41 INFO: [core] [Channel #4 SubChannel #5] Subchannel deleted
      2023/11/09 10:33:41 INFO: [transport] [client-transport 0xc0000f1440] Closing: grpc: the connection is closing due to channel idleness
      2023/11/09 10:33:41 INFO: [transport] [client-transport 0xc0000f1440] loopyWriter exiting with error: transport closed by client
       

      The subsequent provisioning attempt then makes the provisioner exit:

      I1109 11:02:52.817054       1 controller.go:1366] provision "default/myclaim" class "gp3-csi": started
      ...
      2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel created
      2023/11/09 11:02:52 INFO: [core] [Channel #4] Channel Connectivity change to CONNECTING
      2023/11/09 11:02:52 INFO: [core] [Channel #4] Channel exiting idle mode
      2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel Connectivity change to CONNECTING
      2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel picks a new address "/var/lib/csi/sockets/pluginproxy/csi.sock" to connect
      2023/11/09 11:02:52 INFO: [core] [pick-first-lb 0xc000d2d6b0] Received SubConn state update: 0xc000d2d740, {ConnectivityState:CONNECTING ConnectionError:<nil>}
      I1109 11:02:52.817650       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"myclaim", UID:"e3cc5c86-97ca-484a-8259-8f30579ee689", APIVersion:"v1", ResourceVersion:"54539", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/myclaim"
      E1109 11:02:52.817712       1 connection.go:142] Lost connection to unix:///var/lib/csi/sockets/pluginproxy/csi.sock.
      F1109 11:02:52.817811       1 connection.go:97] Lost connection to CSI driver, exiting

      We need to figure out how to apply #77 (i.e. bump gRPC from 1.57 to 1.59) without the provisioner exiting after 30 minutes of idleness.
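
      One possible direction (an assumption on my side, not a confirmed fix): grpc-go 1.58+ enables channel idleness with a 30-minute default, and the grpc.WithIdleTimeout dial option overrides it; a value of 0 disables the idle timer completely. A minimal sketch of dialing the driver socket that way, leaving open whether the option should be plumbed through csi-lib-utils or set by the sidecars themselves:

      package main

      import (
          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          "k8s.io/klog/v2"
      )

      func main() {
          // grpc-go 1.58+ closes the transport after 30 minutes without RPCs
          // ("channel idleness"); the csi-lib-utils connection monitor then
          // reports it as a lost connection and the sidecar exits.
          // WithIdleTimeout(0) disables that timer.
          conn, err := grpc.Dial(
              "unix:///var/lib/csi/sockets/pluginproxy/csi.sock",
              grpc.WithTransportCredentials(insecure.NewCredentials()),
              grpc.WithIdleTimeout(0),
          )
          if err != nil {
              klog.Fatalf("failed to dial the CSI socket: %v", err)
          }
          defer conn.Close()
      }

      Alternatively, the connection monitor in csi-lib-utils could tolerate an idle channel and let gRPC reconnect on the next RPC instead of exiting; either way, the 30-minute idle shutdown should not kill the current leader.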

      Version-Release number of selected component (if applicable):

      4.15 nightly + #77 merged

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install a cluster
      2. Wait 30 minutes
      3. Create AWS EBS PVC + Pod

      Actual results:

      The AWS EBS CSI driver controller pod shows a container restart:

      # oc -n openshift-cluster-csi-drivers get pod
      NAME                                             READY   STATUS    RESTARTS
      aws-ebs-csi-driver-controller-7689c994fb-d68d5   11/11   Running   1 (12s ago)

      Expected results:

      No restarts

       

              Assignee: Jan Safranek (rhn-engineering-jsafrane)
              Reporter: Jan Safranek (rhn-engineering-jsafrane)
              QA Contact: Wei Duan