- Bug
- Resolution: Done
- Normal
- 4.15
- Quality / Stability / Reliability
- False
- Moderate
- No
Description of problem:
This is a continuation of OCPBUGS-23062.
With https://github.com/openshift/csi-external-provisioner/pull/77 merged, we observed that the gRPC connection between the external-provisioner and the CSI driver closes after ~30 minutes of inactivity. The next provisioning attempt then detects the closed connection and the provisioner dies with "Lost connection to CSI driver, exiting". Volume provisioning must therefore wait for a new leader to be elected, and our e2e tests time out.
With #77 merged and GRPC_GO_LOG_VERBOSITY_LEVEL=99 and GRPC_GO_LOG_SEVERITY_LEVEL=info set in the external-provisioner sidecar, I can see the following after a successful Probe and ControllerGetCapabilities and a ~30 minute wait:
I1109 10:05:58.861712 1 leaderelection.go:260] successfully acquired lease openshift-cluster-csi-drivers/ebs-csi-aws-com
I1109 10:05:58.861842 1 leader_election.go:178] became leader, starting
I1109 10:05:58.962821 1 controller.go:811] Starting provisioner controller ebs.csi.aws.com_aws-ebs-csi-driver-controller-7689c994fb-d68d5_e450196b-2b73-4b20-bb50-f46ef48d1ab2!
I1109 10:05:58.962896 1 volume_store.go:97] Starting save volume queue
I1109 10:05:59.064012 1 controller.go:860] Started provisioner controller ebs.csi.aws.com_aws-ebs-csi-driver-controller-7689c994fb-d68d5_e450196b-2b73-4b20-bb50-f46ef48d1ab2!
2023/11/09 10:33:41 INFO: [core] [Channel #4] Closing the name resolver
2023/11/09 10:33:41 INFO: [core] [Channel #4] ccBalancerWrapper: entering idle mode
2023/11/09 10:33:41 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
2023/11/09 10:33:41 INFO: [core] [Channel #4] Channel entering idle mode
2023/11/09 10:33:41 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to SHUTDOWN
2023/11/09 10:33:41 INFO: [core] [Channel #4 SubChannel #5] Subchannel deleted
2023/11/09 10:33:41 INFO: [transport] [client-transport 0xc0000f1440] Closing: grpc: the connection is closing due to channel idleness
2023/11/09 10:33:41 INFO: [transport] [client-transport 0xc0000f1440] loopyWriter exiting with error: transport closed by client
The subsequent provisioning then leads to a provisioner exit:
I1109 11:02:52.817054 1 controller.go:1366] provision "default/myclaim" class "gp3-csi": started
...
2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel created
2023/11/09 11:02:52 INFO: [core] [Channel #4] Channel Connectivity change to CONNECTING
2023/11/09 11:02:52 INFO: [core] [Channel #4] Channel exiting idle mode
2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel Connectivity change to CONNECTING
2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel picks a new address "/var/lib/csi/sockets/pluginproxy/csi.sock" to connect
2023/11/09 11:02:52 INFO: [core] [pick-first-lb 0xc000d2d6b0] Received SubConn state update: 0xc000d2d740, {ConnectivityState:CONNECTING ConnectionError:<nil>}
I1109 11:02:52.817650 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"myclaim", UID:"e3cc5c86-97ca-484a-8259-8f30579ee689", APIVersion:"v1", ResourceVersion:"54539", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/myclaim"
E1109 11:02:52.817712 1 connection.go:142] Lost connection to unix:///var/lib/csi/sockets/pluginproxy/csi.sock.
F1109 11:02:52.817811 1 connection.go:97] Lost connection to CSI driver, exiting
We need to figure out how to apply #77 (i.e. bump gRPC-Go from v1.57 to v1.59) without the provisioner exiting after 30 minutes of idleness.
Version-Release number of selected component (if applicable):
4.15 nightly + #77 merged
How reproducible:
Always
Steps to Reproduce:
- Install a cluster
- Wait 30 minutes
- Create AWS EBS PVC + Pod
Actual results:
AWS EBS controller pod observes a container restart:
# oc -n openshift-cluster-csi-drivers get pod
NAME                                             READY   STATUS    RESTARTS
aws-ebs-csi-driver-controller-7689c994fb-d68d5   11/11   Running   1 (12s ago)
Expected results:
No restarts