OpenShift Bugs / OCPBUGS-23099

CSI connection is closed after ~30 minutes



      Description of problem:

      This is a continuation of OCPBUGS-23062.

      With https://github.com/openshift/csi-external-provisioner/pull/77 merged, we observed that the gRPC connection between the external-provisioner and the CSI driver closes after ~30 minutes. On the next provisioning attempt the provisioner detects that the connection is closed and dies with "Lost connection to CSI driver, exiting". Volume provisioning then has to wait for a new leader to be elected, and our e2e tests time out.
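
      For context on why the whole process dies: the sidecar talks to the driver socket through the csi-lib-utils connection package, whose connection monitor by default calls ExitOnConnectionLoss() when the connection drops; that is the connection.go fatal message quoted further down. A rough sketch of that dial in Go (the exact Connect signature depends on the csi-lib-utils version):

      package main

      import (
          "k8s.io/klog/v2"

          "github.com/kubernetes-csi/csi-lib-utils/connection"
          "github.com/kubernetes-csi/csi-lib-utils/metrics"
      )

      func main() {
          // NewCSIMetricsManager takes the driver name; sidecars typically pass ""
          // and fill it in after the first GetPluginInfo call.
          metricsManager := metrics.NewCSIMetricsManager("")

          // Connect() starts a goroutine that watches the gRPC connection. Its
          // default on-connection-loss callback is ExitOnConnectionLoss(), which
          // logs "Lost connection to CSI driver, exiting" and terminates the
          // process -- the F1109 connection.go line in the logs below.
          conn, err := connection.Connect(
              "unix:///var/lib/csi/sockets/pluginproxy/csi.sock",
              metricsManager,
              connection.OnConnectionLoss(connection.ExitOnConnectionLoss()),
          )
          if err != nil {
              klog.Fatalf("failed to connect to the CSI driver: %v", err)
          }
          defer conn.Close()
      }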

      With #77 merged and GRPC_GO_LOG_VERBOSITY_LEVEL=99 and GRPC_GO_LOG_SEVERITY_LEVEL=info set in the external-provisioner sidecar, I can see the following after a successful Probe and ControllerGetCapabilities, followed by a ~30 minute wait:

      I1109 10:05:58.861712       1 leaderelection.go:260] successfully acquired lease openshift-cluster-csi-drivers/ebs-csi-aws-com
      I1109 10:05:58.861842       1 leader_election.go:178] became leader, starting
      I1109 10:05:58.962821       1 controller.go:811] Starting provisioner controller ebs.csi.aws.com_aws-ebs-csi-driver-controller-7689c994fb-d68d5_e450196b-2b73-4b20-bb50-f46ef48d1ab2!
      I1109 10:05:58.962896       1 volume_store.go:97] Starting save volume queue
      I1109 10:05:59.064012       1 controller.go:860] Started provisioner controller ebs.csi.aws.com_aws-ebs-csi-driver-controller-7689c994fb-d68d5_e450196b-2b73-4b20-bb50-f46ef48d1ab2!
      2023/11/09 10:33:41 INFO: [core] [Channel #4] Closing the name resolver
      2023/11/09 10:33:41 INFO: [core] [Channel #4] ccBalancerWrapper: entering idle mode
      2023/11/09 10:33:41 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
      2023/11/09 10:33:41 INFO: [core] [Channel #4] Channel entering idle mode
      2023/11/09 10:33:41 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to SHUTDOWN
      2023/11/09 10:33:41 INFO: [core] [Channel #4 SubChannel #5] Subchannel deleted
      2023/11/09 10:33:41 INFO: [transport] [client-transport 0xc0000f1440] Closing: grpc: the connection is closing due to channel idleness
      2023/11/09 10:33:41 INFO: [transport] [client-transport 0xc0000f1440] loopyWriter exiting with error: transport closed by client
       

      The subsequent provisioning attempt then makes the provisioner exit:

      I1109 11:02:52.817054       1 controller.go:1366] provision "default/myclaim" class "gp3-csi": started
      ...
      2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel created
      2023/11/09 11:02:52 INFO: [core] [Channel #4] Channel Connectivity change to CONNECTING
      2023/11/09 11:02:52 INFO: [core] [Channel #4] Channel exiting idle mode
      2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel Connectivity change to CONNECTING
      2023/11/09 11:02:52 INFO: [core] [Channel #4 SubChannel #7] Subchannel picks a new address "/var/lib/csi/sockets/pluginproxy/csi.sock" to connect
      2023/11/09 11:02:52 INFO: [core] [pick-first-lb 0xc000d2d6b0] Received SubConn state update: 0xc000d2d740, {ConnectivityState:CONNECTING ConnectionError:<nil>}
      I1109 11:02:52.817650       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"myclaim", UID:"e3cc5c86-97ca-484a-8259-8f30579ee689", APIVersion:"v1", ResourceVersion:"54539", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/myclaim"
      E1109 11:02:52.817712       1 connection.go:142] Lost connection to unix:///var/lib/csi/sockets/pluginproxy/csi.sock.
      F1109 11:02:52.817811       1 connection.go:97] Lost connection to CSI driver, exiting

      We need to figure out how to apply #77 (i.e. bump gRPC from 1.57 to 1.59) without the provisioner exiting after 30 minutes of idleness.
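
      One possible direction (an assumption on my side, not a confirmed fix): grpc-go 1.58+ enables channel idleness with a 30-minute default, and the grpc.WithIdleTimeout dial option overrides it; a value of 0 disables the idle timer completely. A minimal sketch of dialing the driver socket that way, leaving open whether the option should be plumbed through csi-lib-utils or set by the sidecars themselves:

      package main

      import (
          "google.golang.org/grpc"
          "google.golang.org/grpc/credentials/insecure"
          "k8s.io/klog/v2"
      )

      func main() {
          // grpc-go 1.58+ closes the transport after 30 minutes without RPCs
          // ("channel idleness"); the csi-lib-utils connection monitor then
          // reports it as a lost connection and the sidecar exits.
          // WithIdleTimeout(0) disables that timer.
          conn, err := grpc.Dial(
              "unix:///var/lib/csi/sockets/pluginproxy/csi.sock",
              grpc.WithTransportCredentials(insecure.NewCredentials()),
              grpc.WithIdleTimeout(0),
          )
          if err != nil {
              klog.Fatalf("failed to dial the CSI socket: %v", err)
          }
          defer conn.Close()
      }

      Alternatively, the connection monitor in csi-lib-utils could tolerate an idle channel and let gRPC reconnect on the next RPC instead of exiting; either way, the 30-minute idle shutdown should not kill the current leader.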

      Version-Release number of selected component (if applicable):

      4.15 nightly + #77 merged

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install a cluster
      2. Wait 30 minutes
      3. Create AWS EBS PVC + Pod

      Actual results:

      The AWS EBS CSI driver controller pod shows a container restart:

      # oc -n openshift-cluster-csi-drivers get pod
      NAME                                             READY   STATUS    RESTARTS
      aws-ebs-csi-driver-controller-7689c994fb-d68d5   11/11   Running   1 (12s ago)

      Expected results:

      No restarts

       

              Assignee: Jan Safranek (rhn-engineering-jsafrane)
              Reporter: Jan Safranek (rhn-engineering-jsafrane)
              QA Contact: Wei Duan