  1. OpenShift Bugs
  2. OCPBUGS-77255

SSH server is degraded on bootstrap node during install

    • Bug
    • Resolution: Unresolved
    • Undefined
    • 4.22.0
    • RHCOS
    • False
    • Moderate

      Description of problem:

          When installing an OCP cluster, it is currently impossible to SSH into the bootstrap node. Every attempt fails with the following error: "kex_exchange_identification: read: Connection reset by peer Connection reset by 51.224.118.32 port 22."

          Gathering bootstrap logs, which also goes over SSH, hits the same issue and fails with a similar error: msg="Failed to gather bootstrap logs: failed to create SSH client: ssh: handshake failed: read tcp 172.24.247.41:35178->20.57.194.6:22: read: connection reset by peer". The resulting log bundle is empty.

      Version-Release number of selected component (if applicable):

          4.22.0 (observed on AWS and Azure, possibly other platforms as well)

      How reproducible:

          Always    

      Steps to Reproduce:

          1. Use the openshift-install binary from any 4.22 nightly payload
          2. Start the cluster install and wait until it reaches the "waiting for bootstrap to complete" phase
          3. Try to SSH into the bootstrap node using its public IP/DNS, or run openshift-install gather bootstrap
          4. Observe the connection reset error
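Step 3 can also be approximated without an SSH client. The following is a minimal sketch, not part of the installer: it opens a TCP connection to port 22 and waits for the server's identification line, which is the stage at which "kex_exchange_identification" fails. The function name `probe_ssh_banner` and its return values are hypothetical and chosen only for illustration.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Return (status, detail) for a single TCP connection to an SSH port.

    status is "ok" when the server sends an identification line starting
    with "SSH-", "reset" when the peer resets or closes the connection
    before sending one, and "error" for anything else (timeout, refused,
    name resolution failure, unexpected data).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # A healthy sshd sends its identification line right after accept.
            data = sock.recv(256)
    except ConnectionResetError as exc:
        return "reset", str(exc)
    except OSError as exc:
        # Covers timeouts, refused connections and DNS errors.
        return "error", str(exc)

    if not data:
        return "reset", "connection closed before identification string"

    line = data.split(b"\n", 1)[0].strip().decode("ascii", errors="replace")
    if line.startswith("SSH-"):
        return "ok", line
    return "error", f"unexpected data: {line!r}"
```

Calling `probe_ssh_banner("<bootstrap public IP>")` during the "waiting for bootstrap to complete" phase would be expected to return a "reset" status if the node shows the behavior described in this report, and "ok" with the server's version string once sshd is healthy again.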

      Actual results:

          Both the SSH connection and the bootstrap log gathering fail with "connection reset by peer".

      Expected results:

          SSH into the bootstrap node should succeed, and gathering bootstrap logs should produce a non-empty log bundle.

      Additional info:

      We observe the issue in installer presubmit jobs and in the blocking jobs of the release payload. See the examples below for the install log (with the error) and the empty log bundle.

      While debugging, I noticed that SSH works fine once the bootstrap node is rebooted (which essentially restarts the sshd process). The sshd journal log since the last boot shows:

      Feb 24 18:19:34 i-0128b33b58166aa6a sshd[11661]: -R not supported here 

      This probably causes the SSH server to degrade, but I do not know why it has started happening now.
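To check other bootstrap nodes for the same message, the journal output can be scanned programmatically. The sketch below assumes the default journalctl "short" line layout seen in the entry above (timestamp, host, unit[pid], message), for example as produced by `journalctl -b -u sshd`. The helper name `find_sshd_messages` and the regular expression are illustrative assumptions, not an existing tool.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS host unit[pid]: message"
JOURNAL_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<unit>[^\s\[]+)\[(?P<pid>\d+)\]: "
    r"(?P<message>.*)$"
)


def find_sshd_messages(lines, needle="-R not supported here"):
    """Return parsed sshd journal entries whose message contains `needle`."""
    hits = []
    for raw in lines:
        match = JOURNAL_LINE.match(raw.strip())
        if not match:
            continue  # Skip lines that do not follow the assumed layout.
        entry = match.groupdict()
        # Assumption: the relevant process name starts with "sshd".
        if entry["unit"].startswith("sshd") and needle in entry["message"]:
            hits.append(entry)
    return hits
```

Feeding it the journal line quoted above yields one entry with the host, PID, and message split into separate fields, which makes it easier to compare timestamps across several failed installs.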

      Note: This problem does not cause the install to fail. However, it breaks the ability to gather bootstrap logs or to SSH into the bootstrap node for debugging.

              Unassigned
              rh-ee-thvo Thuan Vo
              Tiago Bueno