Uploaded image for project: 'OpenShift Virtualization'
  1. OpenShift Virtualization
  2. CNV-14184

[2005693] SSH connection to VM failed once in a while after migration

XMLWordPrintable

    • Moderate
    • No

      Description of problem:
      While testing VM migration we notice that sometime SSH connection to VM failed.
      Ping the VM right after migration from one of the nodes and see if we didn't witness packet loss.

      Version-Release number of selected component (if applicable):
      CNV 4.9

      Steps to Reproduce:
      1. Create migratable VM with OCS storage, create ssh service to VM
      2. Migrate VM
      3. Connect via SSH
      4. Pause, Un-pause VM
      5. Connect VM SSH

      Actual results:
      Once in while SSH connection failing, we see it a lot on the automation runs.

      Additional info:
      (from the mail thread)
      1.
      We did automation test to get statistics of the issue:
      Ran a loop of migrate vm + connect via ssh for an hour (after each migration perform 10 times ssh_vm-pause-unpause-ssh_vm):
      ---------------------------------------------------
      vm = golden_image_vm_object_from_template_multi_fedora_os_multi_storage_scope_class
      iter_pass = 0
      iter_fail = 0
      import time
      with open('test.log', 'w') as ff:
      while True:
      ff.write("----------------Migrate VM----------------\n")
      migrate_vm_and_verify(vm=vm, check_ssh_connectivity=True)
      for i in range(0,10):
      try:
      validate_pause_unpause_linux_vm(vm=vm, pre_pause_pid=ping_process_in_fedora_os)
      iter_pass += 1
      except Exception:
      ff.write("FAIL!!!\n")
      iter_fail += 1
      time.sleep(1)
      ff.write(f"PASSED:

      {iter_pass}

      \n")
      ff.write(f"FAILED:

      {iter_fail}

      \n")

      1) migrate_vm_and_verify migrates vm, checks if it succeeded and check ssh connection
      2) validate_pause_unpause_linux_vm connects via ssh and creates ping process, pause/unpause vm, ssh and check process id

      (counter is for validate_pause_unpause_linux_vm)
      The result is:
      PASSED: 396
      FAILED: 14

      Meaning:
      validate_pause_unpause_linux_vm rarely fails on first iteration ONLY (rest 9 succeeds)

      SSH failure: socket.timeout: 10.1.156.18: timeout(10.0)
      2.
      Ran loop of 400 SSH connection to VM --> all pass.

              phoracek@redhat.com Petr Horacek
              ipinto_rh Israel Pinto
              Yoss Segev Yoss Segev
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated:
                Resolved: