OpenShift Bugs / OCPBUGS-31467

az.EnsureHostInPool panic when Azure VM instance not found

    • CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259
      * Previously, if a virtual machine (VM) was deleted and the network interface controller (NIC) still existed for that VM, the {azure-first} VM verification check failed. With this release, the verification check can now handle this situation by gracefully processing the issue without failing. (link:https://issues.redhat.com/browse/OCPBUGS-31467[*OCPBUGS-31467*])
    • Bug Fix
    • Done

      Description of problem:

          On Azure, when kube-controller-manager verifies whether a machine exists and the machine has already been deleted, the code may panic with SIGSEGV (nil pointer dereference):
      
      I0320 12:02:55.806321       1 azure_backoff.go:91] GetVirtualMachineWithRetry(worker-e32ads-westeurope2-f72dr): backoff success
      I0320 12:02:56.028287       1 azure_wrap.go:201] Virtual machine "worker-e16as-westeurope1-hpz2t" is under deleting
      I0320 12:02:56.028328       1 azure_standard.go:752] GetPrimaryInterface(worker-e16as-westeurope1-hpz2t, ) abort backoff
      E0320 12:02:56.028334       1 azure_standard.go:825] error: az.EnsureHostInPool(worker-e16as-westeurope1-hpz2t), az.VMSet.GetPrimaryInterface.Get(worker-e16as-westeurope1-hpz2t, ), err=instance not found
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x33d21f6]

      goroutine 240642 [running]:
      k8s.io/legacy-cloud-providers/azure.(*availabilitySet).EnsureHostInPool(0xc000016580, 0xc0262fb400, {0xc02d8a5080, 0x32}, {0xc021c1bc70, 0xc4}, {0x0, 0x0}, 0xa8?)
              vendor/k8s.io/legacy-cloud-providers/azure/azure_standard.go:831 +0x4f6
      k8s.io/legacy-cloud-providers/azure.(*availabilitySet).EnsureHostsInPool.func2()
              vendor/k8s.io/legacy-cloud-providers/azure/azure_standard.go:928 +0x5f
      k8s.io/apimachinery/pkg/util/errors.AggregateGoroutines.func1(0xc0159d0788?)
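
      The trace above shows the "instance not found" error logged at azure_standard.go:825 and the panic at line 831 of the same function, which suggests that execution continued after the failed lookup and dereferenced a nil result. The following is a minimal Go sketch of that pattern and of one possible guard. It is not the upstream code: the `nic` type, `lookupPrimaryNIC`, `ensureHostInPoolUnsafe`, `ensureHostInPoolSafe`, and `didPanic` are all hypothetical names, and treating "instance not found" as a skip rather than an error is an assumption based on the release note text.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical stand-in for the
// "instance not found" error seen in the log.
var errInstanceNotFound = errors.New("instance not found")

// nic is a hypothetical, reduced stand-in for a network interface object.
type nic struct {
	ProvisioningState *string
}

// lookupPrimaryNIC is a hypothetical lookup. It returns (nil, error)
// when the VM behind the NIC no longer exists.
func lookupPrimaryNIC(vmName string, vms map[string]*nic) (*nic, error) {
	n, ok := vms[vmName]
	if !ok {
		return nil, errInstanceNotFound
	}
	return n, nil
}

// ensureHostInPoolUnsafe logs the lookup error but keeps going,
// so it dereferences a nil *nic when the VM was deleted.
func ensureHostInPoolUnsafe(vmName string, vms map[string]*nic) (string, error) {
	n, err := lookupPrimaryNIC(vmName, vms)
	if err != nil {
		fmt.Printf("error: lookup(%s), err=%v\n", vmName, err)
		// Missing return: n is nil from here on.
	}
	if n.ProvisioningState != nil { // panics when n == nil
		return *n.ProvisioningState, nil
	}
	return "", nil
}

// ensureHostInPoolSafe returns before touching the result of a failed lookup.
// Assumption: a deleted VM is skipped instead of being reported as an error.
func ensureHostInPoolSafe(vmName string, vms map[string]*nic) (string, error) {
	n, err := lookupPrimaryNIC(vmName, vms)
	if err != nil {
		if errors.Is(err, errInstanceNotFound) {
			return "", nil // VM is gone: nothing to add to the pool
		}
		return "", err
	}
	if n == nil || n.ProvisioningState == nil {
		return "", nil
	}
	return *n.ProvisioningState, nil
}

// didPanic reports whether f panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	succeeded := "Succeeded"
	vms := map[string]*nic{"vm-present": {ProvisioningState: &succeeded}}

	fmt.Println("unsafe panics on deleted VM:",
		didPanic(func() { ensureHostInPoolUnsafe("vm-deleted", vms) }))

	state, err := ensureHostInPoolSafe("vm-deleted", vms)
	fmt.Printf("safe on deleted VM: state=%q err=%v\n", state, err)
}
```

      In this sketch the guard is the early return inside the `err != nil` branch. The extra `n == nil` check is defensive and only matters if a lookup can return a nil result without an error.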
      

      Version-Release number of selected component (if applicable):

          4.12.48
      

      (ships https://github.com/openshift/kubernetes/commit/6df21776c7879727ab53895df8a03e53fb725d74)
      issue introduced by https://github.com/kubernetes/kubernetes/pull/111428/files#diff-0414c3aba906b2c0cdb2f09da32bd45c6bf1df71cbb2fc55950743c99a4a5fe4

      How reproducible:

          Could not reproduce on demand; the panic happens occasionally.
      

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          kube-controller-manager panics with a nil pointer dereference (SIGSEGV) in az.EnsureHostInPool.

      Expected results:

          No panic; a deleted VM is handled without crashing kube-controller-manager.

      Additional info:

          internal case 03772590

              rh-ee-nbrubake Nolan Brubaker
              frigault Francois Rigault
              Zhaohua Sun Zhaohua Sun