OpenShift Bugs / OCPBUGS-38818

az.EnsureHostInPool panic when Azure VM instance not found


    • Release Note Not Required
    • In Progress

      This behavior should not occur from 4.16 onward, because the line of code that produces the error was removed in the external cloud provider code, and 4.16 is the first release in which OCP defaults to the external cloud provider.

      Cloud infra QE team, can you please verify that this behavior is not reproducible in a 4.16 cluster? Once we confirm that, we can begin backporting the fix from 4.15 to 4.12.

      Original description of problem:

          On Azure, when kube-controller-manager checks whether a machine exists and the machine has already been deleted, the code may panic with a SIGSEGV (nil pointer dereference):
      
      I0320 12:02:55.806321       1 azure_backoff.go:91] GetVirtualMachineWithRetry(worker-e32ads-westeurope2-f72dr): backoff success
      I0320 12:02:56.028287       1 azure_wrap.go:201] Virtual machine "worker-e16as-westeurope1-hpz2t" is under deleting
      I0320 12:02:56.028328       1 azure_standard.go:752] GetPrimaryInterface(worker-e16as-westeurope1-hpz2t, ) abort backoff
      E0320 12:02:56.028334       1 azure_standard.go:825] error: az.EnsureHostInPool(worker-e16as-westeurope1-hpz2t), az.VMSet.GetPrimaryInterface.Get(worker-e16as-westeurope1-hpz2t, ), err=instance not found
      panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x33d21f6]

      goroutine 240642 [running]:
      k8s.io/legacy-cloud-providers/azure.(*availabilitySet).EnsureHostInPool(0xc000016580, 0xc0262fb400, {0xc02d8a5080, 0x32}, {0xc021c1bc70, 0xc4}, {0x0, 0x0}, 0xa8?)
              vendor/k8s.io/legacy-cloud-providers/azure/azure_standard.go:831 +0x4f6
      k8s.io/legacy-cloud-providers/azure.(*availabilitySet).EnsureHostsInPool.func2()
              vendor/k8s.io/legacy-cloud-providers/azure/azure_standard.go:928 +0x5f
      k8s.io/apimachinery/pkg/util/errors.AggregateGoroutines.func1(0xc0159d0788?)
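
The log above shows the "instance not found" error being reported at azure_standard.go:825 and the panic following at azure_standard.go:831, which suggests the code carried on and used the lookup result after the lookup had failed. The Go sketch below illustrates that general failure mode and one way to guard against it. It is not the actual legacy-cloud-providers code: the names `nic`, `lookupPrimaryNIC`, `ensureHostInPool`, and `errInstanceNotFound` are hypothetical stand-ins, and the choice to skip a deleted VM without an error is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound is a hypothetical stand-in for the provider's
// "instance not found" error.
var errInstanceNotFound = errors.New("instance not found")

// nic is a hypothetical, simplified network interface record.
type nic struct {
	ProvisioningState string
}

// lookupPrimaryNIC is a hypothetical stand-in for the primary interface
// lookup. For a VM that no longer exists it returns a nil *nic together
// with errInstanceNotFound.
func lookupPrimaryNIC(existing map[string]*nic, vmName string) (*nic, error) {
	n, ok := existing[vmName]
	if !ok {
		return nil, errInstanceNotFound
	}
	return n, nil
}

// ensureHostInPool reports whether the VM was processed. The important
// part is that every error branch returns before n is dereferenced.
func ensureHostInPool(existing map[string]*nic, vmName string) (bool, error) {
	n, err := lookupPrimaryNIC(existing, vmName)
	if err != nil {
		if errors.Is(err, errInstanceNotFound) {
			// Assumption: a VM that is already gone is skipped, not failed.
			// Returning here avoids touching n, which is nil on this path.
			return false, nil
		}
		return false, fmt.Errorf("lookup primary NIC for %s: %w", vmName, err)
	}

	// Safe: n is non-nil whenever err is nil.
	if n.ProvisioningState == "Failed" {
		return false, fmt.Errorf("NIC of %s is in a failed state", vmName)
	}
	return true, nil
}

func main() {
	vms := map[string]*nic{"worker-a": {ProvisioningState: "Succeeded"}}
	for _, name := range []string{"worker-a", "worker-deleted"} {
		processed, err := ensureHostInPool(vms, name)
		fmt.Printf("%s: processed=%t err=%v\n", name, processed, err)
	}
}
```

If the early return in the `errInstanceNotFound` branch were dropped, the later read of `n.ProvisioningState` would dereference a nil pointer, which is the kind of crash reported in the trace.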
      

      Version-Release number of selected component (if applicable):

          4.12.48
      

      (ships https://github.com/openshift/kubernetes/commit/6df21776c7879727ab53895df8a03e53fb725d74)
      issue introduced by https://github.com/kubernetes/kubernetes/pull/111428/files#diff-0414c3aba906b2c0cdb2f09da32bd45c6bf1df71cbb2fc55950743c99a4a5fe4

      How reproducible:

          Unable to reproduce on demand; the panic happens occasionally.
      

      Steps to Reproduce:

          Not available; the issue could not be reproduced on demand.
          

      Actual results:

          panic

      Expected results:

          no panic

      Additional info:

          internal case 03772590

            rh-ee-nbrubake Nolan Brubaker
            rh-ee-nbrubake Nolan Brubaker
            Zhaohua Sun Zhaohua Sun