This behavior should not be present from 4.16 onward, because the line of code that produces the error was removed in the external cloud provider code. 4.16 is the release where OCP begins defaulting to the external cloud provider code.
Cloud infra QE team, can you please verify that this behavior is not reproducible in a 4.16 cluster? Once we confirm that, we can begin backporting the fix to releases 4.15 through 4.12.
Original description of problem:
On Azure, when kube-controller-manager verifies whether a machine exists and the machine has already been deleted, the code may panic with SIGSEGV:
I0320 12:02:55.806321 1 azure_backoff.go:91] GetVirtualMachineWithRetry(worker-e32ads-westeurope2-f72dr): backoff success
I0320 12:02:56.028287 1 azure_wrap.go:201] Virtual machine "worker-e16as-westeurope1-hpz2t" is under deleting
I0320 12:02:56.028328 1 azure_standard.go:752] GetPrimaryInterface(worker-e16as-westeurope1-hpz2t, ) abort backoff
E0320 12:02:56.028334 1 azure_standard.go:825] error: az.EnsureHostInPool(worker-e16as-westeurope1-hpz2t), az.VMSet.GetPrimaryInterface.Get(worker-e16as-westeurope1-hpz2t, ), err=instance not found
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0x33d21f6]
goroutine 240642 [running]:
k8s.io/legacy-cloud-providers/azure.(*availabilitySet).EnsureHostInPool(0xc000016580, 0xc0262fb400, {0xc02d8a5080, 0x32}, {0xc021c1bc70, 0xc4}, {0x0, 0x0}, 0xa8?)
	vendor/k8s.io/legacy-cloud-providers/azure/azure_standard.go:831 +0x4f6
k8s.io/legacy-cloud-providers/azure.(*availabilitySet).EnsureHostsInPool.func2()
	vendor/k8s.io/legacy-cloud-providers/azure/azure_standard.go:928 +0x5f
k8s.io/apimachinery/pkg/util/errors.AggregateGoroutines.func1(0xc0159d0788?)
Version-Release number of selected component (if applicable):
4.12.48
(ships https://github.com/openshift/kubernetes/commit/6df21776c7879727ab53895df8a03e53fb725d74)
issue introduced by https://github.com/kubernetes/kubernetes/pull/111428/files#diff-0414c3aba906b2c0cdb2f09da32bd45c6bf1df71cbb2fc55950743c99a4a5fe4
How reproducible:
Unable to reproduce on demand; the panic happens occasionally.
Steps to Reproduce:
1. 2. 3.
Actual results:
panic
Expected results:
no panic
Additional info:
internal case 03772590
- blocks: OCPBUGS-31467 az.EnsureHostInPool panic when Azure VM instance not found (Closed)
- links to: RHBA-2024:6004 OpenShift Container Platform 4.16.z bug fix update