Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: 4.8
Component/s: Cloud Compute / Cloud Controller Manager
Labels: None

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Important
Regression: None

Target Backport Versions: None
Target Version: 4.8.z
Release Blocker: None
Sprint: CLOUD Sprint 224
sprint_count: 1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status: None
Release Note Type: None
Release Note Text: None

Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

This is a clone of issue ~~OCPBUGS-572~~. The following is the description of the original issue:
—
This is a clone of Bug 2117557 to track backport to 4.9.z
+++ This bug was initially created as a clone of Bug #2108021 +++

+++ This bug was initially created as a clone of Bug #2106733 +++

Description of problem:
During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed in the `openshift-machine-api` namespace, would panic when an OpenShift Machine was still in the "Provisioning" state but the corresponding AWS instance was already "Terminated".

```
I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
panic: assignment to entry in nil map

goroutine 125 [running]:
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
```
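The panic message in the log above is Go's standard runtime error for writing to a map that was never initialized. The following is a minimal sketch of that behavior, not the provider's actual code: the `machine` type, the `tryAnnotate` helper, and the annotation key are hypothetical names used only for illustration.

```go
package main

import "fmt"

// machine is a hypothetical stand-in for an object whose annotations
// map is nil until something has been stored in it.
type machine struct {
	Annotations map[string]string
}

// tryAnnotate writes to the annotations map without a nil check and
// returns the panic message, or an empty string if the write succeeded.
func tryAnnotate(m *machine, key, value string) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	m.Annotations[key] = value
	return ""
}

func main() {
	m := &machine{} // Annotations is nil here

	// Reading from a nil map is safe and yields the zero value.
	fmt.Printf("read from nil map: %q\n", m.Annotations["example.com/key"])

	// Writing to a nil map panics at runtime.
	fmt.Printf("write to nil map: %q\n", tryAnnotate(m, "example.com/key", "value"))

	// Once the map has been initialized, the same write succeeds.
	m.Annotations = map[string]string{}
	fmt.Printf("write after init: %q\n", tryAnnotate(m, "example.com/key", "value"))
}
```

Because reads from a nil map succeed silently, code paths that only inspect annotations never hit this problem. It shows up only on the first write to an object that has no annotations yet.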

What is the business impact? Please also provide timeframe information.
We failed to recover from a major outage due to this bug.

Where are you experiencing the behavior? What environment?
Production and all other environments.

When does the behavior occur? Frequency? Repeatedly? At certain times?
It has appeared only once so far, but it could recur in larger scaling scenarios.

Version-Release number of selected component (if applicable):
4.8.39

Actual results:

With the machine-controller panicking, no new instances could be provisioned, resulting in an unscalable cluster. The workaround was to delete the offending Machines.

Expected results:
The cluster should become scalable again without Machines having to be deleted manually.

Additional info:

--- Additional comment from gferrazs@redhat.com on 2022-07-13 13:34:08 UTC ---

Probably the issue is here:

https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L416-L426 (the fields referenced are in the file linked below; probably duplicate those lines or move them here).

https://github.com/openshift/machine-api-provider-aws/blob/d701bcb720a12bd7d169d79699962c447a1f026d/pkg/actuators/machine/reconciler.go#L165
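
The suggestion above amounts to making sure the map exists before the write on the delete path. A minimal sketch of that guard follows. This issue does not show the actual patch, so the `machineMeta` type, the `setAnnotation` helper, and the annotation key are all hypothetical.

```go
package main

import "fmt"

// machineMeta is a hypothetical stand-in for the metadata of a Machine
// object; the real type is not reproduced in this issue.
type machineMeta struct {
	Annotations map[string]string
}

// setAnnotation initializes the annotations map on first use, so the
// write is safe whether or not annotations were already present.
func setAnnotation(m *machineMeta, key, value string) {
	if m.Annotations == nil {
		m.Annotations = make(map[string]string)
	}
	m.Annotations[key] = value
}

func main() {
	m := &machineMeta{} // no annotations yet, as in the failing case
	setAnnotation(m, "example.com/instance-state", "terminated")
	fmt.Println(m.Annotations["example.com/instance-state"])
}
```

Putting the nil check in one helper avoids repeating the same initialization at every call site that writes an annotation.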
--- Additional comment from skumari@redhat.com on 2022-07-13 15:22:47 UTC ---

Since the issue is in machine-api, moving it to the correct team.

--- Additional comment from rmanak@redhat.com on 2022-07-14 08:20:00 UTC ---

I am working on a fix for this.

--- Additional comment from aos-team-art-private@redhat.com on 2022-07-14 19:10:14 UTC ---

Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.

--- Additional comment from jspeed@redhat.com on 2022-07-18 15:18:16 UTC ---

Waiting for the first 4.11.z stream before we merge

--- Additional comment from jspeed@redhat.com on 2022-08-08 15:07:16 UTC ---

Waiting on 4.11 GA to move ahead here

clones: OCPBUGS-572 Machine Controller stuck with Terminated Instances while Provisioning on AWS (Closed)

is blocked by: OCPBUGS-572 Machine Controller stuck with Terminated Instances while Provisioning on AWS (Closed)

links to: openshift/cluster-api-provider-aws#447: [release-4.8] OCPBUGS-895: Fix panic when accessing nil machine annotations map