Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-573

[2106733] Machine Controller stuck with Terminated Instances while Provisioning on AWS

XMLWordPrintable

    • None
    • False
    • Hide

      None

      Show
      None

      This bug exists just to make CI robot happy, the previous bugs for 4.12, 4.11, 4.10 were in Bugzilla and are linked in description for the first version handled in JIRA.

      Description of problem:
      During a replacement of worker nodes, we noticed that the machine-controller container, which is deployed as part of the `openshift-machine-api` namespace, would panic when a machine OpenShift was still in "Provisioning" state, but the corresponding AWS instance was already "Terminated".

      ```
      I0628 10:09:02.518169 1 reconciler.go:123] my-super-worker-skghqwd23: deleting machine
      I0628 10:09:03.090641 1 reconciler.go:464] my-super-worker-skghqwd23: Found instance by id: i-11111111111111
      I0628 10:09:03.090662 1 reconciler.go:138] my-super-worker-skghqwd23: found 1 existing instances for machine
      I0628 10:09:03.090669 1 utils.go:231] Cleaning up extraneous instance for machine: i-11111111111111, state: running, launchTime: 2022-06-28 08:56:52 +0000 UTC
      I0628 10:09:03.090682 1 utils.go:235] Terminating i-05332b08d4cc3ab28 instance
      panic: assignment to entry in nil map

      goroutine 125 [running]:
      sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Reconciler).delete(0xc0012df980, 0xc0004bd530, 0x234c4c0)
      /go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/reconciler.go:165 +0x95b
      sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine.(*Actuator).Delete(0xc000a3a900, 0x25db9b8, 0xc0004bd530, 0xc000b9a000, 0x35e0100, 0x0)
      /go/src/sigs.k8s.io/cluster-api-provider-aws/pkg/actuators/machine/actuator.go:171 +0x365
      github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0007bc960, 0x25db9b8, 0xc0004bd530, 0xc0007c5fc8, 0x15, 0xc0005e4a80, 0x2a, 0xc0004bd530, 0xc000032000, 0x206d640, ...)
      /go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:231 +0x2352
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x1feb8e0, 0xc00009f460)
      /go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:298 +0x30d
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0003b20a0, 0x25db910, 0xc00087e040, 0x0)
      /go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253 +0x205
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000a38790, 0xc0003b20a0, 0x25db910, 0xc00087e040)
      /go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214 +0x6b
      created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
      /go/src/sigs.k8s.io/cluster-api-provider-aws/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x425
      ```

      What is the business impact? Please also provide timeframe information.
      We failed to recover from a major outage due to this bug.

      Where are you experiencing the behavior? What environment?
      Production and all envs.

      When does the behavior occur? Frequency? Repeatedly? At certain times?
      It appeared only once so far, but can appear in larger scaling scenarios.

      Version-Release number of selected component (if applicable):
      4.8.39

      Actual results:

      With the panicing machine-controller, no new instances could be provisioned, resulting in an unscalable cluster. The solution/workaround to the problem was to delete the offending Machines.
      Expected results:
      Make the cluster scaleable again without deleting manually.

      Additional info:

              rmanak@redhat.com Radek Manak
              rmanak@redhat.com Radek Manak
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Created:
                Updated:
                Resolved: