
control-plane-machine-set operator pod stuck in CrashLoopBackOff state with a nil pointer dereference runtime error

      * Previously, the control plane machine set (CPMS) Operator did not correctly handle older {product-title} version configurations that had a vSphere definition in the infrastructure custom resource. This caused cluster upgrade operations to fail and the CPMS Operator to remain in a `CrashLoopBackOff` state. With this release, cluster upgrade operations no longer fail because of this issue. (link:https://issues.redhat.com/browse/OCPBUGS-32414[*OCPBUGS-32414*])
    • Bug Fix
    • Done

      Backport for 4.15 - Manually Cloned from https://issues.redhat.com/browse/OCPBUGS-31808

      Description of problem:

      The control-plane-machine-set operator pod is stuck in CrashLoopBackOff state with "panic: runtime error: invalid memory address or nil pointer dereference" while extracting the failureDomain from the controlplanemachineset. The error trace is below for reference.
      ~~~
      2024-04-04T09:32:23.594257072Z I0404 09:32:23.594176       1 controller.go:146]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="c282f3e3-9f9d-40df-a24e-417ba2ea4106"
      2024-04-04T09:32:23.594257072Z I0404 09:32:23.594221       1 controller.go:125]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="7f03c05f-2717-49e0-95f8-3e8b2ce2fc55"
      2024-04-04T09:32:23.594274974Z I0404 09:32:23.594257       1 controller.go:146]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachinesetgenerator" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="7f03c05f-2717-49e0-95f8-3e8b2ce2fc55"
      2024-04-04T09:32:23.597509741Z I0404 09:32:23.597426       1 watch_filters.go:179] reconcile triggered by infrastructure change
      2024-04-04T09:32:23.606311553Z I0404 09:32:23.606243       1 controller.go:220]  "msg"="Starting workers" "controller"="controlplanemachineset" "worker count"=1
      2024-04-04T09:32:23.606360950Z I0404 09:32:23.606340       1 controller.go:169]  "msg"="Reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
      2024-04-04T09:32:23.609322467Z I0404 09:32:23.609217       1 panic.go:884]  "msg"="Finished reconciling control plane machine set" "controller"="controlplanemachineset" "name"="cluster" "namespace"="openshift-machine-api" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
      2024-04-04T09:32:23.609322467Z I0404 09:32:23.609271       1 controller.go:115]  "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="controlplanemachineset" "reconcileID"="5dac54f4-57ab-419b-b258-79136ca8b400"
      2024-04-04T09:32:23.612540681Z panic: runtime error: invalid memory address or nil pointer dereference [recovered]
      2024-04-04T09:32:23.612540681Z     panic: runtime error: invalid memory address or nil pointer dereference
      2024-04-04T09:32:23.612540681Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1a5911c]
      2024-04-04T09:32:23.612540681Z 
      2024-04-04T09:32:23.612540681Z goroutine 255 [running]:
      2024-04-04T09:32:23.612540681Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
      2024-04-04T09:32:23.612571624Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
      2024-04-04T09:32:23.612571624Z panic({0x1c8ac60, 0x31c6ea0})
      2024-04-04T09:32:23.612571624Z     /usr/lib/golang/src/runtime/panic.go:884 +0x213
      2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.VSphereProviderConfig.ExtractFailureDomain(...)
      2024-04-04T09:32:23.612571624Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/vsphere.go:120
      2024-04-04T09:32:23.612571624Z github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig.providerConfig.ExtractFailureDomain({{0x1f2a71a, 0x7}, {{{{...}, {...}}, {{...}, {...}, {...}, {...}, {...}, {...}, ...}, ...}}, ...})
      2024-04-04T09:32:23.612588145Z     /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/pkg/machineproviders/providers/openshift/machine/v1beta1/providerconfig/providerconfig.go:212 +0x23c
      ~~~
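
The trace above shows the panic in `VSphereProviderConfig.ExtractFailureDomain` (vsphere.go:120), but not which pointer was nil. As a minimal sketch only, assuming the dereferenced value is an optional vSphere section that older configurations may leave unset, the Go snippet below shows how such a lookup can be guarded. All type, field, and function names here are hypothetical stand-ins and are not the operator's real API or its actual fix.

```go
package main

import "fmt"

// NOTE: every type below is a hypothetical stand-in for illustration;
// none of them are the operator's or OpenShift's real API types.

// vSphereFailureDomain is a simplified failure domain entry.
type vSphereFailureDomain struct {
	Name string
}

// vSpherePlatformSpec mimics an optional vSphere section of a config resource.
type vSpherePlatformSpec struct {
	FailureDomains []vSphereFailureDomain
}

// infrastructure mimics a resource whose vSphere section may be nil.
type infrastructure struct {
	VSphere *vSpherePlatformSpec
}

// extractFailureDomain looks up a failure domain by name. It checks every
// pointer before dereferencing it, so a missing section yields (zero, false)
// instead of a nil pointer dereference panic.
func extractFailureDomain(infra *infrastructure, name string) (vSphereFailureDomain, bool) {
	if infra == nil || infra.VSphere == nil {
		return vSphereFailureDomain{}, false
	}
	for _, fd := range infra.VSphere.FailureDomains {
		if fd.Name == name {
			return fd, true
		}
	}
	return vSphereFailureDomain{}, false
}

func main() {
	// A configuration with no vSphere section set (assumed scenario).
	legacy := &infrastructure{}
	if _, ok := extractFailureDomain(legacy, "fd-1"); !ok {
		fmt.Println("no vSphere failure domain found; skipping")
	}
}
```

Returning a boolean alongside the value lets the caller decide how to treat a missing failure domain, rather than crashing the reconciler.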
          

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      The control-plane-machine-set operator is stuck in CrashLoopBackOff state during the cluster upgrade.
          

      Expected results:

      The control-plane-machine-set operator should be upgraded without any errors.
          

      Additional info:

      This happens during the upgrade of a vSphere IPI cluster from OCP version 4.14.z to 4.15.6 and may affect other z-stream releases.
      According to the official docs [1], providing the failure domain for the vSphere platform is a Technology Preview feature.
      [1] https://docs.openshift.com/container-platform/4.15/machine_management/control_plane_machine_management/cpmso-configuration.html#cpmso-yaml-failure-domain-vsphere_cpmso-configuration
          

            rhn-support-ngirard Neil Girard
            rhn-support-nkashyap Nirupma Nirupma
            Huali Liu Huali Liu