OCPBUGS-33129

Panic when we remove an OCL infra MCP and we try to create new ones with different names

    • Moderate
    • MCO Sprint 254, MCO Sprint 255
    • 2
    • Approved
    • * Previously, a potential panic in the Machine Config Controller and Machine Build Controller resulted from dereferencing an accidentally deleted MachineOSConfig or MachineOSBuild object to read its build status. The panic is now controlled with additional error conditions that warn about disallowed MachineOSConfig deletions. (link:https://issues.redhat.com/browse/OCPBUGS-33129[*OCPBUGS-33129*])
    • Bug Fix
    • Done

      Description of problem:

      If we create a new pool and enable OCB on it, then remove both the pool and the MachineOSConfig resource, and afterwards create another new pool to enable OCB again, the controller pod panics.
          

      Version-Release number of selected component (if applicable):

      pre-merge https://github.com/openshift/machine-config-operator/pull/4327
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Create a new infra MCP
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: infra
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra: ""
      
      
          2. Create a MachineOSConfig for infra pool
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1alpha1
      kind: MachineOSConfig
      metadata:
        name: infra
      spec:
        machineConfigPool:
          name: infra
        buildInputs:
          imageBuilder:
            imageBuilderType: PodImageBuilder
          baseImagePullSecret:
            name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
          renderedImagePushSecret:
            name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
          renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
      EOF
      
      
          3. When the build is finished (see the watch commands sketched after these steps), remove the MachineOSConfig and the pool
      
      oc delete machineosconfig infra
      oc delete mcp infra
      
          4. Create a new infra1 pool
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: infra1
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra1]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra1: ""
      
          5. Create a new MachineOSConfig for the infra1 pool
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1alpha1
      kind: MachineOSConfig
      metadata:
        name: infra1
      spec:
        machineConfigPool:
          name: infra1
        buildInputs:
          imageBuilder:
            imageBuilderType: PodImageBuilder
          baseImagePullSecret:
            name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
          renderedImagePushSecret:
            name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
          renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
          containerFile:
          - containerfileArch: noarch
            content: |-
              RUN echo 'test image' > /etc/test-image.file
      EOF
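
      While running the steps above, the following commands can be used to watch the build and the controller (an illustrative sketch, not part of the original steps; the label selector and pod names are assumptions that may need adjusting for your cluster):

      # Check the MachineOSBuild resources and the build pod created for the pool
      oc get machineosbuild
      oc get pods -n openshift-machine-config-operator | grep -i build

      # Follow the machine-config-controller logs to catch the panic when it happens;
      # after a crash, add --previous to read the logs of the crashed container
      oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller -f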
      
      
      
          

      Actual results:

      The MCO controller pod panics (in updateMachineOSBuild):
      
      E0430 11:21:03.779078       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
      goroutine 265 [running]:
      k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
      k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00035e000?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
      panic({0x3547bc0?, 0x53ebb20?})
      	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
      github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?)
      	<autogenerated>:1 +0x9
      k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25
      k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74
      k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e
      k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc0007097a0, 0x0, 0x0?)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateMachineOSBuild(0xc0007097a0, {0xc001c37800?, 0xc000029678?}, {0x3904000?, 0xc0028361a0})
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:395 +0xd1
      k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246
      k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970 +0xea
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e5738?, {0x3de6020, 0xc0008fe780}, 0x1, 0xc0000ac720)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
      k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x6974616761706f72?, 0x3b9aca00, 0x0, 0x69?, 0xc0005e5788?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
      k8s.io/apimachinery/pkg/util/wait.Until(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
      k8s.io/client-go/tools/cache.(*processorListener).run(0xc000b97c20)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69
      k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
      created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 248
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
      panic: runtime error: invalid memory address or nil pointer dereference [recovered]
      	panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]
      
      
      
      When the controller pod is restarted, it panics again, but in a different function (addMachineOSBuild):
      
      E0430 11:26:54.753689       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
      goroutine 97 [running]:
      k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
      k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x15555555aa?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
      panic({0x3547bc0?, 0x53ebb20?})
      	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
      github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?)
      	<autogenerated>:1 +0x9
      k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25
      k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74
      k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e
      k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc000899560, 0x0, 0x0?)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).addMachineOSBuild(0xc000899560, {0x3904000?, 0xc0006a8b60})
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:386 +0xc5
      k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:239
      k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x13e
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00066bf38?, {0x3de6020, 0xc0008f8b40}, 0x1, 0xc000c2ea20)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
      k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00066bf88?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
      k8s.io/apimachinery/pkg/util/wait.Until(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
      k8s.io/client-go/tools/cache.(*processorListener).run(0xc000ba6240)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69
      k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
      created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 43
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
      panic: runtime error: invalid memory address or nil pointer dereference [recovered]
      	panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]
      
      
      
      
      
          

      Expected results:

      No panic should happen; errors should be handled gracefully.
      
          

      Additional info:

          To recover from this panic, we need to manually delete the MachineOSBuild resources related to the pool that no longer exists.
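
          For example, something like the following can be used (a sketch; the concrete MachineOSBuild names depend on the pool that was deleted and must be checked in the cluster):

      # List the existing MachineOSBuild resources and delete the ones that
      # reference the pool that no longer exists
      oc get machineosbuild
      oc delete machineosbuild <name-of-the-build-for-the-deleted-pool>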



            All comments

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: OpenShift Container Platform 4.17.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:3718


            Sergio Regidor de la Rosa added a comment -

            The fix is present in release: 4.17.0-0.nightly-2024-06-13-010514

            We move the status to VERIFIED.

            OpenShift Jira Bot added a comment -

            Hi rh-ee-iqian,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            Sergio Regidor de la Rosa added a comment -

            Pre-merge verified in: https://github.com/openshift/machine-config-operator/pull/4396#issuecomment-2160388072

            Sergio Regidor de la Rosa added a comment - edited

            Note about the reproduce steps.

            The steps in this ticket's description are no longer valid for consistently reproducing the issue. When those steps were written, the MOSB resources were not garbage collected, and because of that we could trigger this panic with those steps.

            Nevertheless, we can still consistently reproduce this panic with these steps instead (a compact command sketch follows below):

            1. Create an infra MCP
            2. Create a MOSC resource pointing to the new infra pool
            3. Wait until the build pod is created and is running
            4. Delete the MOSC resource
            5. We can see the panic in the controller pod

            These steps are very similar to the behaviour that triggers the panic in the CI jobs: the TestOnClusterBuildsCustomPodBuilder e2e test times out because the build does not finish, and it removes the MOSC while the build is still running.
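
            A compact sketch of those steps (the MCP and MachineOSConfig definitions are assumed to be the same as the infra ones in the description; the grep pattern and label selector are assumptions):

            # Steps 1 and 2: create the infra MCP and the MachineOSConfig as shown in the description
            # Step 3: wait until the build pod is created and Running
            oc get pods -n openshift-machine-config-operator | grep -i build
            # Step 4: delete the MachineOSConfig while the build is still running
            oc delete machineosconfig infra
            # Step 5: the panic should now show up in the machine-config-controller logs
            oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller --previous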


            Sam Batschelet added a comment -

            This panic has been reproduced more generally in CI [1].

            [1] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4359/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview/1798704994061389824

            Pinned by David Eads

            David Eads added a comment -

            Updated for release blocker approved.  All panics are release blockers because they can result in the product being unable to progress.


            Sergio Regidor de la Rosa added a comment -

            I've seen this panic happening too when I create a MOSC resource for an infra pool and I simply remove it before the build has finished.
