OCPBUGS-33129

Panic when we remove an OCL infra MCP and we try to create new ones with different names

    • Moderate
    • MCO Sprint 254, MCO Sprint 255
    • 2
    • Approved
    • * Previously, a potential panic in the Machine Config Controller and Machine Build Controller resulted from dereferencing an accidentally deleted MachineOSConfig or MachineOSBuild object to read its build status. The panic is now controlled with additional error conditions that warn about disallowed MachineOSConfig deletions. (link:https://issues.redhat.com/browse/OCPBUGS-33129[*OCPBUGS-33129*])
    • Bug Fix
    • Done

      Description of problem:

      If we create a new pool and enable OCB on it, then remove both the pool and the MachineOSConfig resource, and afterwards create another new pool to enable OCB again, the controller pod panics.
          

      Version-Release number of selected component (if applicable):

      pre-merge https://github.com/openshift/machine-config-operator/pull/4327
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Create a new infra MCP
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: infra
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra: ""
      
      
          2. Create a MachineOSConfig for infra pool
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1alpha1
      kind: MachineOSConfig
      metadata:
        name: infra
      spec:
        machineConfigPool:
          name: infra
        buildInputs:
          imageBuilder:
            imageBuilderType: PodImageBuilder
          baseImagePullSecret:
            name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
          renderedImagePushSecret:
            name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
          renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
      EOF
      
      
          3. When the build is finished (see the watch commands sketched after these steps), remove the MachineOSConfig and the pool
      
      oc delete machineosconfig infra
      oc delete mcp infra
      
          4. Create a new infra1 pool
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        name: infra1
      spec:
        machineConfigSelector:
          matchExpressions:
            - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra1]}
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/infra1: ""
      
          5. Create a new MachineOSConfig for the infra1 pool
      
      oc create -f - << EOF
      apiVersion: machineconfiguration.openshift.io/v1alpha1
      kind: MachineOSConfig
      metadata:
        name: infra1
      spec:
        machineConfigPool:
          name: infra1
        buildInputs:
          imageBuilder:
            imageBuilderType: PodImageBuilder
          baseImagePullSecret:
            name: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
          renderedImagePushSecret:
            name: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
          renderedImagePushspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image:latest"
          containerFile:
          - containerfileArch: noarch
            content: |-
              RUN echo 'test image' > /etc/test-image.file
      EOF
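
      While running the steps above, the following commands can be used to watch the build and the controller (an illustrative sketch, not part of the original steps; the label selector and pod names are assumptions that may need adjusting for your cluster):

      # Check the MachineOSBuild resources and the build pod created for the pool
      oc get machineosbuild
      oc get pods -n openshift-machine-config-operator | grep -i build

      # Follow the machine-config-controller logs to catch the panic when it happens;
      # after a crash, add --previous to read the logs of the crashed container
      oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller -f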
      
      
      
          

      Actual results:

      The MCO controller pod panics (in updateMachineOSBuild):
      
      E0430 11:21:03.779078       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
      goroutine 265 [running]:
      k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
      k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc00035e000?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
      panic({0x3547bc0?, 0x53ebb20?})
      	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
      github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?)
      	<autogenerated>:1 +0x9
      k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25
      k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74
      k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e
      k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc0007097a0, 0x0, 0x0?)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).updateMachineOSBuild(0xc0007097a0, {0xc001c37800?, 0xc000029678?}, {0x3904000?, 0xc0028361a0})
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:395 +0xd1
      k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:246
      k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:970 +0xea
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005e5738?, {0x3de6020, 0xc0008fe780}, 0x1, 0xc0000ac720)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
      k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x6974616761706f72?, 0x3b9aca00, 0x0, 0x69?, 0xc0005e5788?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
      k8s.io/apimachinery/pkg/util/wait.Until(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
      k8s.io/client-go/tools/cache.(*processorListener).run(0xc000b97c20)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69
      k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
      created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 248
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
      panic: runtime error: invalid memory address or nil pointer dereference [recovered]
      	panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]
      
      
      
      When the controller pod is restarted, it panics again, but in a different function (addMachineOSBuild):
      
      E0430 11:26:54.753689       1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
      goroutine 97 [running]:
      k8s.io/apimachinery/pkg/util/runtime.logPanic({0x3547bc0?, 0x53ebb20})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
      k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x15555555aa?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
      panic({0x3547bc0?, 0x53ebb20?})
      	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
      github.com/openshift/api/machineconfiguration/v1.(*MachineConfigPool).GetNamespace(0x53f6200?)
      	<autogenerated>:1 +0x9
      k8s.io/client-go/tools/cache.MetaObjectToName({0x3e2a8f8, 0x0})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:131 +0x25
      k8s.io/client-go/tools/cache.ObjectToName({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:126 +0x74
      k8s.io/client-go/tools/cache.MetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/store.go:112 +0x3e
      k8s.io/client-go/tools/cache.DeletionHandlingMetaNamespaceKeyFunc({0x3902740?, 0x0?})
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:336 +0x3b
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueAfter(0xc000899560, 0x0, 0x0?)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:761 +0x33
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).enqueueDefault(...)
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:772
      github.com/openshift/machine-config-operator/pkg/controller/node.(*Controller).addMachineOSBuild(0xc000899560, {0x3904000?, 0xc0006a8b60})
      	/go/src/github.com/openshift/machine-config-operator/pkg/controller/node/node_controller.go:386 +0xc5
      k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/controller.go:239
      k8s.io/client-go/tools/cache.(*processorListener).run.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:972 +0x13e
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
      k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00066bf38?, {0x3de6020, 0xc0008f8b40}, 0x1, 0xc000c2ea20)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
      k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc00066bf88?)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
      k8s.io/apimachinery/pkg/util/wait.Until(...)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
      k8s.io/client-go/tools/cache.(*processorListener).run(0xc000ba6240)
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/client-go/tools/cache/shared_informer.go:966 +0x69
      k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
      created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 43
      	/go/src/github.com/openshift/machine-config-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73
      panic: runtime error: invalid memory address or nil pointer dereference [recovered]
      	panic: runtime error: invalid memory address or nil pointer dereference
      [signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x210a6e9]
      
      
      
      
      
          

      Expected results:

      No panic should happen; errors should be handled gracefully.
      
          

      Additional info:

          To recover from this panic, we need to manually delete the MachineOSBuild resources related to the pool that no longer exists.
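
          For example, something like the following can be used (a sketch; the concrete MachineOSBuild names depend on the pool that was deleted and must be checked in the cluster):

      # List the existing MachineOSBuild resources and delete the ones that
      # reference the pool that no longer exists
      oc get machineosbuild
      oc delete machineosbuild <name-of-the-build-for-the-deleted-pool>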



            All comments

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: OpenShift Container Platform 4.17.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:3718


            Sergio Regidor de la Rosa added a comment -

            The fix is present in release: 4.17.0-0.nightly-2024-06-13-010514

            We move the status to VERIFIED.

            OpenShift Jira Bot added a comment -

            Hi rh-ee-iqian,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            Sergio Regidor de la Rosa added a comment -

            Pre-merge verified in: https://github.com/openshift/machine-config-operator/pull/4396#issuecomment-2160388072

            Sergio Regidor de la Rosa added a comment - edited

            Note about the reproduce steps.

            The steps in this ticket's description are no longer valid for consistently reproducing the issue. When those steps were written, the MOSB resources were not garbage collected, and because of that we could trigger this panic with those steps.

            Nevertheless, we can still consistently reproduce this panic with these steps instead (a compact command sketch follows below):

            1. Create an infra MCP
            2. Create a MOSC resource pointing to the new infra pool
            3. Wait until the build pod is created and is running
            4. Delete the MOSC resource
            5. We can see the panic in the controller pod

            These steps are very similar to the behaviour that triggers the panic in the CI jobs: the TestOnClusterBuildsCustomPodBuilder e2e test times out because the build does not finish, and it removes the MOSC while the build is still running.
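
            A compact sketch of those steps (the MCP and MachineOSConfig definitions are assumed to be the same as the infra ones in the description; the grep pattern and label selector are assumptions):

            # Steps 1 and 2: create the infra MCP and the MachineOSConfig as shown in the description
            # Step 3: wait until the build pod is created and Running
            oc get pods -n openshift-machine-config-operator | grep -i build
            # Step 4: delete the MachineOSConfig while the build is still running
            oc delete machineosconfig infra
            # Step 5: the panic should now show up in the machine-config-controller logs
            oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-controller --previous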


            Sam Batschelet added a comment -

            This panic has been reproduced more generally in CI [1].

            [1] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_machine-config-operator/4359/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-techpreview/1798704994061389824

            Pinned by David Eads

            David Eads added a comment -

            Updated for release blocker approved.  All panics are release blockers because they can result in the product being unable to progress.


            Sergio Regidor de la Rosa added a comment -

            I've seen this panic happening too when I create a MOSC resource for an infra pool and I simply remove it before the build has finished.
