Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12
Component/s: Cluster Resource Override Admission Operator
Labels:
- backport-requested
- triaged

Severity: Moderate
Regression: No
Story Points: 3
Sprint: WINC - Sprint 241, WINC - Sprint 242
sprint_count: 2
Blocked: False
Blocked Reason: None
Target Version: 4.14.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

If a pod was created before the cluster resource override operator was enabled in a namespace and the operator is enabled later, it may try to fix the resources in the pod spec whenever a patch operation on the pod object happens, even if the pod is running and such a change to the spec is forbidden. The kube-apiserver rejects this kind of patch attempt during the validation phase (which runs after mutating webhooks have modified the request), and the rejection invalidates the entire original patch. This causes serious problems in some scenarios.
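The interaction described above can be modeled in a few lines. This is a minimal sketch, not the operator's or the kube-apiserver's code: `override_resources`, `validate_update`, and `admit_update` are hypothetical names, the `100m` override value is made up, and the validation is simplified to reject any spec change (the real kube-apiserver allows a handful of fields to change).

```python
import copy


def override_resources(pod, cpu_request="100m"):
    """Hypothetical stand-in for the override webhook: force a CPU request on every container."""
    mutated = copy.deepcopy(pod)
    for container in mutated["spec"]["containers"]:
        requests = container.setdefault("resources", {}).setdefault("requests", {})
        requests["cpu"] = cpu_request
    return mutated


def validate_update(old_pod, new_pod):
    """Simplified validation: any spec change on an existing pod is rejected.

    The real kube-apiserver allows a few fields (such as container images) to change.
    """
    if old_pod["spec"] != new_pod["spec"]:
        raise ValueError("spec: Forbidden: pod updates may not change fields")


def admit_update(old_pod, patched_pod, override_enabled):
    """Run the mutating step first, then validate the mutated result."""
    candidate = override_resources(patched_pod) if override_enabled else patched_pod
    validate_update(old_pod, candidate)
    return candidate


# A pod created before the namespace was labeled: its CPU request was never overridden.
pod = {
    "metadata": {"name": "example", "finalizers": ["batch.kubernetes.io/job-tracking"]},
    "spec": {"containers": [{"name": "sleep", "resources": {"requests": {"cpu": "2"}}}]},
}

# The client's patch only removes the finalizers; it does not touch the spec.
patch_result = copy.deepcopy(pod)
del patch_result["metadata"]["finalizers"]

try:
    admit_update(pod, patch_result, override_enabled=True)
    outcome = "accepted"
except ValueError as err:
    outcome = f"rejected: {err}"
print(outcome)
```

In this model the metadata-only patch is rejected only because the mutating step added a spec change of its own; with `override_enabled=False` the same patch passes validation.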

The most problematic situation in which this can happen is when a pod has a finalizer set and something (or somebody) tries to remove it. In this case, the cluster resource override operator intercepts the patch that removes the finalizer from the metadata and imposes a spec modification, and then validation fails. This makes it impossible to remove the pod finalizers and leaves the pod stuck in the API forever.

This can become even worse if the pod deletion is part of a drain during a cluster upgrade, because it blocks the upgrade.

Version-Release number of selected component (if applicable):

Tested on 4.12.0-202305262042

How reproducible:

Always

Steps to Reproduce:

1. Start with the cluster resource override operator installed and configured, and a namespace WITHOUT the clusterresourceoverrides.admission.autoscaling.openshift.io/enabled=true label.
2. Create a job which takes several minutes to complete (e.g. a sleep), has a pod template spec that violates the cluster resource override operator configuration and has the batch.kubernetes.io/job-tracking annotation (this forces pods to be created with a job tracking finalizer, more details here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-tracking-with-finalizers).
3. While the job is running, add the label clusterresourceoverrides.admission.autoscaling.openshift.io/enabled=true to the namespace, so that the operator now tracks this namespace.
4. A few minutes later, delete the job.
5. When the pod is stuck deleting, try to remove the finalizer manually with `oc -n ${NAMESPACE} patch pod/${POD} --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'` or look at the kube-controller-manager pod logs.
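For step 2, a job manifest along these lines could be used. This is a sketch under assumptions that are not in the report: the job name `crao-repro`, the image, the 600-second sleep, and the resource values are all placeholders, and setting the CPU request equal to the limit only conflicts with the operator if its configuration derives a lower request from the limit.

```python
import json


def build_job(name="crao-repro", sleep_seconds=600):
    """Build a batch/v1 Job manifest for the reproduction (all values are placeholders)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Annotation named in step 2 of the report.
            "annotations": {"batch.kubernetes.io/job-tracking": ""},
        },
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "sleep",
                            # Assumed image; any image that provides `sleep` works.
                            "image": "registry.access.redhat.com/ubi9/ubi-minimal",
                            "command": ["sleep", str(sleep_seconds)],
                            # Assumed values: request == limit, which an override
                            # configured to lower requests would want to rewrite.
                            "resources": {
                                "requests": {"cpu": "500m", "memory": "256Mi"},
                                "limits": {"cpu": "500m", "memory": "256Mi"},
                            },
                        }
                    ],
                }
            }
        },
    }


if __name__ == "__main__":
    # JSON is accepted wherever a YAML manifest is.
    print(json.dumps(build_job(), indent=2))
```

If the script is saved as, say, `make_job.py` (a hypothetical file name), its output can be piped into `oc -n ${NAMESPACE} apply -f -`.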

Actual results:


The pod is stuck terminating forever, and removing the finalizer fails with an error like the following (whether you try it manually or check the kube-controller-manager logs):

The Pod "whatever-12345678-xxxxx" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)

This error happens even though the user (or the kube-controller-manager, which tries to remove the finalizer because it has already tracked the job status) did not try to patch the spec.

If we run the patch with a higher log level so that the full diff is visible, we see that a spec update is attempted, and that this update tries to reconcile the pod spec to what the cluster resource override operator would have produced if the pod were being created at that moment.

Expected results:


The cluster resource override operator should not touch the spec of a pod in situations where doing so is forbidden, especially while a finalizer is being removed.
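One way to express the expected behavior is a guard that the webhook evaluates before mutating anything. This is only an illustrative sketch and is not taken from the linked fix: `should_override` is a hypothetical helper, and the assumption is that the request is available as a dictionary shaped like the `request` object of an admission review, with its `operation` and `subResource` fields.

```python
def should_override(request):
    """Return True only when a resource override could actually be persisted."""
    # Pod resources are fixed once the pod exists, so only creation qualifies.
    if request.get("operation") != "CREATE":
        return False
    # Subresource requests (for example "status") do not carry a new pod spec.
    if request.get("subResource"):
        return False
    return True


# A finalizer removal arrives as an UPDATE on the pod and is left untouched.
finalizer_patch = {"operation": "UPDATE", "kind": {"kind": "Pod"}}
new_pod = {"operation": "CREATE", "kind": {"kind": "Pod"}}

print(should_override(finalizer_patch))
print(should_override(new_pod))
```

With a guard like this, the finalizer removal passes through the webhook unchanged and validation sees only the metadata change the client asked for.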

Additional info:


Temporarily removing the clusterresourceoverrides.admission.autoscaling.openshift.io/enabled=true label works as a workaround, because that stops cluster resource override operator from touching that namespace.

blocks

OCPBUGS-18253 Cluster Resource Override Operator should not override resources while removing finalizers

Closed

is cloned by

OCPBUGS-18253 Cluster Resource Override Operator should not override resources while removing finalizers

Closed

links to

openshift/cluster-resource-override-admission#46: OCPBUGS-15332: Prevent mutation attempts that can't ever succeed

RHSA-2023:5006 OpenShift Container Platform 4.14.z security update

Solution (Knowledge Base)

Assignee: John Kyros
Reporter: Pablo Alonso Rodriguez
QA Contact: Weinan Liu
Votes: 0
Watchers: 7
Created: 2023/06/23 8:49 AM
Updated: 2024/10/22 9:37 AM
Resolved: 2023/10/31 1:41 PM
