-
Bug
-
Resolution: Done-Errata
-
Normal
-
4.14
-
+
-
Moderate
-
No
-
MCO Sprint 236, MCO Sprint 237
-
2
-
False
-
Description of problem:
ControllerConfig renders properly until the Infrastructure object changes, then:
- 'Kind' and 'APIVersion' are no longer present on the object resulting from a "get" for that object via the lister
- as a result, the embedded dns and infrastructure objects in ControllerConfig fail to validate
- this results in ControllerConfig failing to sync
Version-Release number of selected component (if applicable):
4.14 machine-config-operator
How reproducible:
I can reproduce it every time
Steps to Reproduce:
1. Build a 4.14 cluster
2. Update Infrastructure non-destructively, e.g.: oc annotate infrastructure cluster break.the.mco=yep
3. Watch the machine-config-operator pod logs (or oc get co, the error will propagate) to see the validation errors for the new controllerconfig
Actual results:
2023-05-17T20:45:04.627320107Z I0517 20:45:04.627281 1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"d52d09f4-f7bb-497a-a5c3-92861aa6796f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigControllerFailed' Failed to resync 4.14.0-0.ci.test-2023-05-17-193937-ci-op-dcrr8kjq-latest because: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.apiVersion: Required value: must not be empty, spec.infra.kind: Required value: must not be empty, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
Expected results:
machine-config-operator quietly syncs controllerconfig :)
Additional info:
The MCO itself is not doing this. It's not part of resourcemerge or anything like that. It's happening "below" us.
The short version is that when using a typed client, the group, version, kind (GVK) gets stripped during decoding because it's redundant (you already know the type). For "top level" objects, it gets put back automatically during an update request, but that doesn't recurse into embedded objects (which Infrastructure and DNS are). So we end up with embedded objects that are missing explicit GVKs and won't validate.
Why does it only happen after the objects change? We're using a lister, and the lister's "strip-on-decode" behavior seems a little inconsistent: sometimes the GVK is populated. If you use a direct client "get", the GVK will never be populated.
There is a lot of history on this behavior and it won't be changed any time soon. Here are some entry points:
- https://github.com/kubernetes/kubernetes/pull/63972
- https://github.com/kubernetes/kubernetes/issues/80609
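One possible way to guard against this, sketched below, is to re-populate the GVK on the embedded objects before they are placed into the ControllerConfig spec. This is only an illustration of the idea, not necessarily the fix that shipped. To stay self-contained it uses hypothetical stand-in types (TypeMeta, Infrastructure, DNS, ControllerConfigSpec) instead of the real metav1 and openshift/api types, and the helpers ensureTypeMeta and missingTypeMeta are invented for this sketch.

```go
package main

import "fmt"

// TypeMeta is a hypothetical stand-in with the same shape as metav1.TypeMeta.
type TypeMeta struct {
	Kind       string
	APIVersion string
}

// Infrastructure and DNS are simplified stand-ins for the config.openshift.io/v1 types.
type Infrastructure struct {
	TypeMeta
	Name string
}

type DNS struct {
	TypeMeta
	Name string
}

// ControllerConfigSpec models only the two embedded objects discussed in this bug.
type ControllerConfigSpec struct {
	Infra *Infrastructure
	DNS   *DNS
}

// ensureTypeMeta fills in APIVersion and Kind only when they are empty,
// so an object that still carries its GVK is left untouched.
func ensureTypeMeta(tm *TypeMeta, apiVersion, kind string) {
	if tm.APIVersion == "" {
		tm.APIVersion = apiVersion
	}
	if tm.Kind == "" {
		tm.Kind = kind
	}
}

// missingTypeMeta imitates the "Required value" checks seen in the log above.
func missingTypeMeta(path string, tm TypeMeta) []string {
	var errs []string
	if tm.APIVersion == "" {
		errs = append(errs, path+".apiVersion: Required value: must not be empty")
	}
	if tm.Kind == "" {
		errs = append(errs, path+".kind: Required value: must not be empty")
	}
	return errs
}

func main() {
	// Simulate objects returned with their GVK stripped.
	infra := &Infrastructure{Name: "cluster"}
	dns := &DNS{Name: "cluster"}
	fmt.Println("before:", missingTypeMeta("spec.infra", infra.TypeMeta))

	// Restore the GVK explicitly before embedding the objects.
	ensureTypeMeta(&infra.TypeMeta, "config.openshift.io/v1", "Infrastructure")
	ensureTypeMeta(&dns.TypeMeta, "config.openshift.io/v1", "DNS")

	spec := ControllerConfigSpec{Infra: infra, DNS: dns}
	fmt.Println("after:",
		missingTypeMeta("spec.infra", spec.Infra.TypeMeta),
		missingTypeMeta("spec.dns", spec.DNS.TypeMeta))
}
```

Filling the fields only when they are empty keeps the helper safe to call whether or not the lister happened to return a populated GVK.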
- is related to: OCPBUGS-4877 MCO warns unknown fields from ControllerConfig (Closed)
- links to: RHEA-2023:5006 rpm