OpenShift Bugs / OCPBUGS-13860

ControllerConfig fails to sync after Infrastructure (and most likely DNS) embedded objects are updated


      Description of problem:

      ControllerConfig renders properly until the Infrastructure object changes. After that:
      - 'Kind' and 'APIVersion' are no longer present on the object returned by a "get" for that object via the lister
      - as a result, the embedded dns and infrastructure objects in ControllerConfig fail validation
      - this causes ControllerConfig to fail to sync
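The failure mode in the list above can be modeled in a few lines of Go. This is a minimal sketch, not MCO or API server code: `TypeMeta` and `validateEmbedded` are hypothetical stand-ins, and the error strings only imitate the shape of the messages quoted under "Actual results".

```go
package main

import "fmt"

// TypeMeta is a hypothetical stand-in for the Kind/APIVersion pair
// that every Kubernetes object carries.
type TypeMeta struct {
	Kind       string
	APIVersion string
}

// validateEmbedded imitates a "required field" check on an embedded object.
// path is the field path of the embedded object, for example "spec.infra".
func validateEmbedded(path string, tm TypeMeta) []string {
	var errs []string
	if tm.APIVersion == "" {
		errs = append(errs, path+".apiVersion: Required value: must not be empty")
	}
	if tm.Kind == "" {
		errs = append(errs, path+".kind: Required value: must not be empty")
	}
	return errs
}

func main() {
	// An embedded object whose GVK survived decoding passes the check.
	populated := TypeMeta{Kind: "Infrastructure", APIVersion: "config.openshift.io/v1"}
	fmt.Println(len(validateEmbedded("spec.infra", populated)))

	// An embedded object whose GVK was stripped fails on both fields.
	stripped := TypeMeta{}
	for _, e := range validateEmbedded("spec.infra", stripped) {
		fmt.Println(e)
	}
}
```

The point of the sketch is only that an empty Kind or APIVersion on an embedded object is enough to reject the whole ControllerConfig, even though nothing else in the object changed.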

      Version-Release number of selected component (if applicable):

      4.14 machine-config-operator

      How reproducible:

      I can reproduce it every time 

      Steps to Reproduce:

      1. Build a 4.14 cluster
      2. Update Infrastructure non-destructively, e.g.: oc annotate infrastructure cluster break.the.mco=yep
      3. Watch the machine-config-operator pod logs (or run oc get co; the error will propagate there) to see the validation errors for the new controllerconfig

      Actual results:

      2023-05-17T20:45:04.627320107Z I0517 20:45:04.627281       1 event.go:285] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"d52d09f4-f7bb-497a-a5c3-92861aa6796f", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigControllerFailed' Failed to resync 4.14.0-0.ci.test-2023-05-17-193937-ci-op-dcrr8kjq-latest because: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [spec.infra.apiVersion: Required value: must not be empty, spec.infra.kind: Required value: must not be empty, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]

      Expected results:

      machine-config-operator quietly syncs controllerconfig :) 

      Additional info:

      The MCO itself is not doing this. It's not part of resourcemerge or anything like that. It's happening "below" us.
      The short version is that when using a typed client, the group, version, kind (GVK) gets stripped during decoding because it's redundant (you already know the type). For "top level" objects, it gets put back automatically during an update request, but that step doesn't recurse into embedded objects (which Infrastructure and DNS are). So we end up with embedded objects that are missing explicit GVKs and won't validate.
      Why does it only happen after the objects change? We're using a lister, and the lister's "strip-on-decode" behavior seems a little inconsistent: sometimes the GVK is populated. If you use a direct client "get", the GVK will never be populated.
      There is a lot of history behind this behavior, and it won't be changed any time soon. Here are some entry points:
      - https://github.com/kubernetes/kubernetes/pull/63972
      - https://github.com/kubernetes/kubernetes/issues/80609
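
One way to work around the stripped GVK described above is to set it explicitly on the embedded copies before ControllerConfig is written. The following is a minimal sketch under stated assumptions, not necessarily how the MCO addresses this: the struct types are hypothetical stand-ins for the real API types, `restoreGVK` and `restoreEmbeddedGVKs` are illustrative names, and `config.openshift.io/v1` is assumed to be the API version of both Infrastructure and DNS.

```go
package main

import "fmt"

// TypeMeta is a hypothetical stand-in for the Kind/APIVersion pair on an object.
type TypeMeta struct {
	Kind       string
	APIVersion string
}

// Infrastructure and DNS are simplified stand-ins for the embedded objects.
type Infrastructure struct {
	TypeMeta
	Name string
}

type DNS struct {
	TypeMeta
	Name string
}

// ControllerConfigSpec is a simplified stand-in holding the embedded objects.
type ControllerConfigSpec struct {
	Infra *Infrastructure
	DNS   *DNS
}

// configAPIVersion is the assumed group/version of both embedded types.
const configAPIVersion = "config.openshift.io/v1"

// restoreGVK fills in Kind and APIVersion only when they are empty,
// so an already populated object is left untouched.
func restoreGVK(tm *TypeMeta, kind string) {
	if tm.Kind == "" {
		tm.Kind = kind
	}
	if tm.APIVersion == "" {
		tm.APIVersion = configAPIVersion
	}
}

// restoreEmbeddedGVKs repairs the embedded objects before the spec is sent.
func restoreEmbeddedGVKs(spec *ControllerConfigSpec) {
	if spec.Infra != nil {
		restoreGVK(&spec.Infra.TypeMeta, "Infrastructure")
	}
	if spec.DNS != nil {
		restoreGVK(&spec.DNS.TypeMeta, "DNS")
	}
}

func main() {
	// Both embedded objects start without a GVK, as they would after decoding.
	spec := &ControllerConfigSpec{
		Infra: &Infrastructure{Name: "cluster"},
		DNS:   &DNS{Name: "cluster"},
	}
	restoreEmbeddedGVKs(spec)
	fmt.Println(spec.Infra.APIVersion, spec.Infra.Kind)
	fmt.Println(spec.DNS.APIVersion, spec.DNS.Kind)
}
```

Filling in only empty fields keeps the helper idempotent, so it behaves the same whether or not the lister happened to return a populated GVK.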

            jkyros@redhat.com John Kyros
            Sergio Regidor de la Rosa