  1. OCP Technical Release Team
  2. TRT-1334

Machine Config Operator controllerCertificates Validation Error


    • Type: Story
    • Resolution: Done

      Component Readiness shows 8 regressed components on ovn amd64 gcp.

      Most of these seem to trace back to the pathological events tests for several components:

      [sig-arch] events should not repeat pathologically for ns/openshift-authentication
      [sig-arch] events should not repeat pathologically for ns/openshift-dns
      [sig-arch] events should not repeat pathologically for ns/openshift-controller-manager
      [sig-arch] events should not repeat pathologically
      [sig-arch] events should not repeat pathologically for ns/openshift-ovn-kubernetes

      Also looks to be hitting:

      [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

      Examples:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1717050802796761088

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1715484618926329856

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1716412385557745664

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1716600654568361984

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade/1716829054398631936

      Looking into the failing job runs, we see batches of these pathological events near the end of the upgrade in the spyglass chart.

      The pathological events are always of the form:

      event happened 22 times, something is wrong: ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 hmsg/059c489b5a - pathological/true reason/FailedScheduling 0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, 4 Preemption is not helpful for scheduling.. From: 07:35:29Z To: 07:35:30Z result=reject 
      

      This indicates a node scheduling problem.
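For triage across many job runs, it can help to split these event lines into fields. A minimal sketch, assuming every pathological event follows the layout of the single sample above; the regex and the helper name `parse_pathological_event` are hypothetical and inferred only from that sample, not taken from the test suite:

```python
import re

# Field layout inferred from the one sample message in this ticket (an assumption).
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"ns/(?P<namespace>\S+) pod/(?P<pod>\S+) hmsg/(?P<hmsg>\S+) - "
    r"pathological/(?P<pathological>\S+) reason/(?P<reason>\S+) "
    r"(?P<message>.*?) From: (?P<start>\S+) To: (?P<end>\S+) result=(?P<result>\S+)"
)


def parse_pathological_event(line):
    """Return the event fields as a dict, or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["count"] = int(fields["count"])
    return fields


# The sample event quoted in this ticket.
SAMPLE = (
    "event happened 22 times, something is wrong: "
    "ns/openshift-controller-manager pod/controller-manager-7bfb568887-dxpv9 "
    "hmsg/059c489b5a - pathological/true reason/FailedScheduling "
    "0/6 nodes are available: 2 node(s) didn't match Pod's node affinity/selector, "
    "2 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable. "
    "preemption: 0/6 nodes are available: 2 node(s) didn't match pod anti-affinity rules, "
    "4 Preemption is not helpful for scheduling.. "
    "From: 07:35:29Z To: 07:35:30Z result=reject"
)

event = parse_pathological_event(SAMPLE)
print(event["namespace"], event["reason"], event["count"])
```

Grouping the parsed events by `reason` and `namespace` would show whether every batch is FailedScheduling, as the runs above suggest.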

      Overlapping these events, clusteroperator/machine-config is Available=false with a pretty severe-looking error:

      condition/Degraded reason/MachineConfigControllerFailed status/True Unable to apply 4.15.0-0.nightly-2023-10-25-052621: ControllerConfig.machineconfiguration.openshift.io "machine-config-controller" is invalid: [status.controllerCertificates[0].notAfter: Required value, status.controllerCertificates[0].notBefore: Required value, status.controllerCertificates[1].notAfter: Required value, status.controllerCertificates[1].notBefore: Required value, status.controllerCertificates[2].notAfter: Required value, status.controllerCertificates[2].notBefore: Required value, status.controllerCertificates[3].notAfter: Required value, status.controllerCertificates[3].notBefore: Required value, status.controllerCertificates[4].notAfter: Required value, status.controllerCertificates[4].notBefore: Required value, status.controllerCertificates[5].notAfter: Required value, status.controllerCertificates[5].notBefore: Required value, status.controllerCertificates[6].notAfter: Required value, status.controllerCertificates[6].notBefore: Required value, status.controllerCertificates[7].notAfter: Required value, status.controllerCertificates[7].notBefore: Required value, status.controllerCertificates[8].notAfter: Required value, status.controllerCertificates[8].notBefore: Required value, status.controllerCertificates[9].notAfter: Required value, status.controllerCertificates[9].notBefore: Required value, <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]
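The error is long but regular: each controllerCertificates entry is reported as missing both notAfter and notBefore. A rough sketch for condensing it into a per-index summary, assuming the `status.controllerCertificates[N].<field>: Required value` pattern seen above; the helper name `missing_cert_fields` is hypothetical and the excerpt is shortened to the first two entries:

```python
import re
from collections import defaultdict

# Matches one "Required value" complaint and captures the list index and field name.
FIELD_RE = re.compile(
    r"status\.controllerCertificates\[(\d+)\]\.(\w+): Required value"
)


def missing_cert_fields(error_text):
    """Map each controllerCertificates index to the set of fields reported missing."""
    missing = defaultdict(set)
    for index, field in FIELD_RE.findall(error_text):
        missing[int(index)].add(field)
    return dict(missing)


# Shortened excerpt of the error above (first two entries only).
EXCERPT = (
    "[status.controllerCertificates[0].notAfter: Required value, "
    "status.controllerCertificates[0].notBefore: Required value, "
    "status.controllerCertificates[1].notAfter: Required value, "
    "status.controllerCertificates[1].notBefore: Required value]"
)

summary = missing_cert_fields(EXCERPT)
for index in sorted(summary):
    print(index, sorted(summary[index]))
```

Run against the full message, this should list indices 0 through 9, each missing the same two fields.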
      

      A flaking test points to this specific problem:
      [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available

      Sippy indicates a dramatic uptick in flakes for this test somewhere in the last week, from 0.8% to 21% as of today.

      Clicking through to the test, it appears the problem started on Oct 20, possibly late on Oct 19. The failure outputs at the bottom of the test page all show this error.

      The problem slipped through aggregation because it only surfaces on 20-30% of runs.
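To illustrate why an intermittent failure can pass an aggregated gate, here is a back-of-the-envelope binomial calculation. The aggregator's real logic is not described in this ticket; the run count of 10 and the tolerance of 3 failures are hypothetical values chosen only for the arithmetic, and runs are assumed to be independent:

```python
from math import comb


def prob_at_most_k_failures(n, k, p):
    """Probability of seeing at most k failures in n independent runs,
    where each run fails with probability p (binomial CDF)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical gate: 10 aggregated runs, passes if no more than 3 fail.
RUNS = 10
TOLERATED_FAILURES = 3

for failure_rate in (0.2, 0.3):
    chance = prob_at_most_k_failures(RUNS, TOLERATED_FAILURES, failure_rate)
    print(f"failure rate {failure_rate:.0%}: gate passes with probability {chance:.2f}")
```

Under these assumed numbers the gate would pass more often than not at either failure rate, which is consistent with the regression going unnoticed.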

              rhn-engineering-dgoodwin Devan Goodwin
