OCPBUGS-64623

Drain Events should be successfully emitted



      Description of problem

      Drain-failure Events should be successfully emitted, but in 4.21 CI runs like the one queried below, we instead see logged lines like:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-649cdc7d77-s9m4g_machine-config-controller.log | grep 'will not report event' | tail -n1
      E1029 10:21:43.721757       1 event.go:442] "Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme \"github.com/openshift/client-go/machineconfiguration/clientset/versioned/scheme/register.go:15\"" object="&Node{ObjectMeta:{ip-10-0-73-59.us-west-2.compute.internal    836698f0-48a3-49f7-849a-624c2ac9add8 103380 0 2025-10-29 06:34:15 +0000 UTC...Features:&NodeFeatures{SupplementalGroupsPolicy:*true,},},}" eventType="Warning" reason="DrainThresholdExceeded" message="Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod \"pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp\" in namespace \"e2e-pod-network-disruption-test-sc25z\" to terminate: context deadline exceeded"
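
      The error comes from the event recorder being handed a runtime.Scheme that only knows the machineconfiguration types, so ref.GetReference cannot resolve a core/v1 Node. A minimal sketch of one possible fix, assuming the recorder is currently built around the MCO clientset's scheme (which the error's register.go:15 reference suggests); the helper names below are illustrative, not actual MCO code:

      package events

      import (
          corev1 "k8s.io/api/core/v1"
          "k8s.io/apimachinery/pkg/runtime"
          utilruntime "k8s.io/apimachinery/pkg/util/runtime"
          "k8s.io/client-go/tools/record"

          mcfgscheme "github.com/openshift/client-go/machineconfiguration/clientset/versioned/scheme"
      )

      // newRecorderScheme builds a scheme that can resolve references for both
      // MCO types and core types like Node, avoiding the
      // "no kind is registered for the type v1.Node" failure above.
      func newRecorderScheme() *runtime.Scheme {
          scheme := runtime.NewScheme()
          utilruntime.Must(mcfgscheme.AddToScheme(scheme)) // MachineConfig, MachineConfigPool, ...
          utilruntime.Must(corev1.AddToScheme(scheme))     // core/v1 types, including Node
          return scheme
      }

      // newNodeEventRecorder returns a recorder whose scheme knows about Nodes.
      // The component name here is illustrative.
      func newNodeEventRecorder(b record.EventBroadcaster) record.EventRecorder {
          return b.NewRecorder(newRecorderScheme(), corev1.EventSource{Component: "machineconfigcontroller"})
      }

      Alternatively, k8s.io/client-go/kubernetes/scheme.Scheme already includes the core types and is a common choice for recorders that need to reference Nodes.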
      

      Version-Release number of selected component

      Seen in 4.21 CI. Likely affects all 4.19 and later releases that include MCO-81's mco#4726.

      How reproducible

      I haven't tried, but it looks like it should reproduce every time.

      Steps to Reproduce

      Set a PodDisruptionBudget that prevents drains, such as:

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        namespace: openshift-ingress
        name: test
      spec:
        maxUnavailable: 0
        selector:
          matchLabels:
            ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
      

      Add a MachineConfig or otherwise do something that bumps the worker MachineConfigPool.
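
      For example, a trivial MachineConfig that writes a file is enough to produce a new rendered config and kick off the update (the name and file path here are arbitrary):

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        name: 99-worker-test-drain-bump
        labels:
          machineconfiguration.openshift.io/role: worker
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
              - path: /etc/test-drain-bump
                mode: 420
                contents:
                  source: data:,hello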

      Watch for the Nodes hosting the ingress Pods to get stuck draining.

      Check for Events about the failed Node drain.
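
      Something like the following should list them on a live cluster (Events support field selectors on involvedObject.kind):

      $ oc get events -A --field-selector involvedObject.kind=Node | grep -i drain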

      Actual results

      Machine-config controller logs lines like "Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme ...", and the Event is never emitted.

      Expected results

      Events about the failed Node drain exist.
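
      Once fixed, something like this should turn up the drain-failure Event, in whichever namespace the controller ends up using (see Additional information):

      $ oc get events -A --field-selector reason=DrainThresholdExceeded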

      Additional information

      Making it a bit more exciting to find Events for Nodes: Events are a namespaced resource, and there's currently no standard policy about which namespace to use when emitting Events about a cluster-scoped resource like a Node. From the CI run queried in the Description:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -r '.items[] | select(.involvedObject.kind == "Node") | .involvedObject.namespace + " " + .metadata.namespace + " " + (.source | del(.host) | tostring)' | sort | uniq -c
            6  default {"component":"cloud-node-controller"}
           56  default {"component":"kubelet"}
            5  default {"component":"machineconfigdaemon"}
           62  default {"component":"node-controller"}
            1  default {"component":"ovnk-controlplane"}
            2 openshift-kube-apiserver openshift-kube-apiserver {"component":"cert-regeneration-controller"}
            6 openshift-kube-controller-manager openshift-kube-controller-manager {"component":"cert-recovery-controller"}
          128 openshift-machine-config-operator openshift-machine-config-operator {"component":"machineconfigdaemon"}
           28 openshift-machine-config-operator openshift-machine-config-operator {"component":"machine-config-operator"}
            6 openshift-network-diagnostics openshift-network-diagnostics {"component":"check-endpoint"}
      

      So at the moment, even the machine-config daemon is split between default and openshift-machine-config-operator as the appropriate namespace for eventing on Nodes. Picking out one example MCD-generated Event from each namespace:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -c '[.items[] | select(.involvedObject.kind == "Node" and .source.component == "machineconfigdaemon" and .metadata.namespace == "default")] | sort_by(.firstTimestamp)[0] | {firstTimestamp, source, reason, message}'
      {"firstTimestamp":"2025-10-29T08:06:40Z","source":{"component":"machineconfigdaemon","host":"ip-10-0-16-93.us-west-2.compute.internal"},"reason":"OSUpdateStaged","message":"Changes to OS staged"}
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -c '[.items[] | select(.involvedObject.kind == "Node" and .source.component == "machineconfigdaemon" and .metadata.namespace == "openshift-machine-config-operator")] | sort_by(.firstTimestamp)[0] | {firstTimestamp, source, reason, message}'
      {"firstTimestamp":"2025-10-29T06:38:22Z","source":{"component":"machineconfigdaemon","host":"ip-10-0-73-59.us-west-2.compute.internal"},"reason":"Uncordon","message":"Update completed for config rendered-master-cc6437431424419596269606b794d491 and node has been uncordoned"}
      
