Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Versions: 4.19, 4.20, 4.21
Component area: Quality / Stability / Reliability
Severity: Low
Description of problem
Drain-failure Events should be successfully emitted, but in 4.21 CI runs like this one, we instead see logged lines like:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-649cdc7d77-s9m4g_machine-config-controller.log | grep 'will not report event' | tail -n1
E1029 10:21:43.721757 1 event.go:442] "Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme \"github.com/openshift/client-go/machineconfiguration/clientset/versioned/scheme/register.go:15\"" object="&Node{ObjectMeta:{ip-10-0-73-59.us-west-2.compute.internal 836698f0-48a3-49f7-849a-624c2ac9add8 103380 0 2025-10-29 06:34:15 +0000 UTC...Features:&NodeFeatures{SupplementalGroupsPolicy:*true,},},}" eventType="Warning" reason="DrainThresholdExceeded" message="Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod \"pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp\" in namespace \"e2e-pod-network-disruption-test-sc25z\" to terminate: context deadline exceeded"
Version-Release number of selected component
Seen in 4.21 CI. Likely all 4.19 and later releases that include MCO-81's mco#4726.
How reproducible
I haven't tried, but it looks like it should be every time.
Steps to Reproduce
1. Set a PodDisruptionBudget that prevents drains, such as:

   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     namespace: openshift-ingress
     name: test
   spec:
     maxUnavailable: 0
     selector:
       matchLabels:
         ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default

2. Add a MachineConfig or otherwise do something that bumps the worker MachineConfigPool.
3. Watch for the Nodes hosting the ingress Pods to get stuck draining.
4. Check for Events about the failed Node drain.
Actual results
Machine-config controller logs lines like Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme, and no drain-failure Event is emitted.
Expected results
Events about the failed Node drain exist.
Additional information
Making finding Events for Nodes a bit more exciting, Events are a namespaced resource, and there is currently no standard policy about which namespace to use when emitting Events about a cluster-scoped resource like a Node. From the CI run I linked in the Description:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -r '.items[] | select(.involvedObject.kind == "Node") | .involvedObject.namespace + " " + .metadata.namespace + " " + (.source | del(.host) | tostring)' | sort | uniq -c
6 default {"component":"cloud-node-controller"}
56 default {"component":"kubelet"}
5 default {"component":"machineconfigdaemon"}
62 default {"component":"node-controller"}
1 default {"component":"ovnk-controlplane"}
2 openshift-kube-apiserver openshift-kube-apiserver {"component":"cert-regeneration-controller"}
6 openshift-kube-controller-manager openshift-kube-controller-manager {"component":"cert-recovery-controller"}
128 openshift-machine-config-operator openshift-machine-config-operator {"component":"machineconfigdaemon"}
28 openshift-machine-config-operator openshift-machine-config-operator {"component":"machine-config-operator"}
6 openshift-network-diagnostics openshift-network-diagnostics {"component":"check-endpoint"}
So at the moment, even the machine-config daemon is split between default and openshift-machine-config-operator as the appropriate namespace for eventing on Nodes. Picking out one example MCD-generated Event from each namespace:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -c '[.items[] | select(.involvedObject.kind == "Node" and .source.component == "machineconfigdaemon" and .metadata.namespace == "default")] | sort_by(.firstTimestamp)[0] | {firstTimestamp, source, reason, message}'
{"firstTimestamp":"2025-10-29T08:06:40Z","source":{"component":"machineconfigdaemon","host":"ip-10-0-16-93.us-west-2.compute.internal"},"reason":"OSUpdateStaged","message":"Changes to OS staged"}
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -c '[.items[] | select(.involvedObject.kind == "Node" and .source.component == "machineconfigdaemon" and .metadata.namespace == "openshift-machine-config-operator")] | sort_by(.firstTimestamp)[0] | {firstTimestamp, source, reason, message}'
{"firstTimestamp":"2025-10-29T06:38:22Z","source":{"component":"machineconfigdaemon","host":"ip-10-0-73-59.us-west-2.compute.internal"},"reason":"Uncordon","message":"Update completed for config rendered-master-cc6437431424419596269606b794d491 and node has been uncordoned"}
Is related to: MCO-81 "MCD: emit earlier events to warn about failing drains" (Closed)