OCPBUGS-64623

Drain Events should be successfully emitted



      Description of problem

      Drain-failure Events should be successfully emitted, but in 4.21 CI runs like the one queried below, we instead see logged lines like:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-controller-649cdc7d77-s9m4g_machine-config-controller.log | grep 'will not report event' | tail -n1
      E1029 10:21:43.721757       1 event.go:442] "Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme \"github.com/openshift/client-go/machineconfiguration/clientset/versioned/scheme/register.go:15\"" object="&Node{ObjectMeta:{ip-10-0-73-59.us-west-2.compute.internal    836698f0-48a3-49f7-849a-624c2ac9add8 103380 0 2025-10-29 06:34:15 +0000 UTC...Features:&NodeFeatures{SupplementalGroupsPolicy:*true,},},}" eventType="Warning" reason="DrainThresholdExceeded" message="Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod \"pod-network-to-pod-network-disruption-poller-564f6885f5-q9qjp\" in namespace \"e2e-pod-network-disruption-test-sc25z\" to terminate: context deadline exceeded"
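
      The error comes from the event recorder being handed a runtime.Scheme that only knows the machineconfiguration types, so ref.GetReference cannot resolve a core/v1 Node. A minimal sketch of one possible fix, assuming the recorder is currently built around the MCO clientset's scheme (which the error's register.go:15 reference suggests); the helper names below are illustrative, not actual MCO code:

      package events

      import (
          corev1 "k8s.io/api/core/v1"
          "k8s.io/apimachinery/pkg/runtime"
          utilruntime "k8s.io/apimachinery/pkg/util/runtime"
          "k8s.io/client-go/tools/record"

          mcfgscheme "github.com/openshift/client-go/machineconfiguration/clientset/versioned/scheme"
      )

      // newRecorderScheme builds a scheme that can resolve references for both
      // MCO types and core types like Node, avoiding the
      // "no kind is registered for the type v1.Node" failure above.
      func newRecorderScheme() *runtime.Scheme {
          scheme := runtime.NewScheme()
          utilruntime.Must(mcfgscheme.AddToScheme(scheme)) // MachineConfig, MachineConfigPool, ...
          utilruntime.Must(corev1.AddToScheme(scheme))     // core/v1 types, including Node
          return scheme
      }

      // newNodeEventRecorder returns a recorder whose scheme knows about Nodes.
      // The component name here is illustrative.
      func newNodeEventRecorder(b record.EventBroadcaster) record.EventRecorder {
          return b.NewRecorder(newRecorderScheme(), corev1.EventSource{Component: "machineconfigcontroller"})
      }

      Alternatively, k8s.io/client-go/kubernetes/scheme.Scheme already includes the core types and is a common choice for recorders that need to reference Nodes.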
      

      Version-Release number of selected component

      Seen in 4.21 CI. Likely affects all 4.19 and later releases that include MCO-81's mco#4726.

      How reproducible

      I haven't tried, but it looks like it should reproduce every time.

      Steps to Reproduce

      Set a PodDisruptionBudget that prevents drains, such as:

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        namespace: openshift-ingress
        name: test
      spec:
        maxUnavailable: 0
        selector:
          matchLabels:
            ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
      

      Add a MachineConfig or otherwise do something that bumps the worker MachineConfigPool.
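
      For example, a trivial MachineConfig that writes a file is enough to produce a new rendered config and kick off the update (the name and file path here are arbitrary):

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        name: 99-worker-test-drain-bump
        labels:
          machineconfiguration.openshift.io/role: worker
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
              - path: /etc/test-drain-bump
                mode: 420
                contents:
                  source: data:,hello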

      Watch for the Nodes hosting the ingress Pods to get stuck draining.

      Check for Events about the failed Node drain.
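
      Something like the following should list them on a live cluster (Events support field selectors on involvedObject.kind):

      $ oc get events -A --field-selector involvedObject.kind=Node | grep -i drain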

      Actual results

      Machine-config controller logs lines like "Could not construct reference, will not report event" err="no kind is registered for the type v1.Node in scheme ...", and the Event is never emitted.

      Expected results

      Events about the failed Node drain exist.
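
      Once fixed, something like this should turn up the drain-failure Event, in whichever namespace the controller ends up using (see Additional information):

      $ oc get events -A --field-selector reason=DrainThresholdExceeded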

      Additional information

      Making it a bit more exciting to find Events for Nodes: Events are a namespaced resource, and there's currently no standard policy about which namespace to use when emitting Events about a cluster-scoped resource like a Node. From the CI run queried in the Description:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -r '.items[] | select(.involvedObject.kind == "Node") | .involvedObject.namespace + " " + .metadata.namespace + " " + (.source | del(.host) | tostring)' | sort | uniq -c
            6  default {"component":"cloud-node-controller"}
           56  default {"component":"kubelet"}
            5  default {"component":"machineconfigdaemon"}
           62  default {"component":"node-controller"}
            1  default {"component":"ovnk-controlplane"}
            2 openshift-kube-apiserver openshift-kube-apiserver {"component":"cert-regeneration-controller"}
            6 openshift-kube-controller-manager openshift-kube-controller-manager {"component":"cert-recovery-controller"}
          128 openshift-machine-config-operator openshift-machine-config-operator {"component":"machineconfigdaemon"}
           28 openshift-machine-config-operator openshift-machine-config-operator {"component":"machine-config-operator"}
            6 openshift-network-diagnostics openshift-network-diagnostics {"component":"check-endpoint"}
      

      So at the moment, even the machine-config daemon is split between default and openshift-machine-config-operator as the appropriate namespace for eventing on Nodes. Picking out one example MCD-generated Event from each namespace:

      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -c '[.items[] | select(.involvedObject.kind == "Node" and .source.component == "machineconfigdaemon" and .metadata.namespace == "default")] | sort_by(.firstTimestamp)[0] | {firstTimestamp, source, reason, message}'
      {"firstTimestamp":"2025-10-29T08:06:40Z","source":{"component":"machineconfigdaemon","host":"ip-10-0-16-93.us-west-2.compute.internal"},"reason":"OSUpdateStaged","message":"Changes to OS staged"}
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-multiarch-master-nightly-4.21-ocp-e2e-upgrade-aws-ovn-arm64/1983414271299555328/artifacts/ocp-e2e-upgrade-aws-ovn-arm64/gather-extra/artifacts/events.json | jq -c '[.items[] | select(.involvedObject.kind == "Node" and .source.component == "machineconfigdaemon" and .metadata.namespace == "openshift-machine-config-operator")] | sort_by(.firstTimestamp)[0] | {firstTimestamp, source, reason, message}'
      {"firstTimestamp":"2025-10-29T06:38:22Z","source":{"component":"machineconfigdaemon","host":"ip-10-0-73-59.us-west-2.compute.internal"},"reason":"Uncordon","message":"Update completed for config rendered-master-cc6437431424419596269606b794d491 and node has been uncordoned"}
      
