Details
- Bug
- Resolution: Done
- Normal
- None
- 4.13
- No
- OCPNODE Sprint 237 (Blue), OCPNODE Sprint 238 (Blue)
- 2
- False
Description
Description of problem:
While upgrading a loaded 120-node cluster (ROSA), one of the control-plane nodes fails to drain, causing the upgrade to get stuck.
Version-Release number of selected component (if applicable):
4.13.0-rc.4 to 4.13.0-rc.6
How reproducible:
Happened on one attempt so far.
Steps to Reproduce:
1. Install a 120-node cluster.
2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper); see the sketch after this list.
3. Upgrade the cluster.
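For reference, a minimal sketch of how step 2 might be invoked from the e2e-benchmarking repo. The WORKLOAD, ITERATIONS, and GC variable names are assumptions about the kube-burner-ocp-wrapper interface, not a verbatim copy of the run used here:

# Sketch only: variable names are assumed; check the wrapper's README before running.
git clone https://github.com/cloud-bulldozer/e2e-benchmarking
cd e2e-benchmarking/workloads/kube-burner-ocp-wrapper
WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh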
Actual results:
The upgrade is stuck because the control-plane MCP never finishes updating. Manual intervention (deleting the pod stuck in Terminating) was required to move the upgrade along; see the sketch below.
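A minimal sketch of that kind of intervention, assuming the stuck pod is the installer-9 pod identified in the logs under Additional info (force deletion only removes the API object and should be used with care on control-plane nodes):

# Hypothetical cleanup step; pod name and namespace taken from the output below.
oc -n openshift-kube-controller-manager delete pod \
  installer-9-ip-10-0-218-240.us-west-2.compute.internal \
  --force --grace-period=0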
Expected results:
The upgrade should succeed without any manual intervention.
Additional info:
bash-3.2$ oc project openshift-machine-api ocNow using project "openshift-machine-api" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443". bash-3.2$ oc get mcp NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE master rendered-master-74371c0a6402ad69951f43db090a5937 False True True 3 2 2 1 17h worker rendered-worker-06da68164c0fcd25c54fc3cffc504e7d True False False 186 186 186 0 17h bash-3.2$ oc get nodes | grep control ip-10-0-129-169.us-west-2.compute.internal Ready control-plane,master 17h v1.26.3+b404935 ip-10-0-176-172.us-west-2.compute.internal Ready control-plane,master 17h v1.26.3+b404935 ip-10-0-218-240.us-west-2.compute.internal Ready,SchedulingDisabled control-plane,master 17h v1.26.3+befad9d bash-3.2$ oc describe node/ip-10-0-218-240.us-west-2.compute.internal Name: ip-10-0-218-240.us-west-2.compute.internal Roles: control-plane,master Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/instance-type=m5.8xlarge beta.kubernetes.io/os=linux failure-domain.beta.kubernetes.io/region=us-west-2 failure-domain.beta.kubernetes.io/zone=us-west-2c kubernetes.io/arch=amd64 kubernetes.io/hostname=ip-10-0-218-240.us-west-2.compute.internal kubernetes.io/os=linux node-role.kubernetes.io/control-plane= node-role.kubernetes.io/master= node.kubernetes.io/instance-type=m5.8xlarge node.openshift.io/os_id=rhcos topology.ebs.csi.aws.com/zone=us-west-2c topology.kubernetes.io/region=us-west-2 topology.kubernetes.io/zone=us-west-2c Annotations: cloud.network.openshift.io/egress-ipconfig: [{"interface":"eni-0d00e83bfcf951d97","ifaddr":{"ipv4":"10.0.192.0/19"},"capacity":{"ipv4":29,"ipv6":30}}] csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fcda6bf3578f7407"} k8s.ovn.org/host-addresses: ["10.0.218.240"] k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-218-240.us-west-2.compute.internal","mac-address":"0a:95:ef:fa:9c:17","ip-addres... k8s.ovn.org/node-chassis-id: c7c5d262-341e-481c-804a-da6b4a085e63 k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.4/16"} k8s.ovn.org/node-mgmt-port-mac-address: 72:35:cc:3d:dc:90 k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.218.240/19"} k8s.ovn.org/node-subnets: {"default":["10.129.0.0/23"]} machine.openshift.io/machine: openshift-machine-api/test-upgrade-g9wl2-master-2 machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable machineconfiguration.openshift.io/currentConfig: rendered-master-74371c0a6402ad69951f43db090a5937 machineconfiguration.openshift.io/desiredConfig: rendered-master-bdb8565e5d621ced44f3ebd66713dc05 machineconfiguration.openshift.io/desiredDrain: drain-rendered-master-bdb8565e5d621ced44f3ebd66713dc05 machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-74371c0a6402ad69951f43db090a5937 machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 4110931 machineconfiguration.openshift.io/reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more informat... 
machineconfiguration.openshift.io/state: Degraded volumes.kubernetes.io/controller-managed-attach-detach: true CreationTimestamp: Tue, 02 May 2023 18:39:17 -0500 Taints: node-role.kubernetes.io/master:NoSchedule node.kubernetes.io/unschedulable:NoSchedule Unschedulable: true Lease: HolderIdentity: ip-10-0-218-240.us-west-2.compute.internal AcquireTime: <unset> RenewTime: Wed, 03 May 2023 12:01:43 -0500 Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- MemoryPressure False Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available DiskPressure False Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure PIDPressure False Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletHasSufficientPID kubelet has sufficient PID available Ready True Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletReady kubelet is posting ready status Addresses: InternalIP: 10.0.218.240 Hostname: ip-10-0-218-240.us-west-2.compute.internal InternalDNS: ip-10-0-218-240.us-west-2.compute.internal Capacity: attachable-volumes-aws-ebs: 25 cpu: 32 ephemeral-storage: 366410732Ki hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 130397904Ki pods: 250 Allocatable: attachable-volumes-aws-ebs: 25 cpu: 31850m ephemeral-storage: 336610388229 hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 120858320Ki pods: 250 System Info: Machine ID: ec21357d1e7ff0abc0f899ce50f1ed57 System UUID: ec21357d-1e7f-f0ab-c0f8-99ce50f1ed57 Boot ID: 8ed83c2e-bb8c-47cf-9a5c-8b50db65f45a Kernel Version: 5.14.0-284.10.1.el9_2.x86_64 OS Image: Red Hat Enterprise Linux CoreOS 413.92.202304140330-0 (Plow) Operating System: linux Architecture: amd64 Container Runtime Version: cri-o://1.26.3-3.rhaos4.13.git641290e.el9 Kubelet Version: v1.26.3+befad9d Kube-Proxy Version: v1.26.3+befad9d ProviderID: aws:///us-west-2c/i-0fcda6bf3578f7407 Non-terminated Pods: (22 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- openshift-cluster-csi-drivers aws-ebs-csi-driver-node-hr6fx 30m (0%) 0 (0%) 150Mi (0%) 0 (0%) 158m openshift-cluster-node-tuning-operator tuned-c24fg 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 162m openshift-dns dns-default-n8nzs 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 128m openshift-dns node-resolver-9d4d8 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 134m openshift-etcd etcd-ip-10-0-218-240.us-west-2.compute.internal 360m (1%) 0 (0%) 910Mi (0%) 0 (0%) 3h9m openshift-image-registry node-ca-l58ct 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 164m openshift-kube-apiserver kube-apiserver-ip-10-0-218-240.us-west-2.compute.internal 290m (0%) 0 (0%) 1224Mi (1%) 0 (0%) 3h10m openshift-kube-controller-manager kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal 80m (0%) 0 (0%) 500Mi (0%) 0 (0%) 179m openshift-kube-scheduler openshift-kube-scheduler-ip-10-0-218-240.us-west-2.compute.internal 25m (0%) 0 (0%) 150Mi (0%) 0 (0%) 178m openshift-machine-config-operator machine-config-daemon-5rrrx 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 126m openshift-machine-config-operator machine-config-server-mgvkz 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 123m openshift-monitoring node-exporter-x8sf4 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 164m openshift-monitoring sre-dns-latency-exporter-wn8rf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h openshift-multus multus-additional-cni-plugins-jfcwt 10m (0%) 
0 (0%) 10Mi (0%) 0 (0%) 156m openshift-multus multus-zfjjh 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 159m openshift-multus network-metrics-daemon-7h52k 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 160m openshift-network-diagnostics network-check-target-2pwkk 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 159m openshift-ovn-kubernetes ovnkube-master-q2tg5 60m (0%) 0 (0%) 1520Mi (1%) 0 (0%) 140m openshift-ovn-kubernetes ovnkube-node-j4p2h 50m (0%) 0 (0%) 660Mi (0%) 0 (0%) 156m openshift-security audit-exporter-s9ms6 100m (0%) 100m (0%) 256Mi (0%) 256Mi (0%) 16h openshift-security splunkforwarder-ds-9jgfs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h openshift-validation-webhook validation-webhook-txrkw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h34m Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1199m (3%) 100m (0%) memory 5968Mi (5%) 256Mi (0%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) attachable-volumes-aws-ebs 0 0 Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal RegisteredNode 5h23m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 4h53m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 4h42m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 3h42m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 3h12m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 178m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 177m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal ConfigDriftMonitorStarted 126m machineconfigdaemon Config Drift Monitor started, watching against rendered-master-74371c0a6402ad69951f43db090a5937 Normal RegisteredNode 116m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 106m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal ConfigDriftMonitorStopped 91m machineconfigdaemon Config Drift Monitor stopped Normal Cordon 91m machineconfigdaemon Cordoned node to apply update Normal Drain 91m machineconfigdaemon Draining node to update config. 
Normal NodeNotSchedulable 89m (x2 over 16h) kubelet Node ip-10-0-218-240.us-west-2.compute.internal status is now: NodeNotSchedulable Normal RegisteredNode 65m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Normal RegisteredNode 55m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller Warning FailedToDrain 31m machineconfigdaemon failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information I0503 15:34:09.562518 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-866f8c587c-js6k9 I0503 15:34:09.562576 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-86f8f7df97-ctgz8" terminating: global timeout reached: 1m30s, error when waiting for pod "pod-identity-webhook-84b6dfbf4-kg9sn" terminating: global timeout reached: 1m30s, error when waiting for pod "oauth-openshift-6b595d45b4-t7vsn" terminating: global timeout reached: 1m30s, error when waiting for pod "apiserver-65c45c94d5-6rpjd" terminating: global timeout reached: 1m30s, error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s, error when waiting for pod "console-6cf648c696-gqzk6" terminating: global timeout reached: 1m30s, error when waiting for pod "multus-admission-controller-6f54b6494-8v9ws" terminating: global timeout reached: 1m30s, error when waiting for pod "managed-upgrade-operator-799b6d8974-nhbjn" terminating: global timeout reached: 1m30s] I0503 15:38:47.117907 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 15:39:01.051732 1 drain_controller.go:142] evicting pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:39:01.051766 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal I0503 15:39:01.051768 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:39:01.051754 1 drain_controller.go:142] evicting pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal I0503 15:39:01.051753 1 drain_controller.go:142] evicting pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal I0503 15:40:16.499623 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:40:16.899279 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal I0503 15:40:17.099157 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal I0503 15:40:17.301624 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal I0503 
15:40:31.699793 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 15:42:15.311844 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 15:42:27.003118 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:43:43.096534 1 request.go:682] Waited for 10.623474152s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:44:07.900120 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 15:48:54.508478 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 15:48:58.874832 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:50:32.894081 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 15:55:51.100778 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 15:56:04.770237 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal I0503 15:56:04.770246 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 15:56:51.496851 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal I0503 15:57:42.490381 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 16:01:42.703413 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:01:50.290563 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:03:22.091807 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. 
Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 16:07:10.314175 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:07:14.619850 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:08:01.502029 1 request.go:682] Waited for 5.582592435s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:08:45.704763 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 16:10:19.314321 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:10:27.599135 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:12:04.104785 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 16:17:48.137891 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:18:02.467945 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:19:37.705623 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 16:25:28.795958 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:25:36.650685 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:27:06.905900 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s I0503 16:30:42.457954 1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/state = Degraded I0503 16:30:42.457981 1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/reason = failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. 
Please see machine-config-controller logs for more information I0503 16:30:42.458025 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/state=Degraded I0503 16:30:42.458039 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/reason=failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information I0503 16:30:47.466109 1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information I0503 16:30:52.537676 1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information I0503 16:31:37.317970 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:31:41.908812 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:33:12.137370 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:33:12.137419 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:33:12.137430 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:33:15.384961 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:34:45.408037 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:38:01.143850 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:38:01.143864 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:38:04.711285 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:39:34.728154 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. 
Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:43:06.693748 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:43:06.693761 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:43:09.974369 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:44:39.992050 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:45:40.242252 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:45:40.242263 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:45:43.846551 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:46:36.631592 1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information I0503 16:47:13.864248 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:48:13.214901 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:48:13.214914 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:48:16.382573 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:49:46.400574 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:53:19.277354 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:53:19.277368 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:53:22.536138 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:54:52.552356 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. 
Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s E0503 16:58:25.169846 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry. I0503 16:58:25.169861 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain I0503 16:58:28.907471 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal I0503 16:59:58.923551 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global I0503 16:59:58.923551 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s bash-3.2$ oc project openshift-kube-controller-manager Now using project "openshift-kube-controller-manager" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443". (reverse-i-search)`': bash-3.2$ oc get pods NAME READY STATUS RESTARTS AGE installer-9-ip-10-0-218-240.us-west-2.compute.internal 0/1 Terminating 0 16h kube-controller-manager-guard-ip-10-0-129-169.us-west-2.compute.internal 1/1 Running 0 113m kube-controller-manager-guard-ip-10-0-176-172.us-west-2.compute.internal 1/1 Running 0 92m kube-controller-manager-ip-10-0-129-169.us-west-2.compute.internal 4/4 Running 7 (56m ago) 177m kube-controller-manager-ip-10-0-176-172.us-west-2.compute.internal 4/4 Running 4 179m kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal 4/4 Running 2 (67m ago) 3h revision-pruner-11-ip-10-0-129-169.us-west-2.compute.internal 0/1 Completed 0 121m revision-pruner-11-ip-10-0-176-172.us-west-2.compute.internal 0/1 Completed 0 101m (reverse-i-search)`de': oc describe node/ip-10-0-218-240.us-west-2.compute.internal bash-3.2$ oc describe pod/installer-9-ip-10-0-218-240.us-west-2.compute.internal Name: installer-9-ip-10-0-218-240.us-west-2.compute.internal Namespace: openshift-kube-controller-manager Priority: 2000001000 Priority Class Name: system-node-critical Service Account: installer-sa Node: ip-10-0-218-240.us-west-2.compute.internal/10.0.218.240 Start Time: Tue, 02 May 2023 19:17:31 -0500 Labels: app=installer Annotations: k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.129.0.43/23"],"mac_address":"0a:58:0a:81:00:2b","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.43/23"... 
k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.129.0.43" ], "mac": "0a:58:0a:81:00:2b", "default": true, "dns": {} }] Status: Terminating (lasts 16h) Termination Grace Period: 30s IP: 10.129.0.43 IPs: IP: 10.129.0.43 Containers: installer: Container ID: cri-o://8bae6acb523c145e55b86720fed4bb81c95a8a4e1295c4c901057038c780ce55 Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7 Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7 Port: <none> Host Port: <none> Command: cluster-kube-controller-manager-operator installer Args: -v=2 --revision=9 --namespace=openshift-kube-controller-manager --pod=kube-controller-manager-pod --resource-dir=/etc/kubernetes/static-pod-resources --pod-manifest-dir=/etc/kubernetes/manifests --configmaps=kube-controller-manager-pod --configmaps=config --configmaps=cluster-policy-controller-config --configmaps=controller-manager-kubeconfig --optional-configmaps=cloud-config --configmaps=kube-controller-cert-syncer-kubeconfig --configmaps=serviceaccount-ca --configmaps=service-ca --configmaps=recycler-config --secrets=service-account-private-key --optional-secrets=serving-cert --secrets=localhost-recovery-client-token --cert-dir=/etc/kubernetes/static-pod-resources/kube-controller-manager-certs --cert-configmaps=aggregator-client-ca --cert-configmaps=client-ca --optional-cert-configmaps=trusted-ca-bundle --cert-secrets=kube-controller-manager-client-cert-key --cert-secrets=csr-signer State: Terminated Reason: Error Message: 0] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ... I0503 00:18:04.654371 1 cmd.go:218] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ... I0503 00:18:04.654385 1 cmd.go:226] Getting secrets ... I0503 00:18:04.657165 1 copy.go:32] Got secret openshift-kube-controller-manager/localhost-recovery-client-token-9 I0503 00:18:04.659036 1 copy.go:32] Got secret openshift-kube-controller-manager/service-account-private-key-9 I0503 00:18:04.729178 1 copy.go:32] Got secret openshift-kube-controller-manager/serving-cert-9 I0503 00:18:04.729221 1 cmd.go:239] Getting config maps ... I0503 00:18:04.731598 1 copy.go:60] Got configMap openshift-kube-controller-manager/cluster-policy-controller-config-9 I0503 00:18:04.733267 1 copy.go:60] Got configMap openshift-kube-controller-manager/config-9 I0503 00:18:04.734877 1 copy.go:60] Got configMap openshift-kube-controller-manager/controller-manager-kubeconfig-9 I0503 00:18:04.738125 1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-cert-syncer-kubeconfig-9 I0503 00:18:04.740282 1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-manager-pod-9 I0503 00:18:04.850337 1 copy.go:60] Got configMap openshift-kube-controller-manager/recycler-config-9 I0503 00:18:05.052508 1 copy.go:60] Got configMap openshift-kube-controller-manager/service-ca-9 I0503 00:18:05.253415 1 copy.go:60] Got configMap openshift-kube-controller-manager/serviceaccount-ca-9 I0503 00:18:05.291982 1 cmd.go:124] Received SIGTERM or SIGINT signal, shutting down the process. 
I0503 00:18:05.292067 1 copy.go:52] Failed to get config map openshift-kube-controller-manager/cloud-config-9: client rate limiter Wait returned an error: context canceled F0503 00:18:05.451745 1 cmd.go:106] failed to copy: client rate limiter Wait returned an error: context canceled Exit Code: 1 Started: Tue, 02 May 2023 19:17:34 -0500 Finished: Tue, 02 May 2023 19:18:05 -0500 Ready: False Restart Count: 0 Limits: cpu: 150m memory: 200M Requests: cpu: 150m memory: 200M Environment: POD_NAME: installer-9-ip-10-0-218-240.us-west-2.compute.internal (v1:metadata.name) NODE_NAME: (v1:spec.nodeName) Mounts: /etc/kubernetes/ from kubelet-dir (rw) /var/lock from var-lock (rw) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro) Conditions: Type Status DisruptionTarget True Initialized True Ready False ContainersReady False PodScheduled True Volumes: kubelet-dir: Type: HostPath (bare host directory volume) Path: /etc/kubernetes/ HostPathType: var-lock: Type: HostPath (bare host directory volume) Path: /var/lock HostPathType: kube-api-access: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3600 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true QoS Class: Guaranteed Node-Selectors: <none> Tolerations: op=Exists
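The Degraded reason above repeatedly points at the machine-config-controller logs. A minimal sketch of how to pull them and re-check the node's degraded annotation, assuming the default MCO layout (machine-config-controller Deployment in openshift-machine-config-operator):

# Assumes the default machine-config-operator layout; adjust names if your cluster differs.
oc -n openshift-machine-config-operator logs deployment/machine-config-controller --tail=200
oc get node ip-10-0-218-240.us-west-2.compute.internal \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/reason}'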