Type: Bug
Resolution: Done
Priority: Normal
Version: 4.13
Impact: Quality / Stability / Reliability
Sprint: OCPNODE Sprint 237 (Blue), OCPNODE Sprint 238 (Blue)
Story Points: 2
Description of problem:
While trying to upgrade a loaded 120-node cluster (ROSA), one of the control plane nodes fails to drain, causing the upgrade to get stuck.
Version-Release number of selected component (if applicable):
4.13.0-rc.4 to 4.13.0-rc.6
How reproducible:
Happened on one attempt
Steps to Reproduce:
1. Install a 120 node cluster
2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper); see the sketch after this list
3. Upgrade the cluster
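A rough sketch of the load step using the kube-burner-ocp-wrapper from the linked repository (the WORKLOAD/ITERATIONS/GC environment variable names follow that repo's README and are an assumption here, not copied from the original run):

git clone https://github.com/cloud-bulldozer/e2e-benchmarking
cd e2e-benchmarking/workloads/kube-burner-ocp-wrapper
# cluster-density-v1 workload, 4000 iterations, garbage collection of created objects disabled
WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh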
Actual results:
The upgrade is stuck because the control plane (master) MCP never finishes updating. Manual intervention was required to delete the pod stuck in Terminating in order to move the upgrade along.
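The exact manual intervention is not recorded here; for illustration, a force delete of the stuck pod (using the pod and namespace shown later in this report) would look like:

oc -n openshift-kube-controller-manager delete pod installer-9-ip-10-0-218-240.us-west-2.compute.internal --force --grace-period=0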
Expected results:
The upgrade should complete without any manual intervention.
Additional info:
bash-3.2$ oc project openshift-machine-api
Now using project "openshift-machine-api" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443".
bash-3.2$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-74371c0a6402ad69951f43db090a5937 False True True 3 2 2 1 17h
worker rendered-worker-06da68164c0fcd25c54fc3cffc504e7d True False False 186 186 186 0 17h
bash-3.2$ oc get nodes | grep control
ip-10-0-129-169.us-west-2.compute.internal Ready control-plane,master 17h v1.26.3+b404935
ip-10-0-176-172.us-west-2.compute.internal Ready control-plane,master 17h v1.26.3+b404935
ip-10-0-218-240.us-west-2.compute.internal Ready,SchedulingDisabled control-plane,master 17h v1.26.3+befad9d
bash-3.2$ oc describe node/ip-10-0-218-240.us-west-2.compute.internal
Name: ip-10-0-218-240.us-west-2.compute.internal
Roles: control-plane,master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.8xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-west-2
failure-domain.beta.kubernetes.io/zone=us-west-2c
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-218-240.us-west-2.compute.internal
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node-role.kubernetes.io/master=
node.kubernetes.io/instance-type=m5.8xlarge
node.openshift.io/os_id=rhcos
topology.ebs.csi.aws.com/zone=us-west-2c
topology.kubernetes.io/region=us-west-2
topology.kubernetes.io/zone=us-west-2c
Annotations: cloud.network.openshift.io/egress-ipconfig:
[{"interface":"eni-0d00e83bfcf951d97","ifaddr":{"ipv4":"10.0.192.0/19"},"capacity":{"ipv4":29,"ipv6":30}}]
csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fcda6bf3578f7407"}
k8s.ovn.org/host-addresses: ["10.0.218.240"]
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-218-240.us-west-2.compute.internal","mac-address":"0a:95:ef:fa:9c:17","ip-addres...
k8s.ovn.org/node-chassis-id: c7c5d262-341e-481c-804a-da6b4a085e63
k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.4/16"}
k8s.ovn.org/node-mgmt-port-mac-address: 72:35:cc:3d:dc:90
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.218.240/19"}
k8s.ovn.org/node-subnets: {"default":["10.129.0.0/23"]}
machine.openshift.io/machine: openshift-machine-api/test-upgrade-g9wl2-master-2
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-master-74371c0a6402ad69951f43db090a5937
machineconfiguration.openshift.io/desiredConfig: rendered-master-bdb8565e5d621ced44f3ebd66713dc05
machineconfiguration.openshift.io/desiredDrain: drain-rendered-master-bdb8565e5d621ced44f3ebd66713dc05
machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-74371c0a6402ad69951f43db090a5937
machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 4110931
machineconfiguration.openshift.io/reason:
failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more informat...
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 02 May 2023 18:39:17 -0500
Taints: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: ip-10-0-218-240.us-west-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 03 May 2023 12:01:43 -0500
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 03 May 2023 11:58:20 -0500 Tue, 02 May 2023 19:20:49 -0500 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.218.240
Hostname: ip-10-0-218-240.us-west-2.compute.internal
InternalDNS: ip-10-0-218-240.us-west-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 32
ephemeral-storage: 366410732Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 130397904Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 31850m
ephemeral-storage: 336610388229
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 120858320Ki
pods: 250
System Info:
Machine ID: ec21357d1e7ff0abc0f899ce50f1ed57
System UUID: ec21357d-1e7f-f0ab-c0f8-99ce50f1ed57
Boot ID: 8ed83c2e-bb8c-47cf-9a5c-8b50db65f45a
Kernel Version: 5.14.0-284.10.1.el9_2.x86_64
OS Image: Red Hat Enterprise Linux CoreOS 413.92.202304140330-0 (Plow)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.26.3-3.rhaos4.13.git641290e.el9
Kubelet Version: v1.26.3+befad9d
Kube-Proxy Version: v1.26.3+befad9d
ProviderID: aws:///us-west-2c/i-0fcda6bf3578f7407
Non-terminated Pods: (22 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-hr6fx 30m (0%) 0 (0%) 150Mi (0%) 0 (0%) 158m
openshift-cluster-node-tuning-operator tuned-c24fg 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 162m
openshift-dns dns-default-n8nzs 60m (0%) 0 (0%) 110Mi (0%) 0 (0%) 128m
openshift-dns node-resolver-9d4d8 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 134m
openshift-etcd etcd-ip-10-0-218-240.us-west-2.compute.internal 360m (1%) 0 (0%) 910Mi (0%) 0 (0%) 3h9m
openshift-image-registry node-ca-l58ct 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 164m
openshift-kube-apiserver kube-apiserver-ip-10-0-218-240.us-west-2.compute.internal 290m (0%) 0 (0%) 1224Mi (1%) 0 (0%) 3h10m
openshift-kube-controller-manager kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal 80m (0%) 0 (0%) 500Mi (0%) 0 (0%) 179m
openshift-kube-scheduler openshift-kube-scheduler-ip-10-0-218-240.us-west-2.compute.internal 25m (0%) 0 (0%) 150Mi (0%) 0 (0%) 178m
openshift-machine-config-operator machine-config-daemon-5rrrx 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 126m
openshift-machine-config-operator machine-config-server-mgvkz 20m (0%) 0 (0%) 50Mi (0%) 0 (0%) 123m
openshift-monitoring node-exporter-x8sf4 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 164m
openshift-monitoring sre-dns-latency-exporter-wn8rf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
openshift-multus multus-additional-cni-plugins-jfcwt 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 156m
openshift-multus multus-zfjjh 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 159m
openshift-multus network-metrics-daemon-7h52k 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 160m
openshift-network-diagnostics network-check-target-2pwkk 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 159m
openshift-ovn-kubernetes ovnkube-master-q2tg5 60m (0%) 0 (0%) 1520Mi (1%) 0 (0%) 140m
openshift-ovn-kubernetes ovnkube-node-j4p2h 50m (0%) 0 (0%) 660Mi (0%) 0 (0%) 156m
openshift-security audit-exporter-s9ms6 100m (0%) 100m (0%) 256Mi (0%) 256Mi (0%) 16h
openshift-security splunkforwarder-ds-9jgfs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
openshift-validation-webhook validation-webhook-txrkw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 3h34m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1199m (3%) 100m (0%)
memory 5968Mi (5%) 256Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RegisteredNode 5h23m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 4h53m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 4h42m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 3h42m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 3h12m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 178m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 177m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal ConfigDriftMonitorStarted 126m machineconfigdaemon Config Drift Monitor started, watching against rendered-master-74371c0a6402ad69951f43db090a5937
Normal RegisteredNode 116m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 106m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal ConfigDriftMonitorStopped 91m machineconfigdaemon Config Drift Monitor stopped
Normal Cordon 91m machineconfigdaemon Cordoned node to apply update
Normal Drain 91m machineconfigdaemon Draining node to update config.
Normal NodeNotSchedulable 89m (x2 over 16h) kubelet Node ip-10-0-218-240.us-west-2.compute.internal status is now: NodeNotSchedulable
Normal RegisteredNode 65m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Normal RegisteredNode 55m node-controller Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
Warning FailedToDrain 31m machineconfigdaemon failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
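The drain_controller lines below are from the machine-config-controller; they can presumably be retrieved with something along these lines (the deployment name is the standard one in the openshift-machine-config-operator namespace):

oc -n openshift-machine-config-operator logs deployment/machine-config-controller --tail=500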
I0503 15:34:09.562518 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-866f8c587c-js6k9
I0503 15:34:09.562576 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-86f8f7df97-ctgz8" terminating: global timeout reached: 1m30s, error when waiting for pod "pod-identity-webhook-84b6dfbf4-kg9sn" terminating: global timeout reached: 1m30s, error when waiting for pod "oauth-openshift-6b595d45b4-t7vsn" terminating: global timeout reached: 1m30s, error when waiting for pod "apiserver-65c45c94d5-6rpjd" terminating: global timeout reached: 1m30s, error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s, error when waiting for pod "console-6cf648c696-gqzk6" terminating: global timeout reached: 1m30s, error when waiting for pod "multus-admission-controller-6f54b6494-8v9ws" terminating: global timeout reached: 1m30s, error when waiting for pod "managed-upgrade-operator-799b6d8974-nhbjn" terminating: global timeout reached: 1m30s]
I0503 15:38:47.117907 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 15:39:01.051732 1 drain_controller.go:142] evicting pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:39:01.051766 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:39:01.051768 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:39:01.051754 1 drain_controller.go:142] evicting pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:39:01.051753 1 drain_controller.go:142] evicting pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:40:16.499623 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:40:16.899279 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:40:17.099157 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:40:17.301624 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:40:31.699793 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 15:42:15.311844 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 15:42:27.003118 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:43:43.096534 1 request.go:682] Waited for 10.623474152s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:44:07.900120 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 15:48:54.508478 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 15:48:58.874832 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:50:32.894081 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 15:55:51.100778 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 15:56:04.770237 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:56:04.770246 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:56:51.496851 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
I0503 15:57:42.490381 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 16:01:42.703413 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:01:50.290563 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:03:22.091807 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 16:07:10.314175 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:07:14.619850 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:08:01.502029 1 request.go:682] Waited for 5.582592435s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:08:45.704763 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 16:10:19.314321 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:10:27.599135 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:12:04.104785 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 16:17:48.137891 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:18:02.467945 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:19:37.705623 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 16:25:28.795958 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:25:36.650685 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:27:06.905900 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
I0503 16:30:42.457954 1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/state = Degraded
I0503 16:30:42.457981 1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/reason = failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
I0503 16:30:42.458025 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/state=Degraded
I0503 16:30:42.458039 1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/reason=failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
I0503 16:30:47.466109 1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
I0503 16:30:52.537676 1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
I0503 16:31:37.317970 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:31:41.908812 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:33:12.137370 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:33:12.137419 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:33:12.137430 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:33:15.384961 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:34:45.408037 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:38:01.143850 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:38:01.143864 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:38:04.711285 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:39:34.728154 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:43:06.693748 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:43:06.693761 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:43:09.974369 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:44:39.992050 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:45:40.242252 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:45:40.242263 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:45:43.846551 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:46:36.631592 1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
I0503 16:47:13.864248 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:48:13.214901 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:48:13.214914 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:48:16.382573 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:49:46.400574 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:53:19.277354 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:53:19.277368 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:53:22.536138 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:54:52.552356 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
E0503 16:58:25.169846 1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
I0503 16:58:25.169861 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
I0503 16:58:28.907471 1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
I0503 16:59:58.923551 1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
bash-3.2$ oc project openshift-kube-controller-manager
Now using project "openshift-kube-controller-manager" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443".
bash-3.2$ oc get pods
NAME READY STATUS RESTARTS AGE
installer-9-ip-10-0-218-240.us-west-2.compute.internal 0/1 Terminating 0 16h
kube-controller-manager-guard-ip-10-0-129-169.us-west-2.compute.internal 1/1 Running 0 113m
kube-controller-manager-guard-ip-10-0-176-172.us-west-2.compute.internal 1/1 Running 0 92m
kube-controller-manager-ip-10-0-129-169.us-west-2.compute.internal 4/4 Running 7 (56m ago) 177m
kube-controller-manager-ip-10-0-176-172.us-west-2.compute.internal 4/4 Running 4 179m
kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal 4/4 Running 2 (67m ago) 3h
revision-pruner-11-ip-10-0-129-169.us-west-2.compute.internal 0/1 Completed 0 121m
revision-pruner-11-ip-10-0-176-172.us-west-2.compute.internal 0/1 Completed 0 101m
bash-3.2$ oc describe pod/installer-9-ip-10-0-218-240.us-west-2.compute.internal
Name: installer-9-ip-10-0-218-240.us-west-2.compute.internal
Namespace: openshift-kube-controller-manager
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: installer-sa
Node: ip-10-0-218-240.us-west-2.compute.internal/10.0.218.240
Start Time: Tue, 02 May 2023 19:17:31 -0500
Labels: app=installer
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.129.0.43/23"],"mac_address":"0a:58:0a:81:00:2b","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.43/23"...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.129.0.43"
],
"mac": "0a:58:0a:81:00:2b",
"default": true,
"dns": {}
}]
Status: Terminating (lasts 16h)
Termination Grace Period: 30s
IP: 10.129.0.43
IPs:
IP: 10.129.0.43
Containers:
installer:
Container ID: cri-o://8bae6acb523c145e55b86720fed4bb81c95a8a4e1295c4c901057038c780ce55
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7
Port: <none>
Host Port: <none>
Command:
cluster-kube-controller-manager-operator
installer
Args:
-v=2
--revision=9
--namespace=openshift-kube-controller-manager
--pod=kube-controller-manager-pod
--resource-dir=/etc/kubernetes/static-pod-resources
--pod-manifest-dir=/etc/kubernetes/manifests
--configmaps=kube-controller-manager-pod
--configmaps=config
--configmaps=cluster-policy-controller-config
--configmaps=controller-manager-kubeconfig
--optional-configmaps=cloud-config
--configmaps=kube-controller-cert-syncer-kubeconfig
--configmaps=serviceaccount-ca
--configmaps=service-ca
--configmaps=recycler-config
--secrets=service-account-private-key
--optional-secrets=serving-cert
--secrets=localhost-recovery-client-token
--cert-dir=/etc/kubernetes/static-pod-resources/kube-controller-manager-certs
--cert-configmaps=aggregator-client-ca
--cert-configmaps=client-ca
--optional-cert-configmaps=trusted-ca-bundle
--cert-secrets=kube-controller-manager-client-cert-key
--cert-secrets=csr-signer
State: Terminated
Reason: Error
Message: 0] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ...
I0503 00:18:04.654371 1 cmd.go:218] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ...
I0503 00:18:04.654385 1 cmd.go:226] Getting secrets ...
I0503 00:18:04.657165 1 copy.go:32] Got secret openshift-kube-controller-manager/localhost-recovery-client-token-9
I0503 00:18:04.659036 1 copy.go:32] Got secret openshift-kube-controller-manager/service-account-private-key-9
I0503 00:18:04.729178 1 copy.go:32] Got secret openshift-kube-controller-manager/serving-cert-9
I0503 00:18:04.729221 1 cmd.go:239] Getting config maps ...
I0503 00:18:04.731598 1 copy.go:60] Got configMap openshift-kube-controller-manager/cluster-policy-controller-config-9
I0503 00:18:04.733267 1 copy.go:60] Got configMap openshift-kube-controller-manager/config-9
I0503 00:18:04.734877 1 copy.go:60] Got configMap openshift-kube-controller-manager/controller-manager-kubeconfig-9
I0503 00:18:04.738125 1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-cert-syncer-kubeconfig-9
I0503 00:18:04.740282 1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-manager-pod-9
I0503 00:18:04.850337 1 copy.go:60] Got configMap openshift-kube-controller-manager/recycler-config-9
I0503 00:18:05.052508 1 copy.go:60] Got configMap openshift-kube-controller-manager/service-ca-9
I0503 00:18:05.253415 1 copy.go:60] Got configMap openshift-kube-controller-manager/serviceaccount-ca-9
I0503 00:18:05.291982 1 cmd.go:124] Received SIGTERM or SIGINT signal, shutting down the process.
I0503 00:18:05.292067 1 copy.go:52] Failed to get config map openshift-kube-controller-manager/cloud-config-9: client rate limiter Wait returned an error: context canceled
F0503 00:18:05.451745 1 cmd.go:106] failed to copy: client rate limiter Wait returned an error: context canceled
Exit Code: 1
Started: Tue, 02 May 2023 19:17:34 -0500
Finished: Tue, 02 May 2023 19:18:05 -0500
Ready: False
Restart Count: 0
Limits:
cpu: 150m
memory: 200M
Requests:
cpu: 150m
memory: 200M
Environment:
POD_NAME: installer-9-ip-10-0-218-240.us-west-2.compute.internal (v1:metadata.name)
NODE_NAME: (v1:spec.nodeName)
Mounts:
/etc/kubernetes/ from kubelet-dir (rw)
/var/lock from var-lock (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
Conditions:
Type Status
DisruptionTarget True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kubelet-dir:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/
HostPathType:
var-lock:
Type: HostPath (bare host directory volume)
Path: /var/lock
HostPathType:
kube-api-access:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3600
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: op=Exists
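The installer container already exited (Exit Code 1 after receiving SIGTERM during the revision copy), yet the pod has been stuck in Terminating for 16h. One way to inspect the node-side container state (a sketch, not something captured in this report) is via a debug pod on the affected node:

oc debug node/ip-10-0-218-240.us-west-2.compute.internal -- chroot /host crictl ps -a --name installer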