OpenShift Bugs / OCPBUGS-13068

kube-controller-manager operator installer pod terminated with an error and stuck in Terminating, causing upgrades to hang on node drain failure


Details

    • Bug
    • Resolution: Done
    • Normal
    • None
    • 4.13
    • Node / Kubelet
    • No
    • OCPNODE Sprint 237 (Blue), OCPNODE Sprint 238 (Blue)
    • 2
    • False

    Description

      Description of problem:

      While upgrading a loaded 120-node cluster (ROSA), one of the control-plane nodes fails to drain, causing the upgrade to be stuck.

      Version-Release number of selected component (if applicable):

      4.13.0-rc.4 to 4.13.0-rc.6

      How reproducible:

      Happened on one attempt

      Steps to Reproduce:

      1. Install a 120 node cluster
      2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper)
      3. Upgrade the cluster (see the sketch below)
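
      A rough sketch of the flow, for reference. The wrapper's script and variable names (run.sh, WORKLOAD, ITERATIONS, GC) are assumptions based on the linked repo, and on ROSA the upgrade itself is normally scheduled through the managed-upgrade-operator rather than a plain oc adm upgrade:

      # Load the 120-node cluster with cluster-density-v1 and garbage collection disabled
      git clone https://github.com/cloud-bulldozer/e2e-benchmarking
      cd e2e-benchmarking/workloads/kube-burner-ocp-wrapper
      WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh

      # Kick off the upgrade (illustrative only; the managed flow schedules this instead)
      oc adm upgrade --to=4.13.0-rc.6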
      

      Actual results:

      The upgrade is stuck because the control-plane MCP never finishes updating. Manual intervention (deleting the pod stuck in Terminating) was required to move the upgrade along.
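
      Since the installer pod was already in Terminating, the manual deletion most likely required a force delete along these lines (pod name taken from the output below):

      # Force-remove the Terminating installer pod so the node drain can complete
      oc -n openshift-kube-controller-manager delete pod \
          installer-9-ip-10-0-218-240.us-west-2.compute.internal \
          --force --grace-period=0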

      Expected results:

      The upgrade should succeed without any manual intervention.

      Additional info:

      bash-3.2$ oc project openshift-machine-api
      Now using project "openshift-machine-api" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443".
      bash-3.2$ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-74371c0a6402ad69951f43db090a5937   False     True       True       3              2                   2                     1                      17h
      worker   rendered-worker-06da68164c0fcd25c54fc3cffc504e7d   True      False      False      186            186                 186                   0                      17h
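
      As a side note, the drain-failure reason behind the degraded master pool can be read straight off the MachineConfigPool conditions (generic diagnostic, not part of the captured session):

      # Show why the master pool reports NodeDegraded
      oc get mcp master -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'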
      
      bash-3.2$ oc get nodes | grep control
      ip-10-0-129-169.us-west-2.compute.internal   Ready                      control-plane,master   17h     v1.26.3+b404935
      ip-10-0-176-172.us-west-2.compute.internal   Ready                      control-plane,master   17h     v1.26.3+b404935
      ip-10-0-218-240.us-west-2.compute.internal   Ready,SchedulingDisabled   control-plane,master   17h     v1.26.3+befad9d
      
      bash-3.2$ oc describe node/ip-10-0-218-240.us-west-2.compute.internal
      Name:               ip-10-0-218-240.us-west-2.compute.internal
      Roles:              control-plane,master
      Labels:             beta.kubernetes.io/arch=amd64
                          beta.kubernetes.io/instance-type=m5.8xlarge
                          beta.kubernetes.io/os=linux
                          failure-domain.beta.kubernetes.io/region=us-west-2
                          failure-domain.beta.kubernetes.io/zone=us-west-2c
                          kubernetes.io/arch=amd64
                          kubernetes.io/hostname=ip-10-0-218-240.us-west-2.compute.internal
                          kubernetes.io/os=linux
                          node-role.kubernetes.io/control-plane=
                          node-role.kubernetes.io/master=
                          node.kubernetes.io/instance-type=m5.8xlarge
                          node.openshift.io/os_id=rhcos
                          topology.ebs.csi.aws.com/zone=us-west-2c
                          topology.kubernetes.io/region=us-west-2
                          topology.kubernetes.io/zone=us-west-2c
      Annotations:        cloud.network.openshift.io/egress-ipconfig:
                            [{"interface":"eni-0d00e83bfcf951d97","ifaddr":{"ipv4":"10.0.192.0/19"},"capacity":{"ipv4":29,"ipv6":30}}]
                          csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fcda6bf3578f7407"}
                          k8s.ovn.org/host-addresses: ["10.0.218.240"]
                          k8s.ovn.org/l3-gateway-config:
                            {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-218-240.us-west-2.compute.internal","mac-address":"0a:95:ef:fa:9c:17","ip-addres...
                          k8s.ovn.org/node-chassis-id: c7c5d262-341e-481c-804a-da6b4a085e63
                          k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.4/16"}
                          k8s.ovn.org/node-mgmt-port-mac-address: 72:35:cc:3d:dc:90
                          k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.218.240/19"}
                          k8s.ovn.org/node-subnets: {"default":["10.129.0.0/23"]}
                          machine.openshift.io/machine: openshift-machine-api/test-upgrade-g9wl2-master-2
                          machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                          machineconfiguration.openshift.io/currentConfig: rendered-master-74371c0a6402ad69951f43db090a5937
                          machineconfiguration.openshift.io/desiredConfig: rendered-master-bdb8565e5d621ced44f3ebd66713dc05
                          machineconfiguration.openshift.io/desiredDrain: drain-rendered-master-bdb8565e5d621ced44f3ebd66713dc05
                          machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-74371c0a6402ad69951f43db090a5937
                          machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 4110931
                          machineconfiguration.openshift.io/reason:
                            failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more informat...
                          machineconfiguration.openshift.io/state: Degraded
                          volumes.kubernetes.io/controller-managed-attach-detach: true
      CreationTimestamp:  Tue, 02 May 2023 18:39:17 -0500
      Taints:             node-role.kubernetes.io/master:NoSchedule
                          node.kubernetes.io/unschedulable:NoSchedule
      Unschedulable:      true
      Lease:
        HolderIdentity:  ip-10-0-218-240.us-west-2.compute.internal
        AcquireTime:     <unset>
        RenewTime:       Wed, 03 May 2023 12:01:43 -0500
      Conditions:
        Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
        ----             ------  -----------------                 ------------------                ------                       -------
        MemoryPressure   False   Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
        DiskPressure     False   Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
        PIDPressure      False   Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
        Ready            True    Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletReady                 kubelet is posting ready status
      Addresses:
        InternalIP:   10.0.218.240
        Hostname:     ip-10-0-218-240.us-west-2.compute.internal
        InternalDNS:  ip-10-0-218-240.us-west-2.compute.internal
      Capacity:
        attachable-volumes-aws-ebs:  25
        cpu:                         32
        ephemeral-storage:           366410732Ki
        hugepages-1Gi:               0
        hugepages-2Mi:               0
        memory:                      130397904Ki
        pods:                        250
      Allocatable:
        attachable-volumes-aws-ebs:  25
        cpu:                         31850m
        ephemeral-storage:           336610388229
        hugepages-1Gi:               0
        hugepages-2Mi:               0
        memory:                      120858320Ki
        pods:                        250
      System Info:
        Machine ID:                             ec21357d1e7ff0abc0f899ce50f1ed57
        System UUID:                            ec21357d-1e7f-f0ab-c0f8-99ce50f1ed57
        Boot ID:                                8ed83c2e-bb8c-47cf-9a5c-8b50db65f45a
        Kernel Version:                         5.14.0-284.10.1.el9_2.x86_64
        OS Image:                               Red Hat Enterprise Linux CoreOS 413.92.202304140330-0 (Plow)
        Operating System:                       linux
        Architecture:                           amd64
        Container Runtime Version:              cri-o://1.26.3-3.rhaos4.13.git641290e.el9
        Kubelet Version:                        v1.26.3+befad9d
        Kube-Proxy Version:                     v1.26.3+befad9d
      ProviderID:                               aws:///us-west-2c/i-0fcda6bf3578f7407
      Non-terminated Pods:                      (22 in total)
        Namespace                               Name                                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------                               ----                                                                   ------------  ----------  ---------------  -------------  ---
        openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-hr6fx                                          30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         158m
        openshift-cluster-node-tuning-operator  tuned-c24fg                                                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         162m
        openshift-dns                           dns-default-n8nzs                                                      60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         128m
        openshift-dns                           node-resolver-9d4d8                                                    5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         134m
        openshift-etcd                          etcd-ip-10-0-218-240.us-west-2.compute.internal                        360m (1%)     0 (0%)      910Mi (0%)       0 (0%)         3h9m
        openshift-image-registry                node-ca-l58ct                                                          10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         164m
        openshift-kube-apiserver                kube-apiserver-ip-10-0-218-240.us-west-2.compute.internal              290m (0%)     0 (0%)      1224Mi (1%)      0 (0%)         3h10m
        openshift-kube-controller-manager       kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal     80m (0%)      0 (0%)      500Mi (0%)       0 (0%)         179m
        openshift-kube-scheduler                openshift-kube-scheduler-ip-10-0-218-240.us-west-2.compute.internal    25m (0%)      0 (0%)      150Mi (0%)       0 (0%)         178m
        openshift-machine-config-operator       machine-config-daemon-5rrrx                                            40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         126m
        openshift-machine-config-operator       machine-config-server-mgvkz                                            20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         123m
        openshift-monitoring                    node-exporter-x8sf4                                                    9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         164m
        openshift-monitoring                    sre-dns-latency-exporter-wn8rf                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
        openshift-multus                        multus-additional-cni-plugins-jfcwt                                    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         156m
        openshift-multus                        multus-zfjjh                                                           10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         159m
        openshift-multus                        network-metrics-daemon-7h52k                                           20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         160m
        openshift-network-diagnostics           network-check-target-2pwkk                                             10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         159m
        openshift-ovn-kubernetes                ovnkube-master-q2tg5                                                   60m (0%)      0 (0%)      1520Mi (1%)      0 (0%)         140m
        openshift-ovn-kubernetes                ovnkube-node-j4p2h                                                     50m (0%)      0 (0%)      660Mi (0%)       0 (0%)         156m
        openshift-security                      audit-exporter-s9ms6                                                   100m (0%)     100m (0%)   256Mi (0%)       256Mi (0%)     16h
        openshift-security                      splunkforwarder-ds-9jgfs                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
        openshift-validation-webhook            validation-webhook-txrkw                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h34m
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource                    Requests     Limits
        --------                    --------     ------
        cpu                         1199m (3%)   100m (0%)
        memory                      5968Mi (5%)  256Mi (0%)
        ephemeral-storage           0 (0%)       0 (0%)
        hugepages-1Gi               0 (0%)       0 (0%)
        hugepages-2Mi               0 (0%)       0 (0%)
        attachable-volumes-aws-ebs  0            0
      Events:
        Type     Reason                     Age                From                 Message
        ----     ------                     ----               ----                 -------
        Normal   RegisteredNode             5h23m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             4h53m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             4h42m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             3h42m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             3h12m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             178m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             177m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   ConfigDriftMonitorStarted  126m               machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-74371c0a6402ad69951f43db090a5937
        Normal   RegisteredNode             116m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             106m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   ConfigDriftMonitorStopped  91m                machineconfigdaemon  Config Drift Monitor stopped
        Normal   Cordon                     91m                machineconfigdaemon  Cordoned node to apply update
        Normal   Drain                      91m                machineconfigdaemon  Draining node to update config.
        Normal   NodeNotSchedulable         89m (x2 over 16h)  kubelet              Node ip-10-0-218-240.us-west-2.compute.internal status is now: NodeNotSchedulable
        Normal   RegisteredNode             65m                node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             55m                node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Warning  FailedToDrain              31m                machineconfigdaemon  failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      
      I0503 15:34:09.562518       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-866f8c587c-js6k9
      I0503 15:34:09.562576       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-86f8f7df97-ctgz8" terminating: global timeout reached: 1m30s, error when waiting for pod "pod-identity-webhook-84b6dfbf4-kg9sn" terminating: global timeout reached: 1m30s, error when waiting for pod "oauth-openshift-6b595d45b4-t7vsn" terminating: global timeout reached: 1m30s, error when waiting for pod "apiserver-65c45c94d5-6rpjd" terminating: global timeout reached: 1m30s, error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s, error when waiting for pod "console-6cf648c696-gqzk6" terminating: global timeout reached: 1m30s, error when waiting for pod "multus-admission-controller-6f54b6494-8v9ws" terminating: global timeout reached: 1m30s, error when waiting for pod "managed-upgrade-operator-799b6d8974-nhbjn" terminating: global timeout reached: 1m30s]
      I0503 15:38:47.117907       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:39:01.051732       1 drain_controller.go:142] evicting pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051766       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051768       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051754       1 drain_controller.go:142] evicting pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051753       1 drain_controller.go:142] evicting pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:16.499623       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:16.899279       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:17.099157       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:17.301624       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:31.699793       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 15:42:15.311844       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:42:27.003118       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:43:43.096534       1 request.go:682] Waited for 10.623474152s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:44:07.900120       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 15:48:54.508478       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:48:58.874832       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:50:32.894081       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 15:55:51.100778       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:56:04.770237       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:56:04.770246       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:56:51.496851       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:57:42.490381       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:01:42.703413       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:01:50.290563       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:03:22.091807       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:07:10.314175       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:07:14.619850       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:08:01.502029       1 request.go:682] Waited for 5.582592435s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:08:45.704763       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:10:19.314321       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:10:27.599135       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:12:04.104785       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:17:48.137891       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:18:02.467945       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:19:37.705623       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:25:28.795958       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:25:36.650685       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:27:06.905900       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:30:42.457954       1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/state = Degraded
      I0503 16:30:42.457981       1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/reason = failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:30:42.458025       1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/state=Degraded
      I0503 16:30:42.458039       1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/reason=failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:30:47.466109       1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:30:52.537676       1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:31:37.317970       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:31:41.908812       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:33:12.137370       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:33:12.137419       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:33:12.137430       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:33:15.384961       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:34:45.408037       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:38:01.143850       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:38:01.143864       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:38:04.711285       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:39:34.728154       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:43:06.693748       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:43:06.693761       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:43:09.974369       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:44:39.992050       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:45:40.242252       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:45:40.242263       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:45:43.846551       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:46:36.631592       1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:47:13.864248       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:48:13.214901       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:48:13.214914       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:48:16.382573       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:49:46.400574       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:53:19.277354       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:53:19.277368       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:53:22.536138       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:54:52.552356       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:58:25.169846       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:58:25.169861       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:58:28.907471       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:59:58.923551       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
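
      To see which pod is actually blocking the drain, listing what is still scheduled on the node is a quick check (generic diagnostic, not taken from this session):

      # Anything not Running on the node; the stuck installer pod shows up as Terminating
      oc get pods -A --field-selector spec.nodeName=ip-10-0-218-240.us-west-2.compute.internal | grep -v Running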
      bash-3.2$ oc project openshift-kube-controller-manager
      Now using project "openshift-kube-controller-manager" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443".
      bash-3.2$ oc get pods
      NAME                                                                       READY   STATUS        RESTARTS      AGE
      installer-9-ip-10-0-218-240.us-west-2.compute.internal                     0/1     Terminating   0             16h
      kube-controller-manager-guard-ip-10-0-129-169.us-west-2.compute.internal   1/1     Running       0             113m
      kube-controller-manager-guard-ip-10-0-176-172.us-west-2.compute.internal   1/1     Running       0             92m
      kube-controller-manager-ip-10-0-129-169.us-west-2.compute.internal         4/4     Running       7 (56m ago)   177m
      kube-controller-manager-ip-10-0-176-172.us-west-2.compute.internal         4/4     Running       4             179m
      kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal         4/4     Running       2 (67m ago)   3h
      revision-pruner-11-ip-10-0-129-169.us-west-2.compute.internal              0/1     Completed     0             121m
      revision-pruner-11-ip-10-0-176-172.us-west-2.compute.internal              0/1     Completed     0             101m
      bash-3.2$ oc describe pod/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      Name:                      installer-9-ip-10-0-218-240.us-west-2.compute.internal
      Namespace:                 openshift-kube-controller-manager
      Priority:                  2000001000
      Priority Class Name:       system-node-critical
      Service Account:           installer-sa
      Node:                      ip-10-0-218-240.us-west-2.compute.internal/10.0.218.240
      Start Time:                Tue, 02 May 2023 19:17:31 -0500
      Labels:                    app=installer
      Annotations:               k8s.ovn.org/pod-networks:
                                   {"default":{"ip_addresses":["10.129.0.43/23"],"mac_address":"0a:58:0a:81:00:2b","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.43/23"...
                                 k8s.v1.cni.cncf.io/network-status:
                                   [{
                                       "name": "ovn-kubernetes",
                                       "interface": "eth0",
                                       "ips": [
                                           "10.129.0.43"
                                       ],
                                       "mac": "0a:58:0a:81:00:2b",
                                       "default": true,
                                       "dns": {}
                                   }]
      Status:                    Terminating (lasts 16h)
      Termination Grace Period:  30s
      IP:                        10.129.0.43
      IPs:
        IP:  10.129.0.43
      Containers:
        installer:
          Container ID:  cri-o://8bae6acb523c145e55b86720fed4bb81c95a8a4e1295c4c901057038c780ce55
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7
          Port:          <none>
          Host Port:     <none>
          Command:
            cluster-kube-controller-manager-operator
            installer
          Args:
            -v=2
            --revision=9
            --namespace=openshift-kube-controller-manager
            --pod=kube-controller-manager-pod
            --resource-dir=/etc/kubernetes/static-pod-resources
            --pod-manifest-dir=/etc/kubernetes/manifests
            --configmaps=kube-controller-manager-pod
            --configmaps=config
            --configmaps=cluster-policy-controller-config
            --configmaps=controller-manager-kubeconfig
            --optional-configmaps=cloud-config
            --configmaps=kube-controller-cert-syncer-kubeconfig
            --configmaps=serviceaccount-ca
            --configmaps=service-ca
            --configmaps=recycler-config
            --secrets=service-account-private-key
            --optional-secrets=serving-cert
            --secrets=localhost-recovery-client-token
            --cert-dir=/etc/kubernetes/static-pod-resources/kube-controller-manager-certs
            --cert-configmaps=aggregator-client-ca
            --cert-configmaps=client-ca
            --optional-cert-configmaps=trusted-ca-bundle
            --cert-secrets=kube-controller-manager-client-cert-key
            --cert-secrets=csr-signer
          State:      Terminated
            Reason:   Error
            Message:  0] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ...
      I0503 00:18:04.654371       1 cmd.go:218] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ...
      I0503 00:18:04.654385       1 cmd.go:226] Getting secrets ...
      I0503 00:18:04.657165       1 copy.go:32] Got secret openshift-kube-controller-manager/localhost-recovery-client-token-9
      I0503 00:18:04.659036       1 copy.go:32] Got secret openshift-kube-controller-manager/service-account-private-key-9
      I0503 00:18:04.729178       1 copy.go:32] Got secret openshift-kube-controller-manager/serving-cert-9
      I0503 00:18:04.729221       1 cmd.go:239] Getting config maps ...
      I0503 00:18:04.731598       1 copy.go:60] Got configMap openshift-kube-controller-manager/cluster-policy-controller-config-9
      I0503 00:18:04.733267       1 copy.go:60] Got configMap openshift-kube-controller-manager/config-9
      I0503 00:18:04.734877       1 copy.go:60] Got configMap openshift-kube-controller-manager/controller-manager-kubeconfig-9
      I0503 00:18:04.738125       1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-cert-syncer-kubeconfig-9
      I0503 00:18:04.740282       1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-manager-pod-9
      I0503 00:18:04.850337       1 copy.go:60] Got configMap openshift-kube-controller-manager/recycler-config-9
      I0503 00:18:05.052508       1 copy.go:60] Got configMap openshift-kube-controller-manager/service-ca-9
      I0503 00:18:05.253415       1 copy.go:60] Got configMap openshift-kube-controller-manager/serviceaccount-ca-9
      I0503 00:18:05.291982       1 cmd.go:124] Received SIGTERM or SIGINT signal, shutting down the process.
      I0503 00:18:05.292067       1 copy.go:52] Failed to get config map openshift-kube-controller-manager/cloud-config-9: client rate limiter Wait returned an error: context canceled
      F0503 00:18:05.451745       1 cmd.go:106] failed to copy: client rate limiter Wait returned an error: context canceled
      
            Exit Code:    1
            Started:      Tue, 02 May 2023 19:17:34 -0500
            Finished:     Tue, 02 May 2023 19:18:05 -0500
          Ready:          False
          Restart Count:  0
          Limits:
            cpu:     150m
            memory:  200M
          Requests:
            cpu:     150m
            memory:  200M
          Environment:
            POD_NAME:   installer-9-ip-10-0-218-240.us-west-2.compute.internal (v1:metadata.name)
            NODE_NAME:   (v1:spec.nodeName)
          Mounts:
            /etc/kubernetes/ from kubelet-dir (rw)
            /var/lock from var-lock (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
      Conditions:
        Type               Status
        DisruptionTarget   True 
        Initialized        True 
        Ready              False 
        ContainersReady    False 
        PodScheduled       True 
      Volumes:
        kubelet-dir:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/kubernetes/
          HostPathType:  
        var-lock:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lock
          HostPathType:  
        kube-api-access:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3600
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
      QoS Class:                   Guaranteed
      Node-Selectors:              <none>
      Tolerations:                 op=Exists
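
      Given that the installer container exited on 02 May at 19:18:05 yet the pod stayed in Terminating for 16h with only a 30s grace period, a hypothetical follow-up (not run in this session) would be to confirm the state on the node itself:

      # Check CRI-O directly for the installer container on the affected node
      oc debug node/ip-10-0-218-240.us-west-2.compute.internal -- chroot /host crictl ps -a | grep installer-9

      # Confirm the API still shows a deletionTimestamp / lingering finalizers on the pod
      oc -n openshift-kube-controller-manager get pod installer-9-ip-10-0-218-240.us-west-2.compute.internal \
          -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'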

          People

            svanka@redhat.com Sai Ramesh Vanka
            smalleni@redhat.com Sai Sindhur Malleni
            ying zhou ying zhou