OpenShift Bugs / OCPBUGS-13068

kube-controller-manager operator installer pod terminated with an error and stuck in Terminating, causing upgrades to hang on node drain failure


Details

    • Bug
    • Resolution: Done
    • Normal
    • None
    • 4.13
    • Node / Kubelet
    • No
    • OCPNODE Sprint 237 (Blue), OCPNODE Sprint 238 (Blue)
    • 2
    • False

    Description

      Description of problem:

      While upgrading a loaded 120-node cluster (ROSA), one of the control-plane nodes fails to drain, causing the upgrade to be stuck.

      Version-Release number of selected component (if applicable):

      4.13.0-rc.4 to 4.13.0-rc.6

      How reproducible:

      Happened on one attempt

      Steps to Reproduce:

      1. Install a 120 node cluster
      2. Load up the cluster using cluster-density-v1 with ITERATIONS=4000 and gc=false (https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner-ocp-wrapper)
      3. Upgrade the cluster (see the sketch below)
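
      A rough sketch of the flow, for reference. The wrapper's script and variable names (run.sh, WORKLOAD, ITERATIONS, GC) are assumptions based on the linked repo, and on ROSA the upgrade itself is normally scheduled through the managed-upgrade-operator rather than a plain oc adm upgrade:

      # Load the 120-node cluster with cluster-density-v1 and garbage collection disabled
      git clone https://github.com/cloud-bulldozer/e2e-benchmarking
      cd e2e-benchmarking/workloads/kube-burner-ocp-wrapper
      WORKLOAD=cluster-density-v1 ITERATIONS=4000 GC=false ./run.sh

      # Kick off the upgrade (illustrative only; the managed flow schedules this instead)
      oc adm upgrade --to=4.13.0-rc.6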
      

      Actual results:

      The upgrade is stuck because the control-plane MCP never finishes updating. Manual intervention (deleting the pod stuck in Terminating) was required to move the upgrade along.
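
      Since the installer pod was already in Terminating, the manual deletion most likely required a force delete along these lines (pod name taken from the output below):

      # Force-remove the Terminating installer pod so the node drain can complete
      oc -n openshift-kube-controller-manager delete pod \
          installer-9-ip-10-0-218-240.us-west-2.compute.internal \
          --force --grace-period=0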

      Expected results:

      The upgrade should succeed without any manual intervention.

      Additional info:

      bash-3.2$ oc project openshift-machine-api
      Now using project "openshift-machine-api" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443".
      bash-3.2$ oc get mcp
      NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      master   rendered-master-74371c0a6402ad69951f43db090a5937   False     True       True       3              2                   2                     1                      17h
      worker   rendered-worker-06da68164c0fcd25c54fc3cffc504e7d   True      False      False      186            186                 186                   0                      17h
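
      As a side note, the drain-failure reason behind the degraded master pool can be read straight off the MachineConfigPool conditions (generic diagnostic, not part of the captured session):

      # Show why the master pool reports NodeDegraded
      oc get mcp master -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'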
      
      bash-3.2$ oc get nodes | grep control
      ip-10-0-129-169.us-west-2.compute.internal   Ready                      control-plane,master   17h     v1.26.3+b404935
      ip-10-0-176-172.us-west-2.compute.internal   Ready                      control-plane,master   17h     v1.26.3+b404935
      ip-10-0-218-240.us-west-2.compute.internal   Ready,SchedulingDisabled   control-plane,master   17h     v1.26.3+befad9d
      
      bash-3.2$ oc describe node/ip-10-0-218-240.us-west-2.compute.internal
      Name:               ip-10-0-218-240.us-west-2.compute.internal
      Roles:              control-plane,master
      Labels:             beta.kubernetes.io/arch=amd64
                          beta.kubernetes.io/instance-type=m5.8xlarge
                          beta.kubernetes.io/os=linux
                          failure-domain.beta.kubernetes.io/region=us-west-2
                          failure-domain.beta.kubernetes.io/zone=us-west-2c
                          kubernetes.io/arch=amd64
                          kubernetes.io/hostname=ip-10-0-218-240.us-west-2.compute.internal
                          kubernetes.io/os=linux
                          node-role.kubernetes.io/control-plane=
                          node-role.kubernetes.io/master=
                          node.kubernetes.io/instance-type=m5.8xlarge
                          node.openshift.io/os_id=rhcos
                          topology.ebs.csi.aws.com/zone=us-west-2c
                          topology.kubernetes.io/region=us-west-2
                          topology.kubernetes.io/zone=us-west-2c
      Annotations:        cloud.network.openshift.io/egress-ipconfig:
                            [{"interface":"eni-0d00e83bfcf951d97","ifaddr":{"ipv4":"10.0.192.0/19"},"capacity":{"ipv4":29,"ipv6":30}}]
                          csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0fcda6bf3578f7407"}
                          k8s.ovn.org/host-addresses: ["10.0.218.240"]
                          k8s.ovn.org/l3-gateway-config:
                            {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-218-240.us-west-2.compute.internal","mac-address":"0a:95:ef:fa:9c:17","ip-addres...
                          k8s.ovn.org/node-chassis-id: c7c5d262-341e-481c-804a-da6b4a085e63
                          k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.4/16"}
                          k8s.ovn.org/node-mgmt-port-mac-address: 72:35:cc:3d:dc:90
                          k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.218.240/19"}
                          k8s.ovn.org/node-subnets: {"default":["10.129.0.0/23"]}
                          machine.openshift.io/machine: openshift-machine-api/test-upgrade-g9wl2-master-2
                          machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                          machineconfiguration.openshift.io/currentConfig: rendered-master-74371c0a6402ad69951f43db090a5937
                          machineconfiguration.openshift.io/desiredConfig: rendered-master-bdb8565e5d621ced44f3ebd66713dc05
                          machineconfiguration.openshift.io/desiredDrain: drain-rendered-master-bdb8565e5d621ced44f3ebd66713dc05
                          machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-master-74371c0a6402ad69951f43db090a5937
                          machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 4110931
                          machineconfiguration.openshift.io/reason:
                            failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more informat...
                          machineconfiguration.openshift.io/state: Degraded
                          volumes.kubernetes.io/controller-managed-attach-detach: true
      CreationTimestamp:  Tue, 02 May 2023 18:39:17 -0500
      Taints:             node-role.kubernetes.io/master:NoSchedule
                          node.kubernetes.io/unschedulable:NoSchedule
      Unschedulable:      true
      Lease:
        HolderIdentity:  ip-10-0-218-240.us-west-2.compute.internal
        AcquireTime:     <unset>
        RenewTime:       Wed, 03 May 2023 12:01:43 -0500
      Conditions:
        Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
        ----             ------  -----------------                 ------------------                ------                       -------
        MemoryPressure   False   Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
        DiskPressure     False   Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
        PIDPressure      False   Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
        Ready            True    Wed, 03 May 2023 11:58:20 -0500   Tue, 02 May 2023 19:20:49 -0500   KubeletReady                 kubelet is posting ready status
      Addresses:
        InternalIP:   10.0.218.240
        Hostname:     ip-10-0-218-240.us-west-2.compute.internal
        InternalDNS:  ip-10-0-218-240.us-west-2.compute.internal
      Capacity:
        attachable-volumes-aws-ebs:  25
        cpu:                         32
        ephemeral-storage:           366410732Ki
        hugepages-1Gi:               0
        hugepages-2Mi:               0
        memory:                      130397904Ki
        pods:                        250
      Allocatable:
        attachable-volumes-aws-ebs:  25
        cpu:                         31850m
        ephemeral-storage:           336610388229
        hugepages-1Gi:               0
        hugepages-2Mi:               0
        memory:                      120858320Ki
        pods:                        250
      System Info:
        Machine ID:                             ec21357d1e7ff0abc0f899ce50f1ed57
        System UUID:                            ec21357d-1e7f-f0ab-c0f8-99ce50f1ed57
        Boot ID:                                8ed83c2e-bb8c-47cf-9a5c-8b50db65f45a
        Kernel Version:                         5.14.0-284.10.1.el9_2.x86_64
        OS Image:                               Red Hat Enterprise Linux CoreOS 413.92.202304140330-0 (Plow)
        Operating System:                       linux
        Architecture:                           amd64
        Container Runtime Version:              cri-o://1.26.3-3.rhaos4.13.git641290e.el9
        Kubelet Version:                        v1.26.3+befad9d
        Kube-Proxy Version:                     v1.26.3+befad9d
      ProviderID:                               aws:///us-west-2c/i-0fcda6bf3578f7407
      Non-terminated Pods:                      (22 in total)
        Namespace                               Name                                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------                               ----                                                                   ------------  ----------  ---------------  -------------  ---
        openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-hr6fx                                          30m (0%)      0 (0%)      150Mi (0%)       0 (0%)         158m
        openshift-cluster-node-tuning-operator  tuned-c24fg                                                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         162m
        openshift-dns                           dns-default-n8nzs                                                      60m (0%)      0 (0%)      110Mi (0%)       0 (0%)         128m
        openshift-dns                           node-resolver-9d4d8                                                    5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         134m
        openshift-etcd                          etcd-ip-10-0-218-240.us-west-2.compute.internal                        360m (1%)     0 (0%)      910Mi (0%)       0 (0%)         3h9m
        openshift-image-registry                node-ca-l58ct                                                          10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         164m
        openshift-kube-apiserver                kube-apiserver-ip-10-0-218-240.us-west-2.compute.internal              290m (0%)     0 (0%)      1224Mi (1%)      0 (0%)         3h10m
        openshift-kube-controller-manager       kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal     80m (0%)      0 (0%)      500Mi (0%)       0 (0%)         179m
        openshift-kube-scheduler                openshift-kube-scheduler-ip-10-0-218-240.us-west-2.compute.internal    25m (0%)      0 (0%)      150Mi (0%)       0 (0%)         178m
        openshift-machine-config-operator       machine-config-daemon-5rrrx                                            40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         126m
        openshift-machine-config-operator       machine-config-server-mgvkz                                            20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         123m
        openshift-monitoring                    node-exporter-x8sf4                                                    9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         164m
        openshift-monitoring                    sre-dns-latency-exporter-wn8rf                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
        openshift-multus                        multus-additional-cni-plugins-jfcwt                                    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         156m
        openshift-multus                        multus-zfjjh                                                           10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         159m
        openshift-multus                        network-metrics-daemon-7h52k                                           20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         160m
        openshift-network-diagnostics           network-check-target-2pwkk                                             10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         159m
        openshift-ovn-kubernetes                ovnkube-master-q2tg5                                                   60m (0%)      0 (0%)      1520Mi (1%)      0 (0%)         140m
        openshift-ovn-kubernetes                ovnkube-node-j4p2h                                                     50m (0%)      0 (0%)      660Mi (0%)       0 (0%)         156m
        openshift-security                      audit-exporter-s9ms6                                                   100m (0%)     100m (0%)   256Mi (0%)       256Mi (0%)     16h
        openshift-security                      splunkforwarder-ds-9jgfs                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
        openshift-validation-webhook            validation-webhook-txrkw                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h34m
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource                    Requests     Limits
        --------                    --------     ------
        cpu                         1199m (3%)   100m (0%)
        memory                      5968Mi (5%)  256Mi (0%)
        ephemeral-storage           0 (0%)       0 (0%)
        hugepages-1Gi               0 (0%)       0 (0%)
        hugepages-2Mi               0 (0%)       0 (0%)
        attachable-volumes-aws-ebs  0            0
      Events:
        Type     Reason                     Age                From                 Message
        ----     ------                     ----               ----                 -------
        Normal   RegisteredNode             5h23m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             4h53m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             4h42m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             3h42m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             3h12m              node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             178m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             177m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   ConfigDriftMonitorStarted  126m               machineconfigdaemon  Config Drift Monitor started, watching against rendered-master-74371c0a6402ad69951f43db090a5937
        Normal   RegisteredNode             116m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             106m               node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   ConfigDriftMonitorStopped  91m                machineconfigdaemon  Config Drift Monitor stopped
        Normal   Cordon                     91m                machineconfigdaemon  Cordoned node to apply update
        Normal   Drain                      91m                machineconfigdaemon  Draining node to update config.
        Normal   NodeNotSchedulable         89m (x2 over 16h)  kubelet              Node ip-10-0-218-240.us-west-2.compute.internal status is now: NodeNotSchedulable
        Normal   RegisteredNode             65m                node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Normal   RegisteredNode             55m                node-controller      Node ip-10-0-218-240.us-west-2.compute.internal event: Registered Node ip-10-0-218-240.us-west-2.compute.internal in Controller
        Warning  FailedToDrain              31m                machineconfigdaemon  failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      
      I0503 15:34:09.562518       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler-operator/openshift-kube-scheduler-operator-866f8c587c-js6k9
      I0503 15:34:09.562576       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "apiserver-86f8f7df97-ctgz8" terminating: global timeout reached: 1m30s, error when waiting for pod "pod-identity-webhook-84b6dfbf4-kg9sn" terminating: global timeout reached: 1m30s, error when waiting for pod "oauth-openshift-6b595d45b4-t7vsn" terminating: global timeout reached: 1m30s, error when waiting for pod "apiserver-65c45c94d5-6rpjd" terminating: global timeout reached: 1m30s, error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s, error when waiting for pod "console-6cf648c696-gqzk6" terminating: global timeout reached: 1m30s, error when waiting for pod "multus-admission-controller-6f54b6494-8v9ws" terminating: global timeout reached: 1m30s, error when waiting for pod "managed-upgrade-operator-799b6d8974-nhbjn" terminating: global timeout reached: 1m30s]
      I0503 15:38:47.117907       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:39:01.051732       1 drain_controller.go:142] evicting pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051766       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051768       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051754       1 drain_controller.go:142] evicting pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:39:01.051753       1 drain_controller.go:142] evicting pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:16.499623       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-scheduler/revision-pruner-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:16.899279       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:17.099157       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-etcd/revision-pruner-8-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:17.301624       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-apiserver/revision-pruner-13-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:40:31.699793       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Waiting 1 minute then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 15:42:15.311844       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:42:27.003118       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:43:43.096534       1 request.go:682] Waited for 10.623474152s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:44:07.900120       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 15:48:54.508478       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:48:58.874832       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:50:32.894081       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 15:55:51.100778       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 15:56:04.770237       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:56:04.770246       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:56:51.496851       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-218-240.us-west-2.compute.internal
      I0503 15:57:42.490381       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:01:42.703413       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:01:50.290563       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:03:22.091807       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:07:10.314175       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:07:14.619850       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:08:01.502029       1 request.go:682] Waited for 5.582592435s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-kube-controller-manager/pods/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:08:45.704763       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:10:19.314321       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:10:27.599135       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:12:04.104785       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:17:48.137891       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:18:02.467945       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:19:37.705623       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:25:28.795958       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:25:36.650685       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:27:06.905900       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      I0503 16:30:42.457954       1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/state = Degraded
      I0503 16:30:42.457981       1 node_controller.go:436] Pool master[zone=us-west-2c]: node ip-10-0-218-240.us-west-2.compute.internal: changed annotation machineconfiguration.openshift.io/reason = failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:30:42.458025       1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/state=Degraded
      I0503 16:30:42.458039       1 event.go:285] Event(v1.ObjectReference{Kind:"MachineConfigPool", Namespace:"", Name:"master", UID:"458576c2-92ce-4dc9-8d74-0c9bf73e84bc", APIVersion:"machineconfiguration.openshift.io/v1", ResourceVersion:"4485386", FieldPath:""}): type: 'Normal' reason: 'AnnotationChange' Node ip-10-0-218-240.us-west-2.compute.internal now has machineconfiguration.openshift.io/reason=failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:30:47.466109       1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:30:52.537676       1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:31:37.317970       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:31:41.908812       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:33:12.137370       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:33:12.137419       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:33:12.137430       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:33:15.384961       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:34:45.408037       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:38:01.143850       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:38:01.143864       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:38:04.711285       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:39:34.728154       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:43:06.693748       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:43:06.693761       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:43:09.974369       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:44:39.992050       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:45:40.242252       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:45:40.242263       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:45:43.846551       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:46:36.631592       1 status.go:108] Degraded Machine: ip-10-0-218-240.us-west-2.compute.internal and Degraded Reason: failed to drain node: ip-10-0-218-240.us-west-2.compute.internal after 1 hour. Please see machine-config-controller logs for more information
      I0503 16:47:13.864248       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:48:13.214901       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:48:13.214914       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:48:16.382573       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:49:46.400574       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:53:19.277354       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:53:19.277368       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:53:22.536138       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:54:52.552356       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
      E0503 16:58:25.169846       1 drain_controller.go:350] node ip-10-0-218-240.us-west-2.compute.internal: drain exceeded timeout: 1h0m0s. Will continue to retry.
      I0503 16:58:25.169861       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: initiating drain
      I0503 16:58:28.907471       1 drain_controller.go:142] evicting pod openshift-kube-controller-manager/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      I0503 16:59:58.923551       1 drain_controller.go:171] node ip-10-0-218-240.us-west-2.compute.internal: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when waiting for pod "installer-9-ip-10-0-218-240.us-west-2.compute.internal" terminating: global timeout reached: 1m30s
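
      To see which pod is actually blocking the drain, listing what is still scheduled on the node is a quick check (generic diagnostic, not taken from this session):

      # Anything not Running on the node; the stuck installer pod shows up as Terminating
      oc get pods -A --field-selector spec.nodeName=ip-10-0-218-240.us-west-2.compute.internal | grep -v Running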
      bash-3.2$ oc project openshift-kube-controller-manager
      Now using project "openshift-kube-controller-manager" on server "https://api.test-upgrade.4scv.s1.devshift.org:6443".
      bash-3.2$ oc get pods
      NAME                                                                       READY   STATUS        RESTARTS      AGE
      installer-9-ip-10-0-218-240.us-west-2.compute.internal                     0/1     Terminating   0             16h
      kube-controller-manager-guard-ip-10-0-129-169.us-west-2.compute.internal   1/1     Running       0             113m
      kube-controller-manager-guard-ip-10-0-176-172.us-west-2.compute.internal   1/1     Running       0             92m
      kube-controller-manager-ip-10-0-129-169.us-west-2.compute.internal         4/4     Running       7 (56m ago)   177m
      kube-controller-manager-ip-10-0-176-172.us-west-2.compute.internal         4/4     Running       4             179m
      kube-controller-manager-ip-10-0-218-240.us-west-2.compute.internal         4/4     Running       2 (67m ago)   3h
      revision-pruner-11-ip-10-0-129-169.us-west-2.compute.internal              0/1     Completed     0             121m
      revision-pruner-11-ip-10-0-176-172.us-west-2.compute.internal              0/1     Completed     0             101m
      bash-3.2$ oc describe pod/installer-9-ip-10-0-218-240.us-west-2.compute.internal
      Name:                      installer-9-ip-10-0-218-240.us-west-2.compute.internal
      Namespace:                 openshift-kube-controller-manager
      Priority:                  2000001000
      Priority Class Name:       system-node-critical
      Service Account:           installer-sa
      Node:                      ip-10-0-218-240.us-west-2.compute.internal/10.0.218.240
      Start Time:                Tue, 02 May 2023 19:17:31 -0500
      Labels:                    app=installer
      Annotations:               k8s.ovn.org/pod-networks:
                                   {"default":{"ip_addresses":["10.129.0.43/23"],"mac_address":"0a:58:0a:81:00:2b","gateway_ips":["10.129.0.1"],"ip_address":"10.129.0.43/23"...
                                 k8s.v1.cni.cncf.io/network-status:
                                   [{
                                       "name": "ovn-kubernetes",
                                       "interface": "eth0",
                                       "ips": [
                                           "10.129.0.43"
                                       ],
                                       "mac": "0a:58:0a:81:00:2b",
                                       "default": true,
                                       "dns": {}
                                   }]
      Status:                    Terminating (lasts 16h)
      Termination Grace Period:  30s
      IP:                        10.129.0.43
      IPs:
        IP:  10.129.0.43
      Containers:
        installer:
          Container ID:  cri-o://8bae6acb523c145e55b86720fed4bb81c95a8a4e1295c4c901057038c780ce55
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7
          Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6ce10c67651c6bf6f12251a895b0fd8c3b1f74bd9d283e1eb4562c6cb07efff7
          Port:          <none>
          Host Port:     <none>
          Command:
            cluster-kube-controller-manager-operator
            installer
          Args:
            -v=2
            --revision=9
            --namespace=openshift-kube-controller-manager
            --pod=kube-controller-manager-pod
            --resource-dir=/etc/kubernetes/static-pod-resources
            --pod-manifest-dir=/etc/kubernetes/manifests
            --configmaps=kube-controller-manager-pod
            --configmaps=config
            --configmaps=cluster-policy-controller-config
            --configmaps=controller-manager-kubeconfig
            --optional-configmaps=cloud-config
            --configmaps=kube-controller-cert-syncer-kubeconfig
            --configmaps=serviceaccount-ca
            --configmaps=service-ca
            --configmaps=recycler-config
            --secrets=service-account-private-key
            --optional-secrets=serving-cert
            --secrets=localhost-recovery-client-token
            --cert-dir=/etc/kubernetes/static-pod-resources/kube-controller-manager-certs
            --cert-configmaps=aggregator-client-ca
            --cert-configmaps=client-ca
            --optional-cert-configmaps=trusted-ca-bundle
            --cert-secrets=kube-controller-manager-client-cert-key
            --cert-secrets=csr-signer
          State:      Terminated
            Reason:   Error
            Message:  0] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ...
      I0503 00:18:04.654371       1 cmd.go:218] Creating target resource directory "/etc/kubernetes/static-pod-resources/kube-controller-manager-pod-9" ...
      I0503 00:18:04.654385       1 cmd.go:226] Getting secrets ...
      I0503 00:18:04.657165       1 copy.go:32] Got secret openshift-kube-controller-manager/localhost-recovery-client-token-9
      I0503 00:18:04.659036       1 copy.go:32] Got secret openshift-kube-controller-manager/service-account-private-key-9
      I0503 00:18:04.729178       1 copy.go:32] Got secret openshift-kube-controller-manager/serving-cert-9
      I0503 00:18:04.729221       1 cmd.go:239] Getting config maps ...
      I0503 00:18:04.731598       1 copy.go:60] Got configMap openshift-kube-controller-manager/cluster-policy-controller-config-9
      I0503 00:18:04.733267       1 copy.go:60] Got configMap openshift-kube-controller-manager/config-9
      I0503 00:18:04.734877       1 copy.go:60] Got configMap openshift-kube-controller-manager/controller-manager-kubeconfig-9
      I0503 00:18:04.738125       1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-cert-syncer-kubeconfig-9
      I0503 00:18:04.740282       1 copy.go:60] Got configMap openshift-kube-controller-manager/kube-controller-manager-pod-9
      I0503 00:18:04.850337       1 copy.go:60] Got configMap openshift-kube-controller-manager/recycler-config-9
      I0503 00:18:05.052508       1 copy.go:60] Got configMap openshift-kube-controller-manager/service-ca-9
      I0503 00:18:05.253415       1 copy.go:60] Got configMap openshift-kube-controller-manager/serviceaccount-ca-9
      I0503 00:18:05.291982       1 cmd.go:124] Received SIGTERM or SIGINT signal, shutting down the process.
      I0503 00:18:05.292067       1 copy.go:52] Failed to get config map openshift-kube-controller-manager/cloud-config-9: client rate limiter Wait returned an error: context canceled
      F0503 00:18:05.451745       1 cmd.go:106] failed to copy: client rate limiter Wait returned an error: context canceled
      
            Exit Code:    1
            Started:      Tue, 02 May 2023 19:17:34 -0500
            Finished:     Tue, 02 May 2023 19:18:05 -0500
          Ready:          False
          Restart Count:  0
          Limits:
            cpu:     150m
            memory:  200M
          Requests:
            cpu:     150m
            memory:  200M
          Environment:
            POD_NAME:   installer-9-ip-10-0-218-240.us-west-2.compute.internal (v1:metadata.name)
            NODE_NAME:   (v1:spec.nodeName)
          Mounts:
            /etc/kubernetes/ from kubelet-dir (rw)
            /var/lock from var-lock (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
      Conditions:
        Type               Status
        DisruptionTarget   True 
        Initialized        True 
        Ready              False 
        ContainersReady    False 
        PodScheduled       True 
      Volumes:
        kubelet-dir:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/kubernetes/
          HostPathType:  
        var-lock:
          Type:          HostPath (bare host directory volume)
          Path:          /var/lock
          HostPathType:  
        kube-api-access:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3600
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
      QoS Class:                   Guaranteed
      Node-Selectors:              <none>
      Tolerations:                 op=Exists
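
      Given that the installer container exited on 02 May at 19:18:05 yet the pod stayed in Terminating for 16h with only a 30s grace period, a hypothetical follow-up (not run in this session) would be to confirm the state on the node itself:

      # Check CRI-O directly for the installer container on the affected node
      oc debug node/ip-10-0-218-240.us-west-2.compute.internal -- chroot /host crictl ps -a | grep installer-9

      # Confirm the API still shows a deletionTimestamp / lingering finalizers on the pod
      oc -n openshift-kube-controller-manager get pod installer-9-ip-10-0-218-240.us-west-2.compute.internal \
          -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'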

          People

            svanka@redhat.com Sai Ramesh Vanka
            smalleni@redhat.com Sai Sindhur Malleni
            ying zhou ying zhou