OCPBUGS-14662

[IBMCloud] Master machine was replaced and stuck in Deleting, many csrs Pending


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Undefined
    • None
    • 4.14
    • None
    • Important
    • No
    • False

      Description of problem:

      During an IPI installation on IBM Cloud, one master machine was replaced and then became stuck in the Deleting phase, the worker machines stayed in the Provisioned phase, and many CSRs remained Pending.

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-06-05-112833

      How reproducible:

      Observed once

      Steps to Reproduce:

      1. Create an IPI cluster on IBM Cloud.

      Actual results:

      The IPI installation failed: one master machine was replaced and then became stuck in Deleting because of a preDrain hook, 3 worker machines stayed in Provisioned, and many CSRs remained Pending.
      
      $ oc get machine -n openshift-machine-api   
      NAME                            PHASE         TYPE       REGION   ZONE      AGE
      zhsunibm-4mzf5-master-0         Deleting      bx2-4x16   eu-gb    eu-gb-1   5h53m
      zhsunibm-4mzf5-master-1         Running       bx2-4x16   eu-gb    eu-gb-2   5h53m
      zhsunibm-4mzf5-master-2         Running       bx2-4x16   eu-gb    eu-gb-3   5h53m
      zhsunibm-4mzf5-worker-1-sd8hj   Provisioned   bx2-4x16   eu-gb    eu-gb-1   4h7m
      zhsunibm-4mzf5-worker-2-wwdzt   Provisioned   bx2-4x16   eu-gb    eu-gb-2   4h7m
      zhsunibm-4mzf5-worker-3-945tn   Provisioned   bx2-4x16   eu-gb    eu-gb-3   4h6m
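
For triage on a larger cluster, the machine listing above can be reduced to the machines that are not in the Running phase. The following is a minimal sketch, assuming the default table layout of `oc get machine` (NAME in the first column, PHASE in the second); `non_running_machines` is a hypothetical helper name and not part of `oc`.

```shell
#!/usr/bin/env bash
# Sketch only: non_running_machines is a hypothetical helper, not an oc subcommand.
# It reads the table printed by `oc get machine` on stdin and prints the
# NAME and PHASE of every machine whose PHASE is not "Running".
non_running_machines() {
  # NR > 1 skips the header row; $1 is NAME, $2 is PHASE.
  awk 'NR > 1 && $2 != "Running" { print $1, $2 }'
}

# Intended use against a live cluster (not executed here):
#   oc get machine -n openshift-machine-api | non_running_machines
```

Applied to the listing above, this would leave only master-0 (Deleting) and the three workers (Provisioned). The column positions are an assumption: a machine with an empty PHASE would shift the fields and be reported incorrectly.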
      
      $ oc get machine zhsunibm-4mzf5-master-0 -o yaml -n openshift-machine-api
      status:
        addresses:
        - address: zhsunibm-4mzf5-master-0
          type: InternalDNS
        - address: 10.242.0.8
          type: InternalIP
        conditions:
        - lastTransitionTime: "2023-06-07T01:53:33Z"
          message: 'Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]'
          reason: HookPresent
          severity: Warning
          status: "False"
          type: Drainable
        - lastTransitionTime: "2023-06-07T01:52:03Z"
          status: "True"
          type: InstanceExists
        - lastTransitionTime: "2023-06-07T01:52:03Z"
          status: "True"
          type: Terminable
        lastUpdated: "2023-06-07T03:29:47Z"
        nodeRef:
          kind: Node
          name: zhsunibm-4mzf5-master-0
          uid: bf748d29-e4e4-492d-b82b-98a55822eab1
        phase: Deleting
        providerStatus:
          conditions:
          - lastProbeTime: "2023-06-07T01:52:03Z"
            lastTransitionTime: "2023-06-07T01:52:03Z"
            message: Machine successfully created
            reason: MachineCreationSucceeded
            status: "True"
            type: MachineCreated
          - lastProbeTime: "2023-06-07T01:55:30Z"
            lastTransitionTime: "2023-06-07T01:53:12Z"
            message: Machine replacement completed successfully
            reason: MachineReplacementCompleted
            status: "True"
            type: MachineReplacement
          instanceId: 0787_81242d30-f80c-47b9-a5a4-33ff0a1faaeb
          instanceState: running
      
      $ oc logs -f machine-api-controllers-686f9c947f-pxhsl -n openshift-machine-api -c machine-controller
      I0607 01:53:00.243661       1 controller.go:282] zhsunibm-4mzf5-master-0: reconciling machine triggers idempotent update
      I0607 01:53:00.243774       1 actuator.go:98] zhsunibm-4mzf5-master-0: Updating machine
      I0607 01:53:02.145241       1 reconciler.go:267] zhsunibm-4mzf5-master-0: checking if machine is past replacement deadline
      I0607 01:53:02.145981       1 reconciler.go:397] zhsunibm-4mzf5-master-0: machine is past 15 minute deadline
      W0607 01:53:02.146043       1 reconciler.go:275] zhsunibm-4mzf5-master-0: attempting to replace stuck machine
      I0607 01:53:02.146067       1 reconciler.go:277] zhsunibm-4mzf5-master-0: clearing machine's previous data for replacement machine
      I0607 01:53:02.146090       1 machine_scope.go:156] "zhsunibm-4mzf5-master-0": patching machine
      I0607 01:53:12.178998       1 reconciler.go:285] zhsunibm-4mzf5-master-0: updating provider status for replacement requested
      I0607 01:53:12.179109       1 conditions.go:45] Adding new provider condition {MachineReplacement True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC MachineReplacementRequested Machine replacement requested}
      I0607 01:53:12.179144       1 machine_scope.go:156] "zhsunibm-4mzf5-master-0": patching machine
      I0607 01:53:22.212994       1 reconciler.go:297] zhsunibm-4mzf5-master-0: deleting machine for replacement
      I0607 01:53:23.697056       1 reconciler.go:158] zhsunibm-4mzf5-master-0: machine status is exists, requeuing...
      I0607 01:53:23.697098       1 reconciler.go:301] zhsunibm-4mzf5-master-0: machine delete call made successfully, for replacement
      
      $ oc get csr | grep Pending     
      csr-2kns8                                        3h38m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      csr-2kp5l                                        4h54m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      csr-2qzhl                                        170m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      csr-4nndz                                        124m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      csr-4vskz                                        4h23m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      csr-52f7h                                        3h22m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper         <none>              Pending
      
      $ oc logs -f machine-approver-584db5bcf7-rm6km -n openshift-cluster-machine-approver -c machine-approver-controller  
      Error from server: Get "https://10.242.0.8:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-584db5bcf7-rm6km/machine-approver-controller?follow=true": remote error: tls: internal error
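
The names of the pending requests can be extracted from the `oc get csr` table for closer inspection. The following is a minimal sketch, assuming the default column layout in which CONDITION is the last column; `pending_csr_names` is a hypothetical helper name and not part of `oc`.

```shell
#!/usr/bin/env bash
# Sketch only: pending_csr_names is a hypothetical helper, not an oc subcommand.
# It reads the table printed by `oc get csr` on stdin and prints the NAME of
# every request whose CONDITION (last column) is exactly "Pending".
pending_csr_names() {
  # Skip the header row if present; works on grep-filtered output too.
  awk '$1 != "NAME" && $NF == "Pending" { print $1 }'
}

# Intended use against a live cluster (not executed here):
#   oc get csr | pending_csr_names
#   oc describe csr <name>               # inspect a single request
#   oc adm certificate approve <name>    # approve only after verifying the request
```

Comparing on the last field rather than using `grep Pending` avoids matching the word anywhere else in a row. Whether approving the requests is appropriate in this situation is not established by this report; each request should be checked against an expected node first.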

      Expected results:

      Successful IPI creation on IBM Cloud

      Additional info:

      must-gather: https://drive.google.com/file/d/1Mnfy48NJFQw5wG6hyeZsuS1tpYrT7-aV/view?usp=sharing
      The machine replacement is related to https://issues.redhat.com/browse/OCPBUGS-1327
      The pending CSRs may be related to https://issues.redhat.com/browse/OCPBUGS-8349

            jeffbnowicki Jeff Nowicki
            rhn-support-zhsun Zhaohua Sun
            Zhaohua Sun Zhaohua Sun