-
Bug
-
Resolution: Not a Bug
-
Undefined
-
None
-
4.14
-
None
-
Important
-
No
-
False
-
Description of problem:
During IPI installation on IBM Cloud, one master machine was replaced and got stuck in Deleting, the worker machines were stuck in Provisioned status, and many CSRs were left pending.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833
How reproducible:
Encountered once
Steps to Reproduce:
1. Create an IPI cluster on IBM Cloud
Actual results:
IPI creation failed: one master machine was replaced and got stuck in Deleting because of a preDrain hook, 3 workers were stuck in Provisioned, and many CSRs were pending.

$ oc get machine -n openshift-machine-api
NAME                            PHASE         TYPE       REGION   ZONE      AGE
zhsunibm-4mzf5-master-0         Deleting      bx2-4x16   eu-gb    eu-gb-1   5h53m
zhsunibm-4mzf5-master-1         Running       bx2-4x16   eu-gb    eu-gb-2   5h53m
zhsunibm-4mzf5-master-2         Running       bx2-4x16   eu-gb    eu-gb-3   5h53m
zhsunibm-4mzf5-worker-1-sd8hj   Provisioned   bx2-4x16   eu-gb    eu-gb-1   4h7m
zhsunibm-4mzf5-worker-2-wwdzt   Provisioned   bx2-4x16   eu-gb    eu-gb-2   4h7m
zhsunibm-4mzf5-worker-3-945tn   Provisioned   bx2-4x16   eu-gb    eu-gb-3   4h6m

$ oc get machine zhsunibm-4mzf5-master-0 -o yaml -n openshift-machine-api
status:
  addresses:
  - address: zhsunibm-4mzf5-master-0
    type: InternalDNS
  - address: 10.242.0.8
    type: InternalIP
  conditions:
  - lastTransitionTime: "2023-06-07T01:53:33Z"
    message: 'Drain operation currently blocked by: [{Name:EtcdQuorumOperator Owner:clusteroperator/etcd}]'
    reason: HookPresent
    severity: Warning
    status: "False"
    type: Drainable
  - lastTransitionTime: "2023-06-07T01:52:03Z"
    status: "True"
    type: InstanceExists
  - lastTransitionTime: "2023-06-07T01:52:03Z"
    status: "True"
    type: Terminable
  lastUpdated: "2023-06-07T03:29:47Z"
  nodeRef:
    kind: Node
    name: zhsunibm-4mzf5-master-0
    uid: bf748d29-e4e4-492d-b82b-98a55822eab1
  phase: Deleting
  providerStatus:
    conditions:
    - lastProbeTime: "2023-06-07T01:52:03Z"
      lastTransitionTime: "2023-06-07T01:52:03Z"
      message: Machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    - lastProbeTime: "2023-06-07T01:55:30Z"
      lastTransitionTime: "2023-06-07T01:53:12Z"
      message: Machine replacement completed successfully
      reason: MachineReplacementCompleted
      status: "True"
      type: MachineReplacement
    instanceId: 0787_81242d30-f80c-47b9-a5a4-33ff0a1faaeb
    instanceState: running

$ oc logs -f machine-api-controllers-686f9c947f-pxhsl -n openshift-machine-api -c machine-controller
I0607 01:53:00.243661 1 controller.go:282] zhsunibm-4mzf5-master-0: reconciling machine triggers idempotent update
I0607 01:53:00.243774 1 actuator.go:98] zhsunibm-4mzf5-master-0: Updating machine
I0607 01:53:02.145241 1 reconciler.go:267] zhsunibm-4mzf5-master-0: checking if machine is past replacement deadline
I0607 01:53:02.145981 1 reconciler.go:397] zhsunibm-4mzf5-master-0: machine is past 15 minute deadline
W0607 01:53:02.146043 1 reconciler.go:275] zhsunibm-4mzf5-master-0: attempting to replace stuck machine
I0607 01:53:02.146067 1 reconciler.go:277] zhsunibm-4mzf5-master-0: clearing machine's previous data for replacement machine
I0607 01:53:02.146090 1 machine_scope.go:156] "zhsunibm-4mzf5-master-0": patching machine
I0607 01:53:12.178998 1 reconciler.go:285] zhsunibm-4mzf5-master-0: updating provider status for replacement requested
I0607 01:53:12.179109 1 conditions.go:45] Adding new provider condition {MachineReplacement True 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC MachineReplacementRequested Machine replacement requested}
I0607 01:53:12.179144 1 machine_scope.go:156] "zhsunibm-4mzf5-master-0": patching machine
I0607 01:53:22.212994 1 reconciler.go:297] zhsunibm-4mzf5-master-0: deleting machine for replacement
I0607 01:53:23.697056 1 reconciler.go:158] zhsunibm-4mzf5-master-0: machine status is exists, requeuing...
I0607 01:53:23.697098 1 reconciler.go:301] zhsunibm-4mzf5-master-0: machine delete call made successfully, for replacement

$ oc get csr | grep Pending
csr-2kns8   3h38m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-2kp5l   4h54m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-2qzhl   170m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-4nndz   124m    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-4vskz   4h23m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-52f7h   3h22m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending

$ oc logs -f machine-approver-584db5bcf7-rm6km -n openshift-cluster-machine-approver -c machine-approver-controller
Error from server: Get "https://10.242.0.8:10250/containerLogs/openshift-cluster-machine-approver/machine-approver-584db5bcf7-rm6km/machine-approver-controller?follow=true": remote error: tls: internal error
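For triage, the machine listing in the results above can be summarized by phase. The following is a minimal sketch, not part of the original report: the helper name `machines_by_phase` is hypothetical, and it assumes the plain-text column layout of `oc get machine` shown in this report (a header row containing NAME and PHASE, with every row having a non-empty PHASE value).

```python
from collections import defaultdict

# Machine listing copied from the `oc get machine` output in this report.
SAMPLE = """\
NAME PHASE TYPE REGION ZONE AGE
zhsunibm-4mzf5-master-0 Deleting bx2-4x16 eu-gb eu-gb-1 5h53m
zhsunibm-4mzf5-master-1 Running bx2-4x16 eu-gb eu-gb-2 5h53m
zhsunibm-4mzf5-master-2 Running bx2-4x16 eu-gb eu-gb-3 5h53m
zhsunibm-4mzf5-worker-1-sd8hj Provisioned bx2-4x16 eu-gb eu-gb-1 4h7m
zhsunibm-4mzf5-worker-2-wwdzt Provisioned bx2-4x16 eu-gb eu-gb-2 4h7m
zhsunibm-4mzf5-worker-3-945tn Provisioned bx2-4x16 eu-gb eu-gb-3 4h6m
"""


def machines_by_phase(table_text):
    """Group machine names by the PHASE column of `oc get machine` text output.

    Hypothetical helper. Assumes whitespace-separated columns and that no
    row has an empty PHASE cell (an empty cell would shift the columns).
    """
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    name_idx = header.index("NAME")
    phase_idx = header.index("PHASE")

    grouped = defaultdict(list)
    for row in lines[1:]:
        cols = row.split()
        grouped[cols[phase_idx]].append(cols[name_idx])
    return dict(grouped)


if __name__ == "__main__":
    # Print only the machines that are not in the Running phase.
    for phase, names in machines_by_phase(SAMPLE).items():
        if phase != "Running":
            print(f"{phase}: {', '.join(names)}")
```

Run against the listing above, this groups master-0 under Deleting and the three workers under Provisioned, which matches the failure described in this report.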
Expected results:
Successful IPI creation on IBM Cloud
Additional info:
must-gather: https://drive.google.com/file/d/1Mnfy48NJFQw5wG6hyeZsuS1tpYrT7-aV/view?usp=sharing
Replaced machine related to this bug: https://issues.redhat.com/browse/OCPBUGS-1327
CSR issue possibly related: https://issues.redhat.com/browse/OCPBUGS-8349