-
Bug
-
Resolution: Done
-
Critical
-
None
-
4.14
-
Quality / Stability / Reliability
-
False
-
-
3
-
Critical
-
No
-
None
-
None
-
Proposed
-
Metal Platform 240
-
1
-
None
-
None
-
None
-
None
-
None
-
None
-
None
Description of problem:
OCP 4.14 deployment on real BM (HPE) setup using redfish-virtualmedia - disconnected is constantly failing to complete bootstrapping. From installation log
time="2023-07-31T17:14:00+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[0]: Creation complete after 3m30s [id=67c63134-d7ef-4175-abd1-62e5759ca0e4]" time="2023-07-31T17:14:09+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[2]: Still creating... [3m40s elapsed]" time="2023-07-31T17:14:09+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Still creating... [3m40s elapsed]" time="2023-07-31T17:14:19+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[2]: Still creating... [3m50s elapsed]" time="2023-07-31T17:14:19+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Still creating... [3m50s elapsed]" time="2023-07-31T17:14:29+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Still creating... [4m0s elapsed]" time="2023-07-31T17:14:29+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[2]: Still creating... [4m0s elapsed]" time="2023-07-31T17:14:30+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[1]: Creation complete after 4m0s [id=de684789-785e-4e22-be7b-c904d8b90d6e]" time="2023-07-31T17:14:30+03:00" level=debug msg="ironic_deployment.openshift-master-deployment[2]: Creation complete after 4m0s [id=e5d2be79-f6b5-4ed7-9de5-f52cf79800df]" time="2023-07-31T17:14:30+03:00" level=debug time="2023-07-31T17:14:30+03:00" level=debug msg="Apply complete! Resources: 6 added, 0 changed, 0 destroyed." time="2023-07-31T17:14:30+03:00" level=debug msg="OpenShift Installer 4.14.0-0.nightly-2023-07-30-234232" time="2023-07-31T17:14:30+03:00" level=debug msg="Built from commit e044bf5e410b27ec745cd72aba5b5f67e47cd14e" time="2023-07-31T17:14:30+03:00" level=info msg="Waiting up to 20m0s (until 5:34PM IDT) for the Kubernetes API at https://api.ocp-edge1.lab.eng.tlv2.redhat.com:6443..." time="2023-07-31T17:14:30+03:00" level=debug msg="Loading Agent Config..." time="2023-07-31T17:14:31+03:00" level=info msg="API v1.27.3+4aaeaec up" time="2023-07-31T17:14:31+03:00" level=debug msg="Loading Install Config..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading SSH Key..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Base Domain..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Cluster Name..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Base Domain..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Networking..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Pull Secret..." time="2023-07-31T17:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T17:14:31+03:00" level=debug msg="Using Install Config loaded from state file" time="2023-07-31T17:14:31+03:00" level=info msg="Waiting up to 1h0m0s (until 6:14PM IDT) for bootstrapping to complete..." time="2023-07-31T18:14:31+03:00" level=debug msg="Fetching Bootstrap SSH Key Pair..." time="2023-07-31T18:14:31+03:00" level=debug msg="Loading Bootstrap SSH Key Pair..." time="2023-07-31T18:14:31+03:00" level=debug msg="Using Bootstrap SSH Key Pair loaded from state file" time="2023-07-31T18:14:31+03:00" level=debug msg="Reusing previously-fetched Bootstrap SSH Key Pair" time="2023-07-31T18:14:31+03:00" level=debug msg="Fetching Install Config..." time="2023-07-31T18:14:31+03:00" level=debug msg="Loading Install Config..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading SSH Key..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Base Domain..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Cluster Name..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Base Domain..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Networking..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Pull Secret..." time="2023-07-31T18:14:31+03:00" level=debug msg=" Loading Platform..." time="2023-07-31T18:14:31+03:00" level=debug msg="Using Install Config loaded from state file" time="2023-07-31T18:14:31+03:00" level=debug msg="Reusing previously-fetched Install Config" time="2023-07-31T18:14:31+03:00" level=error msg="Attempted to gather debug logs after installation failure: must provide bootstrap host address" time="2023-07-31T18:14:31+03:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition" time="2023-07-31T18:14:31+03:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
On bootstrap during the installation I see all ironic containers are up and running and stopped when finished their work. Only container that is running and restarting is
$ sudo podman ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 4277ea0e7228 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:89f62fb5941a5807622a796519e4bb3a94808b14c15dbbfe094b1aee3e1f2d6d start --tear-down... 16 minutes ago Up 16 minutes cluster-bootstrap
The container log http://pastebin.test.redhat.com/1106364
The state of the cluster:
$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE authentication baremetal cloud-controller-manager cloud-credential True False False 84m cluster-autoscaler config-operator console control-plane-machine-set csi-snapshot-controller dns etcd image-registry ingress insights kube-apiserver kube-controller-manager kube-scheduler kube-storage-version-migrator machine-api machine-approver machine-config marketplace monitoring network node-tuning openshift-apiserver openshift-controller-manager openshift-samples operator-lifecycle-manager operator-lifecycle-manager-catalog operator-lifecycle-manager-packageserver service-ca storage
$ oc get bmh -n openshift-machine-api NAME STATE CONSUMER ONLINE ERROR AGE openshift-master-0 ocp-edge1-zwzcn-master-0 true 16h openshift-master-1 ocp-edge1-zwzcn-master-1 true 16h openshift-master-2 ocp-edge1-zwzcn-master-2 true 16h openshift-worker-0 true 16h openshift-worker-1 true 16h $ oc get machine -n openshift-machine-api NAME PHASE TYPE REGION ZONE AGE ocp-edge1-zwzcn-master-0 16h ocp-edge1-zwzcn-master-1 16h ocp-edge1-zwzcn-master-2 16h
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-30-234232 and earlier images as well. We have this failure from the last week
How reproducible:
100%
Steps to Reproduce:
1. Deploy OCP 4.14 on real bm setup with virtualmedia 2. 3.
Actual results:
see above
Expected results:
deployment should pass and cluster should be ready
Additional info:
will provide boostrap log-bundel Deployment of 4.13 on the same setup passed. Also on VM simulation the deployment is passing Was running and analizing deploy with baremetal ipv4. Will try with ipv6 and add info here