Bug
Resolution: Unresolved
Undefined
None
4.19
The bootstrap API server should be terminated only after the API is highly available (HA), that is, once the API is available on at least 2 master nodes. The steps should be:
- wait until the API is HA (available on 2+ master nodes)
- delete the bootstrap kube-apiserver manifests
- wait for the bootstrap API to be down
- delete all other static manifests
- mark the bootstrap process done
We should distinguish between a) the bootstrap node itself existing and b) the API being available on the bootstrap node. Today, cluster-bootstrap removes the bootstrap API (b) as soon as two master nodes appear. This is what happens today on the bootstrap node:
a) create the static assets
b) wait for 2 master nodes to appear
c) remove the kube-apiserver from the bootstrap node
d) mark the bootstrap process as completed
But this can already leave a time window in which the API is not available, starting at step c) and lasting until the API is available on a master node.
The cluster-bootstrap executable is invoked here:
https://github.com/openshift/installer/blob/c534bb90b780ae488bc6ef7901e0f3f6273e2764/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L541
start --tear-down-early=false --asset-dir=/assets --required-pods="${REQUIRED_PODS}"
Then cluster-bootstrap removes the bootstrap API here: https://github.com/openshift/cluster-bootstrap/blob/bcd73a12a957ce3821bdfc0920751b8e3528dc98/pkg/start/start.go#L203-L209
but the wait for the API to be HA is done here: https://github.com/openshift/installer/blob/c534bb90b780ae488bc6ef7901e0f3f6273e2764/data/data/bootstrap/files/usr/local/bin/report-progress.sh#L24
The wait should happen from within cluster-bootstrap. This PR moves the wait so that it runs before cluster-bootstrap tears down the bootstrap API/control plane.