-
Bug
-
Resolution: Done-Errata
-
Undefined
-
4.14, 4.15
hypershift#1614 gave us the router Deployment (descended from the private-router Deployment), but it lacks PDB coverage. For example:
$ git --no-pager log -1 --oneline origin/main
f3f421bc7 (origin/release-4.16, origin/release-4.15, origin/main, origin/HEAD) Merge pull request #3183 from muraee/azure-kms
$ git --no-pager grep 'func [^(]*\(Deployment\|PodDisruptionBudget\)' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/ingress/router.go:func ReconcileRouterDeployment(deployment *appsv1.Deployment, ownerRef config.OwnerRef, deploymentConfig config.DeploymentConfig, image string, config *corev1.ConfigMap) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/deployment.go:func ReconcileKubeAPIServerDeployment(deployment *appsv1.Deployment,
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/pdb.go:func ReconcilePodDisruptionBudget(pdb *policyv1.PodDisruptionBudget, p *KubeAPIServerParams) error {
Both the ingress and kas packages have Reconcile*Deployment functions. Only kas has a ReconcilePodDisruptionBudget function.
This bug asks for the router Deployment to get a covering PDB too, because being able to evict all router-* pods simultaneously (for the cluster flavors that have replicas > 1 on that Deployment) can make incoming traffic unreachable. And some of that Route traffic looks like stuff that folks would want to be reliably reachable:
$ git --no-pager grep 'func Reconcile[^(]*Route(' f3f421bc7 -- control-plane-operator/controllers/hostedcontrolplane/{ingress,kas}
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPublicRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileExternalPrivateRoute(route *routev1.Route, owner *metav1.OwnerReference, hostname string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileInternalRoute(route *routev1.Route, owner *metav1.OwnerReference) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityExternalRoute(route *routev1.Route, ownerRef config.OwnerRef, hostname string, defaultIngressDomain string) error {
f3f421bc7:control-plane-operator/controllers/hostedcontrolplane/kas/service.go:func ReconcileKonnectivityInternalRoute(route *routev1.Route, ownerRef config.OwnerRef) error {
Test plan:
1. Install a hosted cluster.
2. Log into the management cluster, and find the hosted cluster's namespace, $NAMESPACE.
3. Evict both router pods (using a raw create, because there isn't a more convenient syntax yet):
oc -n "${NAMESPACE}" get -l app=private-router -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' pods | while read NAME
do
  oc create -f - <<EOF --raw "/api/v1/namespaces/${NAMESPACE}/pods/${NAME}/eviction"
{"apiVersion": "policy/v1", "kind": "Eviction", "metadata": {"name": "${NAME}"}}
EOF
done
If that evicts both router pods one right after the other, ingress will probably hiccup. With the PDB in place, I'd expect the second eviction to fail.
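To make the expected outcome of the test plan concrete, here is a small Go sketch that models back-to-back evictions with and without a budget. It is a simplified stand-in, not the Kubernetes disruption controller and not HyperShift code: the budget type, the evictAll function, the pod names, and the minAvailable value of 1 are all assumptions for illustration, and it assumes no replacement pod becomes ready between the two evictions.

```go
package main

import "fmt"

// budget is a hypothetical, simplified stand-in for a PodDisruptionBudget
// that only sets minAvailable.
type budget struct {
	minAvailable int
}

// evictAll tries to evict each named pod in order, assuming no replacement
// pod becomes ready in between. It returns the pods whose eviction was
// allowed and the pods whose eviction was refused.
// A nil budget models the current state: no PDB covers the pods.
func evictAll(pods []string, b *budget) (evicted, refused []string) {
	healthy := len(pods)
	for _, name := range pods {
		// With a budget, an eviction is refused when it would leave
		// fewer than minAvailable healthy pods.
		if b != nil && healthy-1 < b.minAvailable {
			refused = append(refused, name)
			continue
		}
		healthy--
		evicted = append(evicted, name)
	}
	return evicted, refused
}

func main() {
	// Hypothetical pod names for a two-replica router Deployment.
	pods := []string{"router-a", "router-b"}

	evicted, refused := evictAll(pods, nil)
	fmt.Println("without PDB: evicted", evicted, "refused", refused)

	evicted, refused = evictAll(pods, &budget{minAvailable: 1})
	fmt.Println("with minAvailable=1: evicted", evicted, "refused", refused)
}
```

In this model the unprotected case lets both evictions through, while the budgeted case refuses the second one. On a real cluster, the eviction API reports a refusal of that kind as a 429 Too Many Requests response.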
- relates to
-
HOSTEDCP-485 Use per-hcp router for all ingress
- Closed
- links to
-
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update