-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
4.17
Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.
According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)
> "Aug 16 16:43:17.672 - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"
The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:
> 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
...
I0816 16:43:17.650083 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
...
The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available.
According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time:
> I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:
{10250 [10.128.2.31 10.131.0.12] []}]
I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"
AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.
For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40
bad decisions may be made if the if the services and/or endpoints cache are stale.
Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:
> $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out
2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created
2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update
2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update
We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23.", an UPDATE was received/treated which may have helped the apiserver sync its endpoints cache or/and chose a healthy Pod
Also, no update was triggered when the second Pod was deleted at "16:43:17" which may explain the stale 10.128.2.31 endpoints entry on apiserver side.
To summarize, I can see two problems here (maybe one is the consequence of the other):
A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638
The apiserver resolver had a endpoints cache with one stale and one fresh entry but it kept 4-5 times in a row trying to reach the stale entry OR
The endpoints was updated "At around 16:39:XX the first Pod was rolled out, see above", but the apiserver resolver cache missed that and ended up with 2 stale entries in the cache, and had to wait until "At around 16:43:23, the new Pod that replaced the second one was created, see above" to sync and replace them with 2 fresh entries.
Version-Release number of selected component (if applicable):
{code:none}
How reproducible:
Steps to Reproduce:
1. See "Description of problem" 2. 3.
Actual results:
Expected results:
the kube-aggregator should detect stale Apiservice endpoints.
Additional info:
the kube-aggregator proxies requests to a stale Endpoints/Pod which makes Metrics API requests falsely fail.
- blocks
-
OCPBUGS-41580 Kube-aggregator reaching stale apiservice endpoints
- Closed
- is cloned by
-
OCPBUGS-41580 Kube-aggregator reaching stale apiservice endpoints
- Closed
- is triggered by
-
OCPBUGS-36808 Metrics API disruption regressed during node reboot with aws ovn upgrade job
- Closed
- links to
-
RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update