Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.17
Component/s: Monitoring
Labels:
- trt-standup

Regression:
Yes
Sprint:
MON Sprint 259
sprint_count:
1
Release Blocker:
Approved
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Release Note Status:
Done
Target Version:

4.17.0
Target Backport Versions:

4.17.0, 4.16.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-39133~~. The following is the description of the original issue:
—
Description of problem:

Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.

According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)

> 	"Aug 16 *16:43:17.672* - 2s    E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"

The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:

> 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
...
I0816 *16:43:17.650083*    2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
...

The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available.
According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time:

> I0816 16:40:24.711048    4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}]

*I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"*

AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.

For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40

bad decisions may be made if the if the services and/or endpoints cache are stale.

Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:

> $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
2024-08-16T15:39:57.575468Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T15:40:02.005051Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T15:40:35.085330Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T15:40:35.128519Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:19:41.148148Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:19:47.797420Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:20:23.051594Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:20:23.100761Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:20:23.938927Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:21:01.699722Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:39:00.328312Z	system:serviceaccount:kube-system:endpoint-controller	update ==> At around 16:39:XX the first Pod was rolled out
2024-08-16T16:39:07.260823Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:39:41.124449Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:43:23.701015Z	system:serviceaccount:kube-system:endpoint-controller	update ==> At around 16:43:23, the new Pod that replaced the second one was created
2024-08-16T16:43:28.639793Z	system:serviceaccount:kube-system:endpoint-controller	update
2024-08-16T16:43:47.108903Z	system:serviceaccount:kube-system:endpoint-controller	update

We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23.", an UPDATE was received/treated which may have helped the apiserver sync its endpoints cache or/and chose a healthy Pod

Also, no update was triggered when the second Pod was deleted at "16:43:17" which may explain the stale 10.128.2.31 endpoints entry on apiserver side.

To summarize, I can see two problems here (maybe one is the consequence of the other):

    A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638
    The apiserver resolver had a endpoints cache with one stale and one fresh entry but it kept 4-5 times in a row trying to reach the stale entry OR
    The endpoints was updated "At around 16:39:XX the first Pod was rolled out, see above", but the apiserver resolver cache missed that and ended up with 2 stale entries in the cache, and had to wait until "At around 16:43:23, the new Pod that replaced the second one was created, see above" to sync and replace them with 2 fresh entries.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

    1. See "Description of problem"
    2.
    3.

Actual results:

Expected results:

the kube-aggregator should detect stale Apiservice endpoints.

Additional info:

the kube-aggregator proxies requests to a stale Endpoints/Pod which makes Metrics API requests falsely fail.

blocks

OCPBUGS-41717 Kube-aggregator reaching stale apiservice endpoints

Closed

clones

OCPBUGS-39133 Kube-aggregator reaching stale apiservice endpoints

Closed

is blocked by

OCPBUGS-39133 Kube-aggregator reaching stale apiservice endpoints

Closed

is cloned by

OCPBUGS-41717 Kube-aggregator reaching stale apiservice endpoints

Closed

links to

openshift/cluster-monitoring-operator#2463: OCPBUGS-41580: Configure graceful shutdown for metrics-server (4.17 backport)

openshift/kubernetes-metrics-server#35: OCPBUGS-41580: Wire server run options to flags.

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

(2 links to)

Assignee:: Ayoub Mrini

Reporter:: OpenShift Prow Bot

QA Contact:: Junqi Zhao

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Created:: 2024/09/10 7:24 AM

Updated:: 2024/10/01 5:44 PM

Resolved:: 2024/10/01 5:44 PM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates