OCPBUGS-39133

Kube-aggregator reaching stale apiservice endpoints

      TRT had a discussion. Given that the graph clearly shows the 4s+ regression from 4.16, we are marking this as a release blocker. Please help prioritize diagnosing/fixing the issue. Feel free to contact us if any extra info is needed.


      Blocker rejected from monitoring side, see https://issues.redhat.com/browse/OCPBUGS-39133?focusedId=25516461&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25516461


      Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.

      The disruption checks forward some relevant errors from the apiserver into their logs. Looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json):

      > "Aug 16 16:43:17.672 - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"

      The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:

      > 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
      ...
      I0816 16:43:17.650083 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
      ...

      The apiserver was using a stale IP to reach a Pod that no longer existed, even though a new Pod, which had replaced the other backend Pod (the Metrics API backend runs on 2 Pods) some minutes earlier, was available.
      According to OVN, a fresher IP of that Pod, 10.131.0.12, was already in the endpoints at that time:

      > I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}]

      I think that when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12". Maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12".
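
      To make the fallback idea concrete, here is a minimal Go sketch of what such behaviour could look like (an assumption about a possible fix, not existing aggregator behaviour): when the first cached endpoint refuses the connection, try the remaining cached endpoints before surfacing a 503.

      {code:go}
      // Hypothetical fallback sketch (not the current kube-aggregator code): walk
      // the cached endpoint list and return the first backend that accepts a TCP
      // connection instead of failing on the first "connection refused".
      package sketch

      import (
          "fmt"
          "net"
          "time"
      )

      // dialFirstHealthy tries each cached endpoint in order, e.g.
      // ["10.128.2.31:10250", "10.131.0.12:10250"] in this incident.
      func dialFirstHealthy(endpoints []string) (net.Conn, error) {
          var lastErr error
          for _, ep := range endpoints {
              conn, err := net.DialTimeout("tcp", ep, 2*time.Second)
              if err == nil {
                  return conn, nil // healthy backend found; proxy the request here
              }
              lastErr = err // e.g. "connect: connection refused" for the stale Pod IP
          }
          return nil, fmt.Errorf("all cached endpoints failed: %w", lastErr)
      }
      {code}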

      AFAIU, we have "--enable-aggregator-routing" set by default on the apiservers (https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103), so instead of forwarding to the metrics-server's Service, the apiserver reaches the Pods directly.

      For that, it keeps track of the relevant Services and Endpoints (https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40), and bad routing decisions may be made if the Services and/or Endpoints cache is stale.
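
      As a rough illustration of why a stale cache translates directly into a dead backend IP, here is a simplified Go sketch of endpoint-based resolution (an approximation written for this report, not the actual resolver code): it only reads whatever the informer-backed Endpoints lister has cached and picks one address from it.

      {code:go}
      // Simplified sketch of endpoint-based routing (approximation, not the real
      // kube-aggregator resolver): pick one address for the service port from the
      // informer-backed Endpoints cache. If the cache still lists 10.128.2.31,
      // that dead IP is a perfectly valid candidate from the resolver's viewpoint.
      package sketch

      import (
          "fmt"
          "math/rand"

          listersv1 "k8s.io/client-go/listers/core/v1"
      )

      func resolveEndpoint(endpoints listersv1.EndpointsLister, namespace, name string, port int32) (string, error) {
          // Read from the local informer cache, not from the apiserver.
          eps, err := endpoints.Endpoints(namespace).Get(name)
          if err != nil {
              return "", err
          }
          var candidates []string
          for _, subset := range eps.Subsets {
              for _, p := range subset.Ports {
                  if p.Port != port {
                      continue
                  }
                  for _, addr := range subset.Addresses {
                      candidates = append(candidates, fmt.Sprintf("%s:%d", addr.IP, p.Port))
                  }
              }
          }
          if len(candidates) == 0 {
              return "", fmt.Errorf("no endpoints available for %s/%s", namespace, name)
          }
          // With [10.128.2.31 10.131.0.12] cached, the stale entry can be picked.
          return candidates[rand.Intn(len(candidates))], nil
      }
      {code}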

      Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:

      > $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
      2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out
      2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created
      2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update

      We can see that, just before the new-connections checks succeeded again at around "2024-08-16T16:43:23", an UPDATE was received/processed, which may have helped the apiserver sync its endpoints cache and/or choose a healthy Pod.

      Also, no update was triggered when the second Pod was deleted at "16:43:17", which may explain the stale 10.128.2.31 endpoints entry on the apiserver side.
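
      For context on the "sync its endpoints cache" part: with aggregator routing, the routing decision is only as fresh as the Endpoints watch events the apiserver has processed. A minimal client-go sketch of that pattern (illustrative wiring only, not the aggregator's actual code; the namespace/name are taken from this incident):

      {code:go}
      // Illustrative sketch: the local cache only changes when the Endpoints
      // watch delivers an event. If no UPDATE arrives after a backend Pod is
      // deleted, the stale address (10.128.2.31 here) remains a routing candidate.
      package sketch

      import (
          "fmt"
          "time"

          corev1 "k8s.io/api/core/v1"
          "k8s.io/client-go/informers"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/cache"
      )

      func watchMetricsServerEndpoints(client kubernetes.Interface, stop <-chan struct{}) {
          factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
              informers.WithNamespace("openshift-monitoring"))
          informer := factory.Core().V1().Endpoints().Informer()

          informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
              UpdateFunc: func(_, newObj interface{}) {
                  eps := newObj.(*corev1.Endpoints)
                  if eps.Name != "metrics-server" {
                      return
                  }
                  // Only at this point does the local cache learn that an
                  // address such as 10.128.2.31 is gone.
                  fmt.Printf("metrics-server endpoints updated: %v\n", eps.Subsets)
              },
          })

          factory.Start(stop)
          factory.WaitForCacheSync(stop)
      }
      {code}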

      To summarize, I can see two problems here (maybe one is the consequence of the other):

      1. A Pod was deleted and the Endpoints entry pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues: https://github.com/kubernetes/kubernetes/issues/125638
      2. The apiserver resolver had an endpoints cache with one stale and one fresh entry, but it kept trying to reach the stale entry 4-5 times in a row; OR the endpoints were updated (at around 16:39:XX, when the first Pod was rolled out, see above), but the apiserver resolver cache missed that, ended up with 2 stale entries, and had to wait until around 16:43:23 (when the new Pod that replaced the second one was created, see above) to sync and replace them with 2 fresh entries.

      Version-Release number of selected component (if applicable):
          

      How reproducible:

          

      Steps to Reproduce:

          1. See "Description of problem"
          

      Actual results:

          

      Expected results:

      The kube-aggregator should detect stale APIService endpoints.
          

      Additional info:

      The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests falsely fail.
          
