OCPBUGS-39133

Kube-aggregator reaching stale apiservice endpoints

      TRT had a discussion. Given that the graph clearly shows the 4s+ regression from 4.16, we are marking this as a release blocker. Please help prioritize diagnosing/fixing the issue. Feel free to contact us if any extra info is needed.


      Blocker rejected from monitoring side, see https://issues.redhat.com/browse/OCPBUGS-39133?focusedId=25516461&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-25516461


      Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.

      The disruption checks forward some relevant errors from the apiserver into their logs. Looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json):

      > "Aug 16 16:43:17.672 - 2s E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"

      The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:

      > 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
      ...
      I0816 16:43:17.650083 2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
      ...

      The apiserver was using a stale IP to reach a Pod that no longer existed, even though a new Pod, which had replaced the other backend Pod (the Metrics API backend runs on 2 Pods) some minutes earlier, was available.
      According to OVN, a fresher IP of that Pod, 10.131.0.12, was already in the endpoints at that time:

      > I0816 16:40:24.711048 4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}]

      I think that when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12". Maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12".
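
      To make the fallback idea concrete, here is a minimal Go sketch of what such behaviour could look like (an assumption about a possible fix, not existing aggregator behaviour): when the first cached endpoint refuses the connection, try the remaining cached endpoints before surfacing a 503.

      {code:go}
      // Hypothetical fallback sketch (not the current kube-aggregator code): walk
      // the cached endpoint list and return the first backend that accepts a TCP
      // connection instead of failing on the first "connection refused".
      package sketch

      import (
          "fmt"
          "net"
          "time"
      )

      // dialFirstHealthy tries each cached endpoint in order, e.g.
      // ["10.128.2.31:10250", "10.131.0.12:10250"] in this incident.
      func dialFirstHealthy(endpoints []string) (net.Conn, error) {
          var lastErr error
          for _, ep := range endpoints {
              conn, err := net.DialTimeout("tcp", ep, 2*time.Second)
              if err == nil {
                  return conn, nil // healthy backend found; proxy the request here
              }
              lastErr = err // e.g. "connect: connection refused" for the stale Pod IP
          }
          return nil, fmt.Errorf("all cached endpoints failed: %w", lastErr)
      }
      {code}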

      AFAIU, we have "--enable-aggregator-routing" set by default on the apiservers (https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103), so instead of forwarding to the metrics-server's Service, the apiserver reaches the Pods directly.

      For that, it keeps track of the relevant Services and Endpoints (https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40), and bad routing decisions may be made if the Services and/or Endpoints cache is stale.
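
      As a rough illustration of why a stale cache translates directly into a dead backend IP, here is a simplified Go sketch of endpoint-based resolution (an approximation written for this report, not the actual resolver code): it only reads whatever the informer-backed Endpoints lister has cached and picks one address from it.

      {code:go}
      // Simplified sketch of endpoint-based routing (approximation, not the real
      // kube-aggregator resolver): pick one address for the service port from the
      // informer-backed Endpoints cache. If the cache still lists 10.128.2.31,
      // that dead IP is a perfectly valid candidate from the resolver's viewpoint.
      package sketch

      import (
          "fmt"
          "math/rand"

          listersv1 "k8s.io/client-go/listers/core/v1"
      )

      func resolveEndpoint(endpoints listersv1.EndpointsLister, namespace, name string, port int32) (string, error) {
          // Read from the local informer cache, not from the apiserver.
          eps, err := endpoints.Endpoints(namespace).Get(name)
          if err != nil {
              return "", err
          }
          var candidates []string
          for _, subset := range eps.Subsets {
              for _, p := range subset.Ports {
                  if p.Port != port {
                      continue
                  }
                  for _, addr := range subset.Addresses {
                      candidates = append(candidates, fmt.Sprintf("%s:%d", addr.IP, p.Port))
                  }
              }
          }
          if len(candidates) == 0 {
              return "", fmt.Errorf("no endpoints available for %s/%s", namespace, name)
          }
          // With [10.128.2.31 10.131.0.12] cached, the stale entry can be picked.
          return candidates[rand.Intn(len(candidates))], nil
      }
      {code}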

      Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:

      > $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
      2024-08-16T15:39:57.575468Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T15:40:02.005051Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T15:40:35.085330Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T15:40:35.128519Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:19:41.148148Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:19:47.797420Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:20:23.051594Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:20:23.100761Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:20:23.938927Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:21:01.699722Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:39:00.328312Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:39:XX the first Pod was rolled out
      2024-08-16T16:39:07.260823Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:39:41.124449Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:43:23.701015Z system:serviceaccount:kube-system:endpoint-controller update ==> At around 16:43:23, the new Pod that replaced the second one was created
      2024-08-16T16:43:28.639793Z system:serviceaccount:kube-system:endpoint-controller update
      2024-08-16T16:43:47.108903Z system:serviceaccount:kube-system:endpoint-controller update

      We can see that, just before the new-connections checks succeeded again at around "2024-08-16T16:43:23", an UPDATE was received/processed, which may have helped the apiserver sync its endpoints cache and/or choose a healthy Pod.

      Also, no update was triggered when the second Pod was deleted at "16:43:17", which may explain the stale 10.128.2.31 endpoints entry on the apiserver side.
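
      For context on the "sync its endpoints cache" part: with aggregator routing, the routing decision is only as fresh as the Endpoints watch events the apiserver has processed. A minimal client-go sketch of that pattern (illustrative wiring only, not the aggregator's actual code; the namespace/name are taken from this incident):

      {code:go}
      // Illustrative sketch: the local cache only changes when the Endpoints
      // watch delivers an event. If no UPDATE arrives after a backend Pod is
      // deleted, the stale address (10.128.2.31 here) remains a routing candidate.
      package sketch

      import (
          "fmt"
          "time"

          corev1 "k8s.io/api/core/v1"
          "k8s.io/client-go/informers"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/cache"
      )

      func watchMetricsServerEndpoints(client kubernetes.Interface, stop <-chan struct{}) {
          factory := informers.NewSharedInformerFactoryWithOptions(client, 10*time.Minute,
              informers.WithNamespace("openshift-monitoring"))
          informer := factory.Core().V1().Endpoints().Informer()

          informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
              UpdateFunc: func(_, newObj interface{}) {
                  eps := newObj.(*corev1.Endpoints)
                  if eps.Name != "metrics-server" {
                      return
                  }
                  // Only at this point does the local cache learn that an
                  // address such as 10.128.2.31 is gone.
                  fmt.Printf("metrics-server endpoints updated: %v\n", eps.Subsets)
              },
          })

          factory.Start(stop)
          factory.WaitForCacheSync(stop)
      }
      {code}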

      To summarize, I can see two problems here (maybe one is the consequence of the other):

      1. A Pod was deleted and the Endpoints entry pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues: https://github.com/kubernetes/kubernetes/issues/125638
      2. The apiserver resolver had an endpoints cache with one stale and one fresh entry, but it kept trying to reach the stale entry 4-5 times in a row; OR the endpoints were updated (at around 16:39:XX, when the first Pod was rolled out, see above), but the apiserver resolver cache missed that, ended up with 2 stale entries, and had to wait until around 16:43:23 (when the new Pod that replaced the second one was created, see above) to sync and replace them with 2 fresh entries.

      Version-Release number of selected component (if applicable):
          

      How reproducible:

          

      Steps to Reproduce:

          1. See "Description of problem"
          

      Actual results:

          

      Expected results:

      The kube-aggregator should detect stale APIService endpoints.
          

      Additional info:

      The kube-aggregator proxies requests to a stale Endpoints entry/Pod, which makes Metrics API requests falsely fail.
          
