Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-41717

Kube-aggregator reaching stale apiservice endpoints

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done-Errata
    • Icon: Critical Critical
    • 4.16.z
    • 4.17
    • Monitoring
    • Yes
    • MON Sprint 259
    • 1
    • Rejected
    • False
    • Hide

      None

      Show
      None
    • N/A
    • Release Note Not Required
    • In Progress

      This is a clone of issue OCPBUGS-41580. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-39133. The following is the description of the original issue:

      Description of problem:

      Debugging https://issues.redhat.com/browse/OCPBUGS-36808 (the Metrics API failing some of the disruption checks) and taking https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808 as a reproducer of the issue, I think the Kube-aggregator is behind the problem.
      
      According to the disruption checks which forward some relevant errors from the apiserver in the logs, looking at one of the new-connections check failures (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-cluster-monitoring-operator-2439-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1824454734052855808/artifacts/e2e-aws-ovn-upgrade-2/openshift-e2e-test/artifacts/junit/backend-disruption_20240816-155051.json)
      
      > 	"Aug 16 *16:43:17.672* - 2s    E backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests reason/DisruptionBegan request-audit-id/c62b7d32-856f-49de-86f5-1daed55326b2 backend-disruption-name/metrics-api-new-connections connection/new disruption/openshift-tests stopped responding to GET requests over new connections: error running request: 503 Service Unavailable: error trying to reach service: dial tcp 10.128.2.31:10250: connect: connection refused"
      
      The "error trying to reach service" part comes from: https://github.com/kubernetes/kubernetes/blob/b3c725627b15bb69fca01b70848f3427aca4c3ef/staging/src/k8s.io/apimachinery/pkg/util/proxy/transport.go#L105, the apiserver failing to reach the metrics-server Pod, the problem is that the IP "10.128.2.31" corresponds to a Pod that was deleted some milliseconds before (as part of a node update/draining), as we can see in:
      
      > 2024-08-16T16:19:43.087Z|00195|binding|INFO|openshift-monitoring_metrics-server-7b9d8c5ddb-dtsmr: Claiming 0a:58:0a:80:02:1f 10.128.2.31
      ...
      I0816 *16:43:17.650083*    2240 kubelet.go:2453] "SyncLoop DELETE" source="api" pods=["openshift-monitoring/metrics-server-7b9d8c5ddb-dtsmr"]
      ...
      
      The apiserver was using a stale IP to reach a Pod that no longer exists, even though a new Pod that had already replaced the other Pod (Metrics API backend runs on 2 Pods), some minutes before, was available.
      According to OVN, a fresher IP 10.131.0.12 of that Pod was already in the endpoints at that time:
      
      > I0816 16:40:24.711048    4651 lb_config.go:1018] Cluster endpoints for openshift-monitoring/metrics-server are: map[TCP/https:{10250 [10.128.2.31 10.131.0.12] []}]
      
      *I think, when "10.128.2.31" failed, the apiserver should have fallen back to "10.131.0.12", maybe it waits for some time/retries before doing so, or maybe it wasn't even aware of "10.131.0.12"*
      
      AFAIU, we have "--enable-aggregator-routing" set by default https://github.com/openshift/cluster-kube-apiserver-operator/blob/37df1b1f80d3be6036b9e31975ac42fcb21b6447/bindata/assets/config/defaultconfig.yaml#L101-L103 on the apiservers, so instead of forwarding to the metrics-server's service, apiserver directly reaches the Pods.
      
      For that it keeps track of the relevant services and endpoints https://github.com/kubernetes/kubernetes/blob/ad8a5f5994c0949b5da4240006d938e533834987/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L40
      
      bad decisions may be made if the if the services and/or endpoints cache are stale.
      
      Looking at the metrics-server (the Metrics API backend) endpoints changes in the apiserver audit logs:
      
      > $ grep -hr Event . | grep "endpoints/metrics-server" | jq -c 'select( .verb | match("watch|update"))' | jq -r '[.requestReceivedTimestamp,.user.username,.verb] | @tsv' | sort
      2024-08-16T15:39:57.575468Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T15:40:02.005051Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T15:40:35.085330Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T15:40:35.128519Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:19:41.148148Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:19:47.797420Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:20:23.051594Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:20:23.100761Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:20:23.938927Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:21:01.699722Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:39:00.328312Z	system:serviceaccount:kube-system:endpoint-controller	update ==> At around 16:39:XX the first Pod was rolled out
      2024-08-16T16:39:07.260823Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:39:41.124449Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:43:23.701015Z	system:serviceaccount:kube-system:endpoint-controller	update ==> At around 16:43:23, the new Pod that replaced the second one was created
      2024-08-16T16:43:28.639793Z	system:serviceaccount:kube-system:endpoint-controller	update
      2024-08-16T16:43:47.108903Z	system:serviceaccount:kube-system:endpoint-controller	update
      
      We can see that just before the new-connections checks succeeded again at around "2024-08-16T16:43:23.", an UPDATE was received/treated which may have helped the apiserver sync its endpoints cache or/and chose a healthy Pod
      
      Also, no update was triggered when the second Pod was deleted at "16:43:17" which may explain the stale 10.128.2.31 endpoints entry on apiserver side.
      
      To summarize, I can see two problems here (maybe one is the consequence of the other):
      
          A Pod was deleted and an Endpoint pointing to it wasn't updated. Apparently the Endpoints controller had/has some sync issues https://github.com/kubernetes/kubernetes/issues/125638
          The apiserver resolver had a endpoints cache with one stale and one fresh entry but it kept 4-5 times in a row trying to reach the stale entry OR
          The endpoints was updated "At around 16:39:XX the first Pod was rolled out, see above", but the apiserver resolver cache missed that and ended up with 2 stale entries in the cache, and had to wait until "At around 16:43:23, the new Pod that replaced the second one was created, see above" to sync and replace them with 2 fresh entries.
      
      
          

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1. See "Description of problem"
          2.
          3.
          

      Actual results:

          

      Expected results:

      the kube-aggregator should detect stale Apiservice endpoints.
          

      Additional info:

      the kube-aggregator proxies requests to a stale Endpoints/Pod which makes Metrics API requests falsely fail.
          

            rh-ee-amrini Ayoub Mrini
            openshift-crt-jira-prow OpenShift Prow Bot
            Junqi Zhao Junqi Zhao
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: