OpenShift Bugs / OCPBUGS-35925

Kube-apiserver pods with AWS KMS configuration get stuck after an expired token


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version/s: 4.15.z
    • Component/s: kube-apiserver
    • Quality / Stability / Reliability
    • Severity: Important

      Description of problem:

      The affected customer has AWS KMS configured. The "aws-kms-active" container (within the kube-apiserver pods) logs the following error:

      { "level": "error", "timestamp": "2024-06-21T13:24:57.037Z", "caller": "healthz/healthz.go:26", "message": "health check failed", "error": "failed to encrypt WebIdentityErr: failed to retrieve credentials\ncaused by: ExpiredTokenException: Token expired: current date/time 1718975997 must be before the expiration date/time 1718975896\n\tstatus code: 400, request id: 74359045-8196-487a-b596-dad926b1054e", "stacktrace": "sigs.k8s.io/aws-encryption-provider/pkg/healthz.(*handler).ServeHTTP\n\t/go/src/sigs.k8s.io/aws-encryption-provider/pkg/healthz/healthz.go:26\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2487\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2947\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1991" } 

      After that, the kube-apiserver container logs an error when trying to connect to the KMS plugin's unix socket:

      E0621 14:20:33.525842       1 transformer.go:163] "failed to decrypt data" err="failed get version from remote KMS provider: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing: dial unix /var/run/awskmsactive.sock: connect: connection refused\"" 
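
      For illustration, a minimal Go sketch (hypothetical, not part of the product) of how one could check from inside the pod whether anything is still listening on that socket path:

// Hypothetical diagnostic, not part of the product: check whether anything is
// listening on the KMS plugin socket that kube-apiserver fails to dial above.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	const socketPath = "/var/run/awskmsactive.sock" // path from the error message above

	conn, err := net.DialTimeout("unix", socketPath, 3*time.Second)
	if err != nil {
		// This is the state described in this bug: the socket file exists,
		// but no process is serving it, so the dial is refused.
		fmt.Fprintf(os.Stderr, "socket not reachable: %v\n", err)
		os.Exit(1)
	}
	conn.Close()
	fmt.Println("socket accepts connections")
}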

      And the "aws-kms-active" also logs "address already in use" when trying to bind to the unix socket

      {
        "level": "fatal",
        "timestamp": "2024-06-21T13:59:37.801Z",
        "caller": "server/main.go:98",
        "message": "Failed to start server",
        "error": "failed to create listener: listen unix /var/run/awskmsactive.sock: bind: address already in use",
        "stacktrace": "main.main.func2\n\t/go/src/sigs.k8s.io/aws-encryption-provider/cmd/server/main.go:98"
      }

      Deleting the kube-apiserver pods helped, probably because /var/run/awskmsactive.sock resides on an emptyDir volume: the socket file is not cleaned up when the pod restarts after an Error or CrashLoopBackOff state, but it is cleaned up when the pod is deleted and replaced by its ReplicaSet/Deployment.
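
      One common pattern that avoids this failure mode is for the server to remove a leftover socket file before binding. The sketch below is illustrative only and is not the aws-encryption-provider implementation:

// Illustrative sketch only (not the aws-encryption-provider code): remove a
// stale unix socket left behind by a previous container before binding, so a
// restart does not fail with "bind: address already in use".
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

func listenUnix(path string) (net.Listener, error) {
	// A socket file can survive a container restart when it lives on a volume
	// such as an emptyDir; unlink it first. A missing file is not an error.
	if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
		return nil, fmt.Errorf("removing stale socket %s: %w", path, err)
	}
	return net.Listen("unix", path)
}

func main() {
	ln, err := listenUnix("/var/run/awskmsactive.sock")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}

      A real fix would likely also verify that no live process is still serving the socket before unlinking it; the sketch only covers the stale-file case described above.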

      Version-Release number of selected component (if applicable):

      4.15.15

      How reproducible:

      Probably reproducible by letting the AWS KMS Web Identity token expire.

      Steps to Reproduce:

      1. Configure AWS KMS in kube-apiserver (see the configuration sketch after this list)
      2. Let Web Identity token expire
      3. Observe kube-apiserver
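
      For step 1, the upstream Kubernetes EncryptionConfiguration that wires a KMS provider to a unix socket looks roughly like the sketch below. The provider name and socket path are taken from the logs above; the other values are illustrative, and on OpenShift this configuration is managed by the platform rather than written by hand.

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # KMS provider served over the unix socket by the aws-kms-active container.
      - kms:
          name: aws-kms-active
          endpoint: unix:///var/run/awskmsactive.sock
          cachesize: 1000
          timeout: 3s
      # Fallback so data written before encryption was enabled can still be read.
      - identity: {}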
      

      Actual results:

      The "aws-kms-active" container tries to bind to the unix socket but fails because the socket address is already in use.

      Expected results:

      The "aws-kms-active" container either binds to the unix socket successfully, or the stale socket is replaced with a fresh one.

      Additional info:
      https://issues.redhat.com/browse/OHSS-35648
      A must-gather from the affected cluster can't be provided.
      I'll upload the kube-apiserver and aws-kms-active logs soon.

              Assignee: Unassigned
              Reporter: Leszek Jakubowski (ljakubow2.openshift)
              QA Contact: Ke Wang