OpenShift Bugs / OCPBUGS-35925

Kube-apiserver pods with AWS KMS configuration get stuck after an expired token


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version/s: 4.15.z
    • Component/s: kube-apiserver
    • Quality / Stability / Reliability
    • Severity: Important

      Description of problem:

      The affected customer has AWS KMS configured. The "aws-kms-active" container (within the kube-apiserver pods) logs the following error:

      { "level": "error", "timestamp": "2024-06-21T13:24:57.037Z", "caller": "healthz/healthz.go:26", "message": "health check failed", "error": "failed to encrypt WebIdentityErr: failed to retrieve credentials\ncaused by: ExpiredTokenException: Token expired: current date/time 1718975997 must be before the expiration date/time 1718975896\n\tstatus code: 400, request id: 74359045-8196-487a-b596-dad926b1054e", "stacktrace": "sigs.k8s.io/aws-encryption-provider/pkg/healthz.(*handler).ServeHTTP\n\t/go/src/sigs.k8s.io/aws-encryption-provider/pkg/healthz/healthz.go:26\nnet/http.(*ServeMux).ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2487\nnet/http.serverHandler.ServeHTTP\n\t/usr/lib/golang/src/net/http/server.go:2947\nnet/http.(*conn).serve\n\t/usr/lib/golang/src/net/http/server.go:1991" } 

      After that, the kube-apiserver container logs an error when trying to connect to the KMS plugin's unix socket:

      E0621 14:20:33.525842       1 transformer.go:163] "failed to decrypt data" err="failed get version from remote KMS provider: rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing: dial unix /var/run/awskmsactive.sock: connect: connection refused\"" 
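
      For illustration, a minimal Go sketch (hypothetical, not part of the product) of how one could check from inside the pod whether anything is still listening on that socket path:

// Hypothetical diagnostic, not part of the product: check whether anything is
// listening on the KMS plugin socket that kube-apiserver fails to dial above.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	const socketPath = "/var/run/awskmsactive.sock" // path from the error message above

	conn, err := net.DialTimeout("unix", socketPath, 3*time.Second)
	if err != nil {
		// This is the state described in this bug: the socket file exists,
		// but no process is serving it, so the dial is refused.
		fmt.Fprintf(os.Stderr, "socket not reachable: %v\n", err)
		os.Exit(1)
	}
	conn.Close()
	fmt.Println("socket accepts connections")
}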

      And the "aws-kms-active" also logs "address already in use" when trying to bind to the unix socket

      {
        "level": "fatal",
        "timestamp": "2024-06-21T13:59:37.801Z",
        "caller": "server/main.go:98",
        "message": "Failed to start server",
        "error": "failed to create listener: listen unix /var/run/awskmsactive.sock: bind: address already in use",
        "stacktrace": "main.main.func2\n\t/go/src/sigs.k8s.io/aws-encryption-provider/cmd/server/main.go:98"
      }

      Deleting the kube-apiserver pods helped, probably because /var/run/awskmsactive.sock resides on an emptyDir volume: the socket file is not cleaned up when the pod restarts after an Error or CrashLoopBackOff state, but it is cleaned up when the pod is deleted and replaced by its ReplicaSet/Deployment.
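
      One common pattern that avoids this failure mode is for the server to remove a leftover socket file before binding. The sketch below is illustrative only and is not the aws-encryption-provider implementation:

// Illustrative sketch only (not the aws-encryption-provider code): remove a
// stale unix socket left behind by a previous container before binding, so a
// restart does not fail with "bind: address already in use".
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

func listenUnix(path string) (net.Listener, error) {
	// A socket file can survive a container restart when it lives on a volume
	// such as an emptyDir; unlink it first. A missing file is not an error.
	if err := os.Remove(path); err != nil && !errors.Is(err, os.ErrNotExist) {
		return nil, fmt.Errorf("removing stale socket %s: %w", path, err)
	}
	return net.Listen("unix", path)
}

func main() {
	ln, err := listenUnix("/var/run/awskmsactive.sock")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}

      A real fix would likely also verify that no live process is still serving the socket before unlinking it; the sketch only covers the stale-file case described above.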

      Version-Release number of selected component (if applicable):

      4.15.15

      How reproducible:

      Probably reproducible by letting the AWS KMS Web Identity token expire.

      Steps to Reproduce:

      1. Configure AWS KMS in kube-apiserver (see the configuration sketch after this list)
      2. Let Web Identity token expire
      3. Observe kube-apiserver
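
      For step 1, the upstream Kubernetes EncryptionConfiguration that wires a KMS provider to a unix socket looks roughly like the sketch below. The provider name and socket path are taken from the logs above; the other values are illustrative, and on OpenShift this configuration is managed by the platform rather than written by hand.

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      # KMS provider served over the unix socket by the aws-kms-active container.
      - kms:
          name: aws-kms-active
          endpoint: unix:///var/run/awskmsactive.sock
          cachesize: 1000
          timeout: 3s
      # Fallback so data written before encryption was enabled can still be read.
      - identity: {}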
      

      Actual results:

      The "aws-kms-active" container tries to bind to the unix socket but fails because the socket address is already in use.

      Expected results:

      The "aws-kms-active" container either binds to the unix socket successfully, or the stale socket is replaced with a fresh one.

      Additional info:
      https://issues.redhat.com/browse/OHSS-35648
      A must-gather from the affected cluster can't be provided.
      I'll upload the kube-apiserver and aws-kms-active logs soon.

              Assignee: Unassigned
              Reporter: Leszek Jakubowski (ljakubow2.openshift)
              QA Contact: Ke Wang