Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14
Component/s: apiserver-auth
Labels:
- service-delivery-impact
- service-delivery-prio-asks

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Important
Regression: None

Target Backport Versions: None
Target Version: None
Release Blocker: None
Sprint: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status: None
Release Note Type: None
Release Note Text: None

Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem:

On a cluster with three master nodes, some pods that land on master-2 consistently fail to connect to the Kubernetes API server.

Version-Release number of selected component (if applicable):

ARO 4.14.43

Troubleshooting steps:

The control plane nodes were rebooted.

No zombie processes were found.

No OpenShift certificates are expired.

Actual results:

Certain pods have issues when they land on master-2, but as soon as a pod moves to another master node, it works fine.

Example: On July 16th, the etcd operator pod was failing with errors such as "Error create client failure: failed to make etcd client for endpoints", which caused the cluster operator to flap. We recreated the pod, it landed on another master, and it has been stable since then.

The same happened to the Authentication Operator pod, which was landing on master-2 and failing with "WellKnownReadyController reconciliation failed: failed to GET kube-apiserver oauth endpoint https://10.211.11.7:6443/.well-known/oauth-authorization-server: dial tcp 10.211.11.7:6443: i/o timeout". Once it moved to another master, the error went away.

So right now no cluster operator (CO) is flapping, but several pods that land on master-2 clearly still have issues.
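One way to check whether the timeouts are specific to master-2 is to attempt the same TCP connection from each master node and compare the outcomes. The following is a minimal Python sketch and was not part of the original investigation; the function name `probe_tcp` and the 3-second default timeout are illustrative assumptions.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome.

    Returns (status, seconds), where status is "open", "refused",
    "timeout", or "error:<ExceptionName>".
    """
    start = time.monotonic()
    try:
        # create_connection performs only the TCP handshake, no TLS.
        with socket.create_connection((host, port), timeout=timeout):
            status = "open"
    except socket.timeout:
        status = "timeout"
    except ConnectionRefusedError:
        status = "refused"
    except OSError as exc:
        status = f"error:{type(exc).__name__}"
    return status, time.monotonic() - start
```

For example, running `probe_tcp("10.211.11.7", 6443)` (the endpoint from the Authentication Operator error) on each master would be expected to return "timeout" only where the network path is affected. A "refused" result would point to a different cause, since it means the host was reached but nothing accepted the connection.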

The clearest examples are:

In oauth-apiserver:

grpc: addrConn.createTransport failed to connect to {Addr: "10.211.11.11:2379", ServerName: "10.211.11.11", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.211.11.11:2379: i/o timeout"
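To see which destinations are affected across many pod logs, the "dial tcp <ip>:<port>: <reason>" fragments in messages like the ones quoted in this report can be tallied per endpoint. This is a hedged sketch only: the helper name `dial_failures` is hypothetical, and the pattern covers just the "i/o timeout" and "connect: connection refused" reasons, so other failure strings would need to be added.

```python
import re
from collections import Counter

# Matches Go-style dial errors such as
# "dial tcp 10.211.11.11:2379: i/o timeout".
DIAL_RE = re.compile(
    r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+): "
    r"(i/o timeout|connect: connection refused)"
)


def dial_failures(lines):
    """Count dial failures per (ip, port, reason) across log lines."""
    counts = Counter()
    for line in lines:
        for ip, port, reason in DIAL_RE.findall(line):
            counts[(ip, int(port), reason)] += 1
    return counts
```

Feeding it the logs of pods on each master separately would show whether the failing destinations are the same everywhere or appear only in logs collected from master-2.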

Currently present in the kube-rbac-proxy containers of the dns-default, network-metrics-daemon, and multus-admission-controller pods on master-2:

webhook.go:154 Failed to make webhook authenticator request: Post "https://172.28.128.1:443/apis/authentication.k8s.io/v1/tokenreviews": net/http: TLS handshake timeout
auth.go:47 Unable to authenticate the request due to an error: context canceled
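A "TLS handshake timeout" differs from the "dial tcp ... i/o timeout" errors in that the TCP connection is established but the handshake never completes. The sketch below separates the two phases so they can be compared from different nodes. It is an illustration under stated assumptions rather than a tool used in this investigation: `probe_tls` is a hypothetical name, and certificate verification is deliberately disabled because only handshake completion is being tested.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Report the TCP connect result and the TLS handshake result separately."""
    result = {"tcp": "not attempted", "tls": "not attempted"}
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        result["tcp"] = "timeout"
        return result
    except OSError as exc:
        result["tcp"] = f"error:{type(exc).__name__}"
        return result
    result["tcp"] = "ok"

    # Verification is off on purpose: this checks reachability, not trust.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        # wrap_socket runs the handshake using the socket's existing timeout.
        with ctx.wrap_socket(sock):
            result["tls"] = "ok"
    except socket.timeout:
        result["tls"] = "timeout"
    except (ssl.SSLError, OSError) as exc:
        result["tls"] = f"error:{type(exc).__name__}"
    finally:
        sock.close()
    return result
```

A result of `{"tcp": "ok", "tls": "timeout"}` against 172.28.128.1:443 would correspond to the kube-rbac-proxy error above, which suggests that small packets get through while the larger handshake exchange stalls somewhere on the path.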

As things stand, it does not seem safe to upgrade the cluster.

Assignee: Unassigned

Reporter: Hevellyn Gomes

QA Contact: Xingxing Xia

Need Info From: None

Votes: 0

Watchers: 4

Created: 2025/07/29 11:25 AM

Updated: 2025/11/19 12:51 PM

Resolved: 2025/11/19 12:51 PM
