Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14.z, 4.15.z, 4.16.z, 4.17.z, 4.18.z
Component/s: Cloud Compute / Machine CSR Approver
Labels: None
Regression: None
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

When kubernetes.io/kube-apiserver-client-kubelet CSRs are not approved, they are recreated with exponential backoff, and the old CSRs are abandoned in the Pending state. This usually causes no problem because CSRs are normally approved quickly. During large machine scale-ups, however, adding IP addresses to the Machine object (required to map a node to its machine for CSR validation) can be slow, which delays CSR approval. Eventually the number of pending CSRs can exceed the maximum allowed, at which point the approver stops approving CSRs altogether.

The machine approver leaves invalid CSRs in the Pending state. This is intentional: it avoids rejecting valid CSRs that another controller should approve, and it allows manual approval by a system administrator.

Version-Release number of selected component (if applicable):

    I expect this issue to affect all versions.

How reproducible:

    Likely when scaling up more than 500 machines at once.

Steps to Reproduce:

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/58057/rehearse-58057-pull-ci-openshift-ovn-kubernetes-release-4.16-ovncore-perfscale-aws-ovn-xlarge-cluster-density-v2/1865957686453997568

Actual results:

    Machine approver logs:
sr-6mxzf: failed to find machine for node ip-10-0-5-76.ec2.internal, cannot approve
After the limit is reached:
csr-zt477: Pending CSRs: 1074; Max pending allowed: 856. Difference between pending CSRs and machines > 100. Ignoring all CSRs as too many recent pending CSR
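The limit reported in the second log line can be illustrated with a small comparison. This is only a sketch, assuming the maximum is the machine count plus a fixed margin of 100; that formula is inferred from the wording of the log message ("Max pending allowed: 856", "Difference between pending CSRs and machines > 100") and has not been checked against the approver source. The function name, its arguments, and the machine count of 756 (856 minus 100) are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper that mirrors the check implied by the log message.
# Assumption: max pending = machines + margin, with margin = 100
# (inferred from the log text, not taken from the approver source code).
csr_limit_status() {
  local pending=$1 machines=$2 margin=${3:-100}
  local max_pending=$((machines + margin))
  if [ "$pending" -gt "$max_pending" ]; then
    # Over the limit: the approver would ignore every CSR, valid or not.
    echo "exceeded: pending=$pending max=$max_pending"
  else
    echo "ok: pending=$pending max=$max_pending"
  fi
}

# Numbers from the log above; 756 machines is the value implied by max=856.
csr_limit_status 1074 756
```

Under that assumption, abandoned CSRs count toward the same total as live ones, so a node that keeps retrying pushes the cluster toward the limit even though only its newest CSR can still be used.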

Expected results:

    The cluster machine approver should handle scaling to 750 nodes at once.

Additional info:

Example machine timeline:
machine creationTimestamp: "2024-12-09T04:57:29Z"
last attempt to reconcile one of its csrs before CSR limit is reached: 2024-12-09T05:21:39.211911225Z
Addresses added to machine: "2024-12-09T05:29:44Z"

# List the subject (CN) and creation time of every dumped CSR, sorted.
for file in *.yaml; do
  ts=$(grep "creationTimestamp" "$file" | awk '{print $2}')
  cn=$(grep "request:" "$file" | awk '{print $2}' | base64 -d | openssl req -noout -text -in - | grep CN= || echo "No CN found")
  echo "$cn $ts"
done | sort

        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:23:48Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:23:58Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:39:01Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T05:54:05Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:09:13Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:24:30Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:40:02Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T06:56:02Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T07:11:30Z"
        Subject: O=system:nodes, CN=system:node:ip-10-0-99-70.ec2.internal "2024-12-09T07:26:58Z"
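To make the retry cadence in this listing easier to read, the gaps between consecutive CSRs can be computed from the same output. The following is a minimal sketch, not part of the original investigation: `csr_gaps` is a hypothetical helper, and it assumes the input is already sorted, contains a single node, and has all timestamps on the same UTC day (only HH:MM:SS is compared).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the sorted listing on stdin and prints, for each
# line after the first, the gap in seconds since the previous CSR.
# Assumption: all timestamps fall on the same UTC day.
csr_gaps() {
  awk '{
    ts = $NF                      # last field is the quoted creationTimestamp
    gsub(/"/, "", ts)             # strip the quotes carried over from the YAML
    split(ts, dt, "T")            # dt[2] is the time part, e.g. 05:23:48Z
    split(dt[2], hms, ":")        # hms[3] keeps the trailing Z; +0 drops it
    secs = hms[1] * 3600 + hms[2] * 60 + (hms[3] + 0)
    if (NR > 1) print secs - prev
    prev = secs
  }'
}
```

Piping the ten lines above through `csr_gaps` gives 10, 903, 904, 908, 917, 932, 960, 928, 928: after the first quick retry, this node created a new CSR roughly every 15 minutes, and each one stayed pending.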

Assignee: Radek Manak
Reporter: Radek Manak
QA Contact: Zhaohua Sun
Votes: 0
Watchers: 2
Created: 2024/12/20 10:41 PM
Updated: 2024/12/20 11:16 PM
