OpenShift Bugs / OCPBUGS-31384

api-int Certificate Authority rotation during 4.14.17 to 4.15.3 update


Details

    • Critical
    • Yes
    • Proposed
    • False
    • Clusters born before 4.7 will have signer keys for the api-int endpoint updated unexpectedly on upgrade to 4.16
    • Bug Fix
    • In Progress

    Description

      Description of problem:

      In a cluster updating from 4.5.11 through many intermediate versions to 4.14.17 and on to 4.15.3 (initiated 2024-03-18T07:33:11Z), multus pods are failing to verify the api-int X.509 serving certificate:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver/core/events.yaml <hivei01ue1.inspect.local.5020316083985214391.gz | yaml2json | jq -r '[.items[] | select(.reason == "FailedCreatePodSandBox")][0].message'
      (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_installer-928-ip-10-164-221-242.ec2.internal_openshift-kube-apiserver_9e87f20b-471a-447e-9679-edce26b4ef78_0(8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c): error adding pod openshift-kube-apiserver_installer-928-ip-10-164-221-242.ec2.internal to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c Netns:/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78 Path: StdinData:[REDACTED]} ContainerID:"8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c" Netns:"/var/run/netns/6e2b0b10-5006-4bf9-bd74-17333e0cdceb" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-apiserver;K8S_POD_NAME=installer-928-ip-10-164-221-242.ec2.internal;K8S_POD_INFRA_CONTAINER_ID=8322d383c477c29fe0221fdca5eaf5ca5b2f57f8a7077c7dd7d2861be0f5288c;K8S_POD_UID=9e87f20b-471a-447e-9679-edce26b4ef78" Path:"" ERRORED: error configuring pod [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal] networking: Multus: [openshift-kube-apiserver/installer-928-ip-10-164-221-242.ec2.internal/9e87f20b-471a-447e-9679-edce26b4ef78]: error waiting for pod: Get "https://api-int.REDACTED:6443/api/v1/namespaces/openshift-kube-apiserver/pods/installer-928-ip-10-164-221-242.ec2.internal?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
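
      The same FailedCreatePodSandBox events can also be pulled from a live cluster rather than from an inspect archive; a minimal sketch, assuming admin access and that the events have not yet aged out:

      $ oc -n openshift-kube-apiserver get events \
          --field-selector reason=FailedCreatePodSandBox \
          -o jsonpath='{range .items[*]}{.lastTimestamp} {.message}{"\n"}{end}'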
      

      Version-Release number of selected component (if applicable)

      4.15.3, so we have the fix from 4.15.2's OCPBUGS-30304 but not the fix from 4.15.5's OCPBUGS-30237.
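
      To confirm which 4.15.z a cluster is actually running and how it arrived there, the update history can be read from the ClusterVersion object; a hedged example:

      $ oc get clusterversion version \
          -o jsonpath='{range .status.history[*]}{.startedTime} {.version} {.state}{"\n"}{end}'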

      How reproducible

      Seen in two clusters after updating from 4.14 to 4.15.3.

      Steps to Reproduce

      Unclear.

      Actual results

      Multus pods fail pod sandbox creation because the api-int serving certificate is signed by an authority the node does not trust.

      Expected results

      A healthy cluster: pod sandbox creation continues to succeed after the update.

      Additional info

      $ openssl s_client -showcerts -connect api-int.REDACTED:6443 < /dev/null
      ...
      Certificate chain
       0 s:CN = api-int.REDACTED
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 25 19:35:55 2024 GMT; NotAfter: Apr 24 19:35:56 2024 GMT
      ...
       1 s:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         i:CN = openshift-kube-apiserver-operator_loadbalancer-serving-signer@1710747228
         a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
         v:NotBefore: Mar 18 07:33:47 2024 GMT; NotAfter: Mar 16 07:33:48 2034 GMT
      ...
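
      The signer in position 1 of that chain is managed by the kube-apiserver operator, and its current contents can be checked directly from the cluster. A hedged sketch, assuming the key pair lives in the loadbalancer-serving-signer secret in openshift-kube-apiserver-operator (consistent with the issuer CN above):

      $ oc -n openshift-kube-apiserver-operator get secret loadbalancer-serving-signer \
          -o jsonpath='{.data.tls\.crt}' | base64 -d \
          | openssl x509 -noout -subject -startdate -enddate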
      

      So the serving signer shown above was created seconds after the update was initiated (NotBefore 07:33:47Z vs. the 07:33:11Z update start). We have inspect logs for some namespaces, but they don't go back quite that far, because the machine-config roll at the end of the update into 4.15.3 rolled all the pods:

      $ tar -xOz inspect.local.5020316083985214391/namespaces/openshift-kube-apiserver-operator/pods/kube-apiserver-operator-6cbfdd467c-4ctq7/kube-apiserver-operator/kube-apiserver-operator/logs/current.log <hivei01ue1.inspect.local.5020316083985214391.gz | head -n2
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056255       1 cmd.go:241] Using service-serving-cert provided certificates
      2024-03-18T08:22:05.058253904Z I0318 08:22:05.056351       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
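
      To see whether a node still trusts only an older signer, the CA bundle embedded in the node's /etc/kubernetes/kubeconfig can be listed and compared against the chain served on api-int above. A minimal sketch, assuming a debug pod can still be scheduled on the node (it may not be, in this failure mode; the same check can be run over SSH instead); crl2pkcs7 is only used to print every certificate in the bundle rather than just the first:

      $ oc debug node/<node-name> -- chroot /host \
          grep certificate-authority-data /etc/kubernetes/kubeconfig \
          | awk '{print $2}' | base64 -d \
          | openssl crl2pkcs7 -nocrl -certfile /dev/stdin \
          | openssl pkcs7 -print_certs -noout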
      

      We were able to recover individual nodes via the following steps (a consolidated sketch follows the list):

      1. On any machine with an admin kubeconfig, run oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig.
      2. Copy the resulting file to each node as /etc/kubernetes/kubeconfig.
      3. On each node, rm /var/lib/kubelet/kubeconfig.
      4. Restart each node.
      5. Approve each kubelet CSR.
      6. Delete the node's multus-* pod.
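
      A consolidated, hedged sketch of the per-node portion of that recovery; <node-name> is a placeholder, the exact multus pod name varies per node, and the blanket CSR approval below approves everything pending, so review the CSR list first:

      $ oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig
      $ scp bootstrap.kubeconfig core@<node-name>:/tmp/bootstrap.kubeconfig
      $ ssh core@<node-name> 'sudo install -m 0600 /tmp/bootstrap.kubeconfig /etc/kubernetes/kubeconfig &&
          sudo rm /var/lib/kubelet/kubeconfig &&
          sudo systemctl reboot'
      # once the node is back and its kubelet CSRs appear:
      $ oc get csr -o name | xargs oc adm certificate approve
      $ oc -n openshift-multus delete pod <multus-pod-on-that-node>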

          People

            vrutkovs@redhat.com Vadim Rutkovsky
            trking W. Trevor King
            Ke Wang Ke Wang