[OCPBUGS-15827] console operator degraded following service CA rotation by deleting the signing-key - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.16.0
Affects Version/s: 4.14
Component/s: Management Console
Labels:
- regression

Regression:
Yes
Blocked:
False
Blocked Reason:

None
Release Note Text:

* Previously, the console API conversion webhook server could not update its serving certificates at runtime, and failed if these certificates were updated by deleting the signing key. As a result, the console did not recover when CA certificates were rotated. With this update, the console conversion webhook server detects CA certificate changes and handles them at runtime. The server now remains available, and the console recovers as expected after CA certificates are rotated. (link:https://issues.redhat.com/browse/OCPBUGS-15827[*OCPBUGS-15827*])
Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.16.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Deleting the signing-key secret triggers a service CA rotation, which might temporarily disrupt cluster operators, but eventually all of them should recover. In recent 4.14 nightlies this is no longer the case. After deleting the signing-key using
oc delete secret/signing-key -n openshift-service-ca
the operators progress for a while, but eventually console and monitoring end up with Available=False and Degraded=True, a state that is only recoverable by manually deleting all the pods in the cluster.

console                                    4.14.0-0.nightly-2023-06-30-131338   False       False         True       159m    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable'

monitoring                                 4.14.0-0.nightly-2023-06-30-131338   False       True          True       161m    reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority

The same deletion on 4.14-ec.2 or earlier does not have this issue: those versions recover eventually without any manual pod deletion. I believe this is a regression.

Version-Release number of selected component (if applicable):

4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

How reproducible:

100%

Steps to Reproduce:

1. oc delete secret/signing-key -n openshift-service-ca
2. Wait at least 30 minutes.
3. Observe the output of oc get co.
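
Step 3 can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the default `oc get co` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE). The function name `unhealthy_operators` is hypothetical and not part of `oc`.

```shell
#!/bin/sh
# Print the name of every cluster operator that is unavailable or degraded.
# Reads rows in `oc get co --no-headers` format on stdin.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE.
unhealthy_operators() {
    # Skip rows with fewer than five fields (for example, blank lines).
    awk 'NF >= 5 && ($3 != "True" || $5 == "True") { print $1 }'
}

# Usage against a live cluster (not executed here):
#   oc get co --no-headers | unhealthy_operators
```

An empty result would indicate that every operator reports Available=True and Degraded=False. In the failure described in this bug, the output would be expected to list console and monitoring.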

Actual results:

The console and monitoring operators are degraded and do not recover.

Expected results:

The operators recover eventually, as in previous versions.

Additional info:

Manually deleting all pods recovers the cluster from this state:
# Delete every pod in every namespace, pausing briefly between namespaces.
for ns in $(oc get ns -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
    oc delete pods --all -n "$ns"
    sleep 1
done
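
Whether the cluster has recovered after the pod deletion can be confirmed by polling instead of re-running `oc get co` by hand. The sketch below is one way to do that under stated assumptions: `wait_until_quiet` is a hypothetical helper, not an `oc` feature, and the check command passed to it is supplied by the caller.

```shell
#!/bin/sh
# wait_until_quiet TIMEOUT INTERVAL CMD [ARGS...]
# Re-runs CMD every INTERVAL seconds. Returns 0 as soon as CMD succeeds and
# prints nothing; returns 1 once TIMEOUT seconds have elapsed without that.
wait_until_quiet() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while :; do
        # A failing CMD (for example, API server unreachable) is not "quiet".
        if out=$("$@" 2>/dev/null) && [ -z "$out" ]; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Usage sketch (not executed here): list_bad is a caller-defined check that
# prints one line per operator that is still unavailable or degraded.
#   wait_until_quiet 1800 30 list_bad && echo "all operators recovered"
```

The timeout of 1800 seconds mirrors the 30-minute wait in the reproduction steps and is only an illustrative value.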

must-gather:
https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing

is cloned by

OCPBUGS-48750 console operator degraded following service CA rotation by deleting the signing-key

ASSIGNED

OCPBUGS-26983 monitoring-plugin becomes unavailable after forcing the rotation of the service's certificate

Closed

is duplicated by

OCPBUGS-21818 After deleting and recreating default CA certificate - route not able due to bad certificate

Closed

links to

openshift/console-operator#822: OCPBUGS-15827: Update console conversion webhook server to use sig.k8s.io certwatcher

openshift/console-operator#831: Revert #822 "OCPBUGS-15827: Update console conversion webhook server to use sig.k8s.io certwatcher"

openshift/console-operator#833: OCPBUGS-15827: Revert #831 and fix cluster proxy annotation on console conversion webhook deployment

RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update


Assignee: Jon Jackson

Reporter: Evgeni Vakhonin

QA Contact: Yanping Zhang

Votes: 0

Watchers: 11

Created: 2023/07/05 3:47 PM

Updated: 2025/01/22 3:52 PM

Resolved: 2024/06/27 11:31 AM
