OCPBUGS-15827

console operator degraded following service CA rotation by deleting the signing-key

      * Previously, the console API conversion webhook server could not update serving certificates at runtime, and would fail if these certificates were updated by deleting the signing key. As a result, the console did not recover when CA certificates were rotated. With this update, the console conversion webhook server detects CA certificate changes and handles them at runtime. The server now remains available and the console recovers as expected after CA certificates are rotated. (link:https://issues.redhat.com/browse/OCPBUGS-15827[*OCPBUGS-15827*])
    • Bug Fix
    • Done

      Description of problem:

      Deleting the signing key triggers a service CA rotation. This can temporarily disrupt cluster operators, but all of them should eventually recover. In recent 4.14 nightlies, this is no longer the case. After deleting the signing key with
      oc delete secret/signing-key -n openshift-service-ca
      the operators progress for a while, but console and monitoring eventually end up with Available=False and Degraded=True. This state is recoverable only by manually deleting all pods in the cluster.
      console                                    4.14.0-0.nightly-2023-06-30-131338   False       False         True       159m    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable' 
      monitoring                                 4.14.0-0.nightly-2023-06-30-131338   False       True          True       161m    reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      The same deletion on 4.14-ec.2 or earlier does not have this issue: the cluster eventually recovers without any manual pod deletion. I believe this is a regression.

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

      How reproducible:

      100%

      Steps to Reproduce:

      1. oc delete secret/signing-key -n openshift-service-ca
      2. Wait at least 30 minutes.
      3. Observe the output of oc get co.
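
Step 3 can be narrowed to just the affected operators. The following is a minimal sketch, not part of the original report: `unhealthy_operators` is a hypothetical helper name, and it assumes the usual `oc get co` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE) with the VERSION column populated on every row.

```shell
# Hypothetical helper (not an oc subcommand): reads `oc get co --no-headers`
# rows on stdin and prints the name of every operator that reports
# AVAILABLE=False or DEGRADED=True.
# Assumed columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
unhealthy_operators() {
  awk '$3 == "False" || $5 == "True" { print $1 }'
}
```

Assuming a logged-in session, it could be used as `oc get co --no-headers | unhealthy_operators`. In the state described in this report, the expectation is that it lists console and monitoring.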
      

      Actual results:

      console and monitoring are degraded and do not recover

      Expected results:

      the operators eventually recover, as in previous versions

      Additional info:

      The cluster can be recovered from this state by manually deleting all pods, as follows:
      # Delete every pod in every namespace, pausing briefly between namespaces.
      for ns in $(oc get ns -o jsonpath='{.items[*].metadata.name}'); do
            oc delete pods --all -n "$ns"
            sleep 1
      done

       

      must-gather:
      https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing

       

            rh-ee-jonjacks Jon Jackson
            evakhoni@redhat.com Evgeni Vakhonin
            Yanping Zhang Yanping Zhang