OCPBUGS-15827

console operator degraded following service CA rotation by deleting the signing-key

      The console API conversion webhook server could not update serving certs at runtime, and would therefore fail if these certs were updated by deleting the signing key. This would cause the console never to recover when CA certs were rotated. We updated the console conversion webhook server to detect CA cert changes, and handle them at runtime. The server now remains available and the console recovers as expected after CA certs are rotated.
    • Bug Fix
    • In Progress

      Description of problem:

      Following signing-key deletion, a service CA rotation process runs. It might temporarily disrupt cluster operators, but eventually all of them should recover. In recent 4.14 nightlies, however, this is no longer the case. After deleting the signing-key using
      oc delete secret/signing-key -n openshift-service-ca
      operators progress for a while, but eventually console and monitoring end up in Available=False and Degraded=True, which is recoverable only by manually deleting all the pods in the cluster.
      console                                    4.14.0-0.nightly-2023-06-30-131338   False       False         True       159m    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable' 
      monitoring                                 4.14.0-0.nightly-2023-06-30-131338   False       True          True       161m    reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      The same deletion on earlier versions (4.14-ec.2 or earlier) does not have this issue: the cluster eventually recovers without any manual pod deletion. I believe this is a regression.
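The x509 error in the monitoring status above says the caller could not chain the webhook's serving certificate to a CA it trusts. A minimal sketch for checking that by hand, assuming `openssl` is available. The helper name `cert_trusted_by` and the file names are hypothetical, and the `openshift-service-ca.crt` configmap is assumed to hold the current service CA bundle in the namespace:

```shell
# Hypothetical helper: return 0 if the PEM certificate in $1 chains to the
# CA bundle in $2, non-zero otherwise.
cert_trusted_by() {
  openssl verify -CAfile "$2" "$1" >/dev/null 2>&1
}

# Possible usage from a pod with in-cluster network access (not run here).
# The host and port are taken from the error message above; the configmap
# name is an assumption about where the current service CA is published.
#   oc get configmap openshift-service-ca.crt -n openshift-console-operator \
#     -o jsonpath='{.data.service-ca\.crt}' > ca.crt
#   openssl s_client -connect webhook.openshift-console-operator.svc:9443 \
#     </dev/null 2>/dev/null | openssl x509 > served.crt
#   cert_trusted_by served.crt ca.crt || echo "webhook still serves a cert from the old CA"
```

Checking the certificate the server actually presents, rather than the one stored in its secret, matters here: the secret can already be regenerated while the running process keeps serving the old certificate.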

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

      How reproducible:

      100%

      Steps to Reproduce:

      1. oc delete secret/signing-key -n openshift-service-ca
      2. wait at least 30 minutes
      3. observe the output of oc get co
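Steps 2 and 3 can be sketched as a polling loop instead of a fixed wait. This is only an illustration: the function name `wait_for_operators` and its defaults are hypothetical, and it assumes the `oc get co` column order NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, as in the output quoted in the description.

```shell
# Hypothetical helper: poll `oc get co` until no operator is Available=False
# or Degraded=True, or until the time limit is reached.
#   $1 = time limit in seconds    (default 1800, the 30 minutes from step 2)
#   $2 = poll interval in seconds (default 60)
wait_for_operators() {
  limit=${1:-1800}
  interval=${2:-60}
  elapsed=0
  while :; do
    if rows=$(oc get co --no-headers 2>/dev/null); then
      # Column 3 is AVAILABLE, column 5 is DEGRADED (assumed layout).
      bad=$(printf '%s\n' "$rows" | awk '$3 == "False" || $5 == "True" { print $1 }')
      if [ -z "$bad" ]; then
        echo "all cluster operators healthy after ${elapsed}s"
        return 0
      fi
    else
      # The API may be briefly unreachable during rotation; keep polling.
      bad="(oc get co failed)"
    fi
    if [ "$elapsed" -ge "$limit" ]; then
      printf 'still unhealthy after %ss: %s\n' "$elapsed" "$(printf '%s' "$bad" | tr '\n' ' ')"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Reproduction on a disposable cluster (destructive, so left commented out):
#   oc delete secret/signing-key -n openshift-service-ca
#   wait_for_operators 1800 60
```

A non-zero return after the full time limit corresponds to the actual result reported here; a zero return corresponds to the expected one.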
      

      Actual results:

      console and monitoring are degraded and do not recover

      Expected results:

      all operators eventually recover, as in previous versions

      Additional info:

      The cluster can be recovered from this state by manually deleting all pods, as follows:
      # Delete every pod in every namespace, pausing one second between namespaces.
      for ns in $(oc get ns -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
          oc delete pods --all -n "$ns"
          sleep 1
      done

       

      must-gather:
      https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing

       

            rh-ee-jonjacks Jon Jackson
            evakhoni@redhat.com Evgeni Vakhonin
            Yanping Zhang Yanping Zhang