OCPBUGS-48750

console operator degraded following service CA rotation by deleting the signing-key

    • Quality / Stability / Reliability
    • In Progress
    • Bug Fix

      Description of problem:

      Deleting the signing-key triggers a service CA rotation, which may temporarily disrupt cluster operators, but eventually all of them should recover. In recent 4.14 nightlies, however, this is no longer the case. After deleting the signing-key with
      oc delete secret/signing-key -n openshift-service-ca
      operators progress for a while, but eventually console and monitoring end up with Available=False and Degraded=True, which is only recoverable by manually deleting all the pods in the cluster.
      console                                    4.14.0-0.nightly-2023-06-30-131338   False       False         True       159m    RouteHealthAvailable: route not yet available, https://console-openshift-console.apps.evakhoni-0412.qe.gcp.devcluster.openshift.com returns '503 Service Unavailable' 
      monitoring                                 4.14.0-0.nightly-2023-06-30-131338   False       True          True       161m    reconciling Console Plugin failed: retrieving ConsolePlugin object failed: conversion webhook for console.openshift.io/v1alpha1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority
      The same deletion on earlier versions (4.14-ec.2 or earlier) does not have this issue; the cluster recovers eventually without any manual pod deletion. I believe this is a regression.
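To pick the affected operators out of `oc get co` output, a filter along these lines may help. It is only a sketch: it assumes the default column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE), as in the two rows quoted above, and the function name `unhealthy_operators` is made up for illustration.

```shell
# Print the names of cluster operators that report Available=False or Degraded=True.
# Reads `oc get co` output on stdin; assumes the default column order.
# unhealthy_operators is a hypothetical helper name, not part of oc.
unhealthy_operators() {
    awk '$1 != "NAME" && ($3 == "False" || $5 == "True") { print $1 }'
}

# Usage (requires a logged-in oc session):
#   oc get co | unhealthy_operators
```

With the output quoted in this report, the filter would list console and monitoring.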

      Version-Release number of selected component (if applicable):

      4.14.0-0.nightly-2023-06-30-131338 and other recent 4.14 nightlies

      How reproducible:

      100%

      Steps to Reproduce:

      1. oc delete secret/signing-key -n openshift-service-ca
      2. wait at least 30 minutes
      3. observe oc get co
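Steps 2 and 3 can be scripted roughly as follows. This is a sketch under assumptions, not a verified test: `observe_operators` is a hypothetical helper, the 1800-second window and 60-second interval are illustrative defaults taken from the 30-minute wait above, and the awk filter assumes the default `oc get co` column order.

```shell
# Poll cluster operators for a fixed window after the signing-key deletion.
# Succeeds (exit 0) only if the final check finds no operator that is
# Available=False or Degraded=True.
# observe_operators is a hypothetical helper name, not part of oc.
observe_operators() {
    duration_s=${1:-1800}   # step 2: observe for at least 30 minutes
    interval_s=${2:-60}     # time between checks
    elapsed=0
    bad=""
    while [ "$elapsed" -lt "$duration_s" ]; do
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
        # Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
        bad=$(oc get co --no-headers \
            | awk '$3 == "False" || $5 == "True" { print $1 }' \
            | tr '\n' ' ')
        echo "${elapsed}s: unhealthy operators: ${bad:-none}"
    done
    [ -z "$bad" ]
}

# Usage (requires a logged-in oc session with cluster-admin rights):
#   oc delete secret/signing-key -n openshift-service-ca
#   observe_operators 1800 60 || oc get co
```

Only the last check decides the result, because operators are expected to be disrupted temporarily while the rotation is in progress.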
      

      Actual results:

      console and monitoring are degraded and do not recover

      Expected results:

      all operators recover eventually, as in previous versions

      Additional info:

      The cluster can be recovered from this state by manually deleting all pods, as follows:
      # Delete all pods in every namespace, one namespace at a time, so that they are recreated.
      for I in $(oc get ns -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
          oc delete pods --all -n "$I"
          sleep 1
      done

       

      must-gather:
      https://drive.google.com/file/d/1Y3RrYZlz0EncG-Iqt8USFPsTd-br36Zt/view?usp=sharing

       

              rh-ee-jonjacks Jon Jackson
              evakhoni@redhat.com Evgeni Vakhonin
              Yanping Zhang Yanping Zhang