OpenShift Bugs / OCPBUGS-43581

[4.16.15] TLS Validation errors when collect-profiles access OLM metrics


    • Critical
    • None
    • Arbok OLM Sprint 261
    • 1
    • Rejected
    • False

      Description of problem:

      After upgrading to 4.16.15, OLM fails to upgrade operator ClusterServiceVersions because of a TLS validation error.
      
      The logs from the OLM operator pod show this:
      $ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
      "tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")"
      
      The kube-apiserver-operator logs also show that many webhooks are affected, with errors such as:
      $ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-8445495998-s6wgd | grep "failed to connect" | tail
      W1018 21:44:07.641047       1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
      W1018 21:44:08.647623       1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
      W1018 21:53:58.542660       1 degraded_webhook.go:147] failed to connect to webhook "clusterautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority 
      
      According to the OLM controller logs, this causes the OLM controller to hang, so operator installs and upgrades fail.
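
      To see at a glance which webhooks are affected, the warnings above can be grouped by webhook name. The following is a minimal sketch and not part of the original report: the function name `summarize_webhook_failures` is hypothetical, and it assumes the `failed to connect to webhook "<name>"` line format shown in the log excerpt above.

```shell
# Hypothetical helper: count "failed to connect to webhook" warnings per webhook.
# Reads kube-apiserver-operator log lines on stdin and prints "<count> <webhook>",
# one line per webhook, sorted by webhook name.
summarize_webhook_failures() {
  grep -o 'failed to connect to webhook "[^"]*"' \
    | cut -d'"' -f2 \
    | sort | uniq -c \
    | awk '{ print $1 " " $2 }'
}
```

      It could be used as `oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-8445495998-s6wgd | summarize_webhook_failures`, substituting the current pod name.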

       

      How reproducible:

          Reproduces consistently when upgrading from 4.16.14 to 4.16.15 on any OpenShift Dedicated (OSD) or ROSA cluster.

      Steps to Reproduce:

          1. Install an OSD or ROSA cluster at 4.16.14 or earlier
          2. Upgrade to 4.16.15
          3. Attempt to install or upgrade an operator via a new ClusterServiceVersion

      Actual results:

      # API server operator
      $ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-666b796d8b-lqp56 | grep "failed to connect" | tail
      W1013 20:59:49.131870       1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
      W1013 20:59:50.147945       1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
      
      # OLM
      $ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
      2024/10/13 12:00:08 http: TLS handshake error from 10.128.18.80:53006: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
      2024/10/14 11:45:05 http: TLS handshake error from 10.130.19.10:36766: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
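
      The OLM message "parent certificate cannot sign this kind of certificate" is the text Go's crypto/x509 package returns when the candidate issuer is not marked as a CA, or carries a key usage that excludes certificate signing. One way to check a candidate CA certificate for this is to inspect its extensions. The following is a sketch under assumptions, not a step from the original report: the function name `cert_can_sign` is hypothetical, the certificate is assumed to be already exported to a PEM file, and the check is a plain text match on `openssl x509 -text` output rather than a full chain validation.

```shell
# Hypothetical helper: reads `openssl x509 -noout -text` output on stdin and
# reports whether the certificate looks usable as a signing CA.
cert_can_sign() {
  _text=$(cat)
  # A signing parent needs basic constraints with CA:TRUE.
  if ! printf '%s\n' "$_text" | grep -q 'CA:TRUE'; then
    echo "not a CA (basic constraints missing or CA:FALSE)"
    return 1
  fi
  # Key Usage is optional; when present it must include Certificate Sign.
  if printf '%s\n' "$_text" | grep -q 'X509v3 Key Usage' \
     && ! printf '%s\n' "$_text" | grep -q 'Certificate Sign'; then
    echo "key usage present without Certificate Sign"
    return 1
  fi
  echo "can sign certificates"
}
```

      For example, `openssl x509 -in ca.pem -noout -text | cert_can_sign`, where `ca.pem` is a placeholder for whichever CA certificate is under suspicion.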

      Expected results:

          No TLS validation errors when upgrading the cluster or when installing or upgrading operators via OLM.

      Additional info:

          

            mradchuk@redhat.com Mikalai Radchuk
            drow.openshift.srep Dustin Row
            Jian Zhang Jian Zhang