- Bug
- Resolution: Unresolved
- Normal
- None
- 4.16.z
- Critical
- None
- Arbok OLM Sprint 261
- 1
- Rejected
- False
Description of problem:
Upon upgrade to 4.16.15, OLM fails to upgrade operator ClusterServiceVersions due to a TLS validation error. The OLM controller manager pod logs show this:

$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
"tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")"

The kube-apiserver-operator logs also show that many webhooks are affected, with the following errors:

$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-8445495998-s6wgd | grep "failed to connect" | tail
W1018 21:44:07.641047 1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:44:08.647623 1 degraded_webhook.go:147] failed to connect to webhook "machineautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority
W1018 21:53:58.542660 1 degraded_webhook.go:147] failed to connect to webhook "clusterautoscalers.autoscaling.openshift.io" via service "cluster-autoscaler-operator.openshift-machine-api.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority

Based on the OLM controller logs, this causes the OLM controller to hang and fail to install or upgrade operators.
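To gauge how many webhooks are affected, the "failed to connect" warnings can be grouped by webhook and service. The following is a minimal triage sketch, not part of the original report: `affected_webhooks` is a hypothetical helper name, and it assumes the log lines keep the `failed to connect to webhook "<name>" via service "<svc>"` shape shown in this ticket.

```shell
#!/bin/sh
# Hypothetical helper (not an oc subcommand): reads kube-apiserver-operator
# log lines on stdin and prints "<count> <webhook> <service>" for every
# webhook that logged a "failed to connect to webhook" warning.
# Assumption: the log format matches the lines quoted in this ticket.
affected_webhooks() {
  grep 'failed to connect to webhook' \
    | sed -n 's/.*failed to connect to webhook "\([^"]*\)" via service "\([^"]*\)".*/\1 \2/p' \
    | sort \
    | uniq -c \
    | awk '{print $1, $2, $3}'   # normalize the padding that uniq -c adds
}

# Example usage (assumes the operator Deployment is named kube-apiserver-operator):
#   oc logs -n openshift-kube-apiserver-operator deployment/kube-apiserver-operator | affected_webhooks
```

Grouping by webhook and service makes it easier to tell whether the failures are confined to a few services or spread across the cluster.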
How reproducible:
Very reproducible upon upgrade from 4.16.14 to 4.16.15 on any OpenShift Dedicated or ROSA OpenShift cluster.
Steps to Reproduce:
1. Install OSD or ROSA cluster at 4.16.14 or below
2. Upgrade to 4.16.15
3. Attempt to install or upgrade operator via new ClusterServiceVersion
Actual results:
# API SERVER OPERATOR
$ oc logs -n openshift-kube-apiserver-operator kube-apiserver-operator-666b796d8b-lqp56 | grep "failed to connect" | tail
W1013 20:59:49.131870 1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")
W1013 20:59:50.147945 1 degraded_webhook.go:147] failed to connect to webhook "webhook.pipeline.tekton.dev" via service "tekton-pipelines-webhook.openshift-pipelines.svc:443": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "tekton-pipelines-webhook.openshift-pipelines.svc")

# OLM
$ oc logs -n openshift-operator-lifecycle-manager olm-operator-7c9f76554-j22j5 | grep "tls" | head
2024/10/13 12:00:08 http: TLS handshake error from 10.128.18.80:53006: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
2024/10/14 11:45:05 http: TLS handshake error from 10.130.19.10:36766: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "Red Hat, Inc.")
Expected results:
No TLS validation errors upon upgrade or installation of operators via OLM.
Additional info:
- is blocked by
- OPRUN-3591 Impact statement request for OCPBUGS-43581 [4.16.15] TLS Validation Errors Upon Upgrade
- Closed
- links to
- RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update
- mentioned on