Details
Type: Bug
Resolution: Duplicate
Priority: Major
Affects Version: 4.12.0
Description
Cluster version: 4.12.0
After running a longevity test on a ZTP SNO cluster under moderate load for 30 days, we experienced random restarts of many containers on the cluster, as well as a temporarily unresponsive kube-apiserver.
See more details in https://issues.redhat.com/browse/OCPBUGS-10510
At some point the node was restarted in an attempt to recover the cluster.
After the reboot, the console ClusterOperator (CO) did not come up.
Running oc describe co console showed:
Name:         console
Namespace:
Labels:       <none>
Annotations:  capability.openshift.io/name: Console
              include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2023-02-14T23:44:31Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:capability.openshift.io/name:
          f:include.release.openshift.io/ibm-cloud-managed:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
        f:ownerReferences:
          .:
          k:{"uid":"0297348c-5756-4997-bfa9-ea68024b6351"}:
      f:spec:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2023-02-14T23:44:31Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:extension:
    Manager:      cluster-version-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-02-14T23:44:31Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:relatedObjects:
        f:versions:
    Manager:      console
    Operation:    Update
    Subresource:  status
    Time:         2023-03-17T21:51:26Z
  Owner References:
    API Version:  config.openshift.io/v1
    Kind:         ClusterVersion
    Name:         version
    UID:          0297348c-5756-4997-bfa9-ea68024b6351
  Resource Version:  12940156
  UID:               91f86953-433d-4d98-a0c5-17fc7fe40522
Spec:
Status:
  Conditions:
    Last Transition Time:  2023-03-17T20:33:40Z
    Message:               ConsoleNotificationSyncDegraded: Delete "https://172.30.0.1:443/apis/console.openshift.io/v1/consolenotifications/cluster-upgrade": net/http: TLS handshake timeout
                           RouteHealthDegraded: console route is not admitted
    Reason:                ConsoleNotificationSync_FailedDelete::RouteHealth_RouteNotAdmitted
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2023-03-17T21:26:05Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2023-03-17T21:51:26Z
    Message:               RouteHealthAvailable: console route is not admitted
    Reason:                RouteHealth_RouteNotAdmitted
    Status:                False
    Type:                  Available
    Last Transition Time:  2023-02-15T00:15:58Z
    Message:               All is well
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:  <nil>
  Related Objects:
    Group:      operator.openshift.io
    Name:       cluster
    Resource:   consoles
    Group:      config.openshift.io
    Name:       cluster
    Resource:   consoles
    Group:      config.openshift.io
    Name:       cluster
    Resource:   infrastructures
    Group:      config.openshift.io
    Name:       cluster
    Resource:   proxies
    Group:      config.openshift.io
    Name:       cluster
    Resource:   oauths
    Group:      oauth.openshift.io
    Name:       console
    Resource:   oauthclients
    Group:
    Name:       openshift-console-operator
    Resource:   namespaces
    Group:
    Name:       openshift-console
    Resource:   namespaces
    Group:
    Name:       console-public
    Namespace:  openshift-config-managed
    Resource:   configmaps
  Versions:
    Name:     operator
    Version:  4.12.0
Events:  <none>
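For anyone triaging similar symptoms: both the Degraded and Available conditions above point at route admission, which can be checked directly. A rough sketch using standard oc commands (only resource names already present in this report are assumed):

oc get co console
oc get route console -n openshift-console -o jsonpath='{.status.ingress[*].conditions[?(@.type=="Admitted")].status}'
oc get pods -n openshift-ingress

If the Admitted condition is missing or False, the router pods in openshift-ingress are the next place to look.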
The console pod logs showed the following:
oc logs -n openshift-console console-67f8b7674f-hxh8r
W0317 21:41:21.763514 1 main.go:227] Flag inactivity-timeout is set to less then 300 seconds and will be ignored!
I0317 21:41:21.763558 1 main.go:346] cookies are secure!
E0317 21:41:22.317033 1 auth.go:232] error contacting auth provider (retrying in 10s): Get "https://kubernetes.default.svc/.well-known/oauth-authorization-server": dial tcp: lookup kubernetes.default.svc on 172.30.0.10:53: read udp 10.128.0.193:59214->172.30.0.10:53: read: connection refused
E0317 21:41:37.319330 1 auth.go:232] error contacting auth provider (retrying in 10s): Get "https://kubernetes.default.svc/.well-known/oauth-authorization-server": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
E0317 21:41:48.013176 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com": dial tcp 10.19.134.5:443: connect: connection refused
E0317 21:41:58.158333 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com": dial tcp 10.19.134.5:443: connect: connection refused
E0317 21:42:10.079348 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com": dial tcp 10.19.134.5:443: connect: connection refused
E0317 21:42:21.232435 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com": dial tcp 10.19.134.5:443: connect: connection refused
E0317 21:42:32.731721 1 auth.go:232] error contacting auth provider (retrying in 10s): request to OAuth issuer endpoint https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com/oauth/token failed: Head "https://oauth-openshift.apps.qe2.kni.lab.eng.bos.redhat.com": dial tcp 10.19.134.5:443: connect: connection refused
I0317 21:42:43.018052 1 main.go:796] Binding to [::]:8443...
I0317 21:42:43.018152 1 main.go:798] using TLS
2023/03/17 21:49:28 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:28 http: proxy error: dial tcp 172.30.0.1:443: connect: connection refused
2023/03/17 21:49:28 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:31 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:31 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:33 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:33 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:37 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:38 Failed to dial backend: 'dial tcp 172.30.0.1:443: connect: connection refused'
2023/03/17 21:49:49 http: TLS handshake error from 10.128.0.2:36436: EOF
2023/03/17 21:49:49 http: TLS handshake error from 10.128.0.2:36442: read tcp 10.128.0.193:8443->10.128.0.2:36442: read: connection reset by peer
2023/03/17 21:50:11 http: proxy error: context canceled
2023/03/17 21:50:11 http: proxy error: context canceled
2023/03/17 21:50:11 http: proxy error: context canceled
2023/03/17 21:50:11 http: proxy error: context canceled
2023/03/17 21:50:32 http: proxy error: context canceled
2023/03/17 21:50:42 http: proxy error: context canceled
2023/03/17 21:50:42 http: proxy error: context canceled
2023/03/17 21:50:42 http: proxy error: context canceled
2023/03/17 21:51:19 http: TLS handshake error from 10.128.0.2:46060: EOF
2023/03/17 21:51:19 http: TLS handshake error from 10.128.0.2:46074: EOF
2023/03/17 21:51:59 http: TLS handshake error from 10.128.0.2:35182: EOF
2023/03/17 21:52:00 http: TLS handshake error from 10.128.0.2:35194: EOF
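The pattern above (DNS lookups against 172.30.0.10:53 refused, dials to the in-cluster API service 172.30.0.1:443 refused, then TLS handshake and proxy errors) suggests the node-local DNS and kube-apiserver endpoints were unhealthy after the reboot, not the console itself. A rough sanity check for that, assuming only the default OpenShift namespaces:

oc get pods -n openshift-dns -o wide
oc get pods -n openshift-kube-apiserver
oc get co dns ingress authentication console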