-
Bug
-
Resolution: Done
-
Critical
-
4.13.z
-
None
-
Critical
-
No
-
Proposed
-
False
-
Description of problem:
On 4.13.0-0.nightly-2023-07-26-101700 (which is the first 4.13.z nightly with FIPS wrapper enabled) a HyperShift managed cluster shows many pods CrashLooping with errors trying to reach the API server.

NAMESPACE NAME READY STATUS RESTARTS AGE
clusters-hypershift-ci-3983 capi-provider-548db9947d-m8cdh 1/2 CrashLoopBackOff 21 (4m36s ago) 74m
clusters-hypershift-ci-3983 catalog-operator-64568cd87c-tsrdz 1/2 CrashLoopBackOff 23 (84s ago) 71m
clusters-hypershift-ci-3983 cluster-node-tuning-operator-6f64d8b868-v9gsx 0/1 Error 10 (73s ago) 71m
clusters-hypershift-ci-3983 control-plane-operator-dfbb565d4-zth66 1/2 CrashLoopBackOff 16 (23s ago) 74m
clusters-hypershift-ci-3983 hosted-cluster-config-operator-645f485699-j9dlj 0/1 CrashLoopBackOff 13 (74s ago) 71m
clusters-hypershift-ci-3983 kube-controller-manager-7bdb568cdf-5wzmm 1/2 Error 4 (55s ago) 16m
clusters-hypershift-ci-3983 kube-controller-manager-7bdb568cdf-qbrcz 1/2 CrashLoopBackOff 7 (45s ago) 13m
clusters-hypershift-ci-3983 oauth-openshift-679f564bcd-k6vvj 1/2 CrashLoopBackOff 8 (106s ago) 60m
clusters-hypershift-ci-3983 olm-operator-868797cf6b-wgd6w 1/2 CrashLoopBackOff 20 (2m23s ago) 71m
clusters-hypershift-ci-3983 openshift-apiserver-7f67974fb-c4s7s 2/3 CrashLoopBackOff 4 (17s ago) 16m
clusters-hypershift-ci-3983 openshift-oauth-apiserver-76d796c4c6-bdrxq 1/2 CrashLoopBackOff 20 (2m19s ago) 71m
clusters-hypershift-ci-3983 openshift-oauth-apiserver-76d796c4c6-nmfbd 1/2 CrashLoopBackOff 18 (3m46s ago) 71m
clusters-hypershift-ci-3983 packageserver-6cbc459575-c8klz 1/2 CrashLoopBackOff 15 (73s ago) 71m
hypershift operator-54f77b9766-dln4p 0/1 Error 9 (68s ago) 75m
openshift-console downloads-75f7f4c67c-j9zvf 0/1 CrashLoopBackOff 23 (2m14s ago) 90m
openshift-image-registry image-registry-58d8dc4f-bj4mg 0/1 CrashLoopBackOff 15 (85s ago) 89m
openshift-ingress router-default-7999c48dd6-kvnr6 0/1 CrashLoopBackOff 18 (95s ago) 75m
openshift-monitoring prometheus-adapter-5d77d8cfb5-52dpq 0/1 CrashLoopBackOff 20 (2m17s ago) 91m

Examples:

openshift-ovn-kubernetes/pods/ovnkube-node-ng767/kube-rbac-proxy-ovn-metrics/kube-rbac-proxy-ovn-metrics/logs/current.log:2023-07-26T19:54:50.636915874Z E0726 19:54:50.634277 3054 auth.go:47] Unable to authenticate the request due to an error: Post "https://172.30.0.1:443/apis/authentication.k8s.io/v1/tokenreviews": context deadline exceeded

openshift-cloud-network-config-controller/pods/cloud-network-config-controller-6d9d577f59-98dxh/controller/controller/logs/current.log:2023-07-26T17:50:07.380904032Z E0726 17:50:07.380510 1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.zhozhanghyp4131b.qe.devcluster.openshift.com:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp: lookup api-int.zhozhanghyp4131b.qe.devcluster.openshift.com on 172.30.0.10:53: read udp 10.128.0.35:46569->172.30.0.10:53: read: connection refused

openshift-ingress/pods/router-default-7999c48dd6-kvnr6/router/router/logs/current.log:2023-07-26T19:52:56.368183176Z E0726 19:52:56.368172 1 reflector.go:140] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://172.30.0.1:443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

kube-controller-manager log:

W0726 22:07:54.887716 1 feature_gate.go:227] unrecognized feature gate: OpenShiftPodSecurityAdmission
E0726 22:08:25.163085 1 run.go:74] "command failed" err="unable to load configmap based request-header-client-ca-file: Get \"https://kube-apiserver:6443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication\": dial tcp: lookup kube-apiserver: i/o timeout"

oc adm inspect and oc adm must-gather location in the comments.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install HyperShift managed cluster with 1 guest cluster (will put the QE Flexy install job in the comments)
2. Successful install completion
3.
Actual results:
Many pods crashloop with messages about being unable to reach the API. ClusterOperators go into and out of error states. Commands fail with various errors when trying to list resource types.
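For anyone triaging a similar cluster, the following is a minimal sketch of how the failing pods could be pulled out of a saved pod listing. It assumes the listing is the plain-text output of `oc get pods -A` with the column order NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE; the names `PodRow` and `unhealthy_pods` are hypothetical helpers for illustration and are not part of any OpenShift tooling.

```python
from typing import List, NamedTuple


class PodRow(NamedTuple):
    """One row of a pod listing (hypothetical helper type)."""
    namespace: str
    name: str
    ready: str
    status: str
    restarts: int


def unhealthy_pods(listing: str) -> List[PodRow]:
    """Return rows whose STATUS is CrashLoopBackOff or Error.

    Assumes whitespace-separated columns in the order
    NAMESPACE NAME READY STATUS RESTARTS AGE. The RESTARTS column may
    carry a suffix such as "(4m36s ago)", so only its first token,
    the restart count, is read.
    """
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status, restarts = fields[:5]
        if status in ("CrashLoopBackOff", "Error"):
            rows.append(PodRow(namespace, name, ready, status, int(restarts)))
    return rows


if __name__ == "__main__":
    # Two rows copied from the listing in this report.
    sample = (
        "NAMESPACE NAME READY STATUS RESTARTS AGE\n"
        "hypershift operator-54f77b9766-dln4p 0/1 Error 9 (68s ago) 75m\n"
        "openshift-ingress router-default-7999c48dd6-kvnr6 0/1 "
        "CrashLoopBackOff 18 (95s ago) 75m\n"
    )
    for row in unhealthy_pods(sample):
        print(row.namespace, row.name, row.status, row.restarts)
```

Sorting the returned rows by `restarts` may help show which components started failing first, though restart counts alone do not identify the root cause.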
Expected results:
Additional info:
- blocks
-
OCPBUGS-17120 [release-4.13] 4.13.z FIPS build - HyperShift - pods crashlooping with API connection failures
- Closed
- causes
-
SDN-4188 Examine hypershift network policies and potential scale problems
- Closed
- is cloned by
-
OCPBUGS-17120 [release-4.13] 4.13.z FIPS build - HyperShift - pods crashlooping with API connection failures
- Closed
- links to