-
Bug
-
Resolution: Done
-
Major
-
None
-
4.12
-
Moderate
-
None
-
5
-
OTA 226, OTA 227
-
2
-
Rejected
-
False
-
Description of problem:
cluster-version-operator pod crashloop during the bootstrap process might be leading to a longer bootstrap process causing the installer to timeout and fail. The cluster-version-operator pod is continuously restarting due to a go panic. The bootstrap process fails due to the timeout although it completes the process correctly after more time, once the cluster-version-operator pod runs correctly. $ oc -n openshift-cluster-version logs -p cluster-version-operator-754498df8b-5gll8 I0919 10:25:05.790124 1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4 F0919 10:25:05.791580 1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused goroutine 1 [running]: k8s.io/klog/v2.stacks(0x1) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc00017d5e0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686 k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc000089140, 0x1, ...}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2 k8s.io/klog/v2.(*loggingT).printf(...) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612 k8s.io/klog/v2.Fatalf(...) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516 main.init.3.func1(0xc00012ac80?, {0x1b96f60?, 0x6?, 0x6?}) /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6 github.com/spf13/cobra.(*Command).execute(0xc00012ac80, {0xc0002fea20, 0x6, 0x6}) /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663 github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0) /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4 github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902 main.main() /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-18-234318
How reproducible:
Most of the times, with any network type and installation type (IPI, UPI and proxy).
Steps to Reproduce:
1. Install OCP 4.12 IPI $ openshift-install create cluster 2. Wait until bootstrap is completed
Actual results:
[...] level=error msg="Bootstrap failed to complete: timed out waiting for the condition" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
NAMESPACE NAME READY STATUS RESTARTS AGE openshift-cluster-version cluster-version-operator-754498df8b-5gll8 0/1 CrashLoopBackOff 7 (3m21s ago) 24m openshift-image-registry image-registry-94fd8b75c-djbxb 0/1 Pending 0 6m44s openshift-image-registry image-registry-94fd8b75c-ft66c 0/1 Pending 0 6m44s openshift-ingress router-default-64fbb749b4-cmqgw 0/1 Pending 0 13m openshift-ingress router-default-64fbb749b4-mhtqx 0/1 Pending 0 13m openshift-monitoring prometheus-operator-admission-webhook-6d8cb95cf7-6jn5q 0/1 Pending 0 14m openshift-monitoring prometheus-operator-admission-webhook-6d8cb95cf7-r6nnk 0/1 Pending 0 14m openshift-network-diagnostics network-check-source-8758bd6fc-vzf5k 0/1 Pending 0 18m openshift-operator-lifecycle-manager collect-profiles-27726375-hlq89 0/1 Pending 0 21m
$ oc -n openshift-cluster-version describe pod cluster-version-operator-754498df8b-5gll8 Name: cluster-version-operator-754498df8b-5gll8 Namespace: openshift-cluster-version Priority: 2000000000 Priority Class Name: system-cluster-critical Node: ostest-4gtwr-master-1/10.196.0.68 Start Time: Mon, 19 Sep 2022 10:17:41 +0000 Labels: k8s-app=cluster-version-operator pod-template-hash=754498df8b Annotations: openshift.io/scc: hostaccess Status: Running IP: 10.196.0.68 IPs: IP: 10.196.0.68 Controlled By: ReplicaSet/cluster-version-operator-754498df8b Containers: cluster-version-operator: Container ID: cri-o://1e2879600c89baabaca68c1d4d0a563d4b664c507f0617988cbf9ea7437f0b27 Image: registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69 Image ID: registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69 Port: <none> Host Port: <none> Args: start --release-image=registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69 --enable-auto-update=false --listen=0.0.0.0:9099 --serving-cert-file=/etc/tls/serving-cert/tls.crt --serving-key-file=/etc/tls/serving-cert/tls.key --v=2 State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Message: I0919 10:33:07.798614 1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4 F0919 10:33:07.800115 1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused goroutine 1 [running]: [43/497] k8s.io/klog/v2.stacks(0x1) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc000433ea0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686 k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc0002d6630, 0x1, ...}) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2 k8s.io/klog/v2.(*loggingT).printf(...) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612 k8s.io/klog/v2.Fatalf(...) /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516 main.init.3.func1(0xc0003b4f00?, {0x1b96f60?, 0x6?, 0x6?}) /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6 github.com/spf13/cobra.(*Command).execute(0xc0003b4f00, {0xc000311980, 0x6, 0x6}) /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663 github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0) /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4 github.com/spf13/cobra.(*Command).Execute(...) /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902 main.main() /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46 Exit Code: 255 Started: Mon, 19 Sep 2022 10:33:07 +0000 Finished: Mon, 19 Sep 2022 10:33:07 +0000 Ready: False Restart Count: 7 Requests: cpu: 20m memory: 50Mi Environment: KUBERNETES_SERVICE_PORT: 6443 KUBERNETES_SERVICE_HOST: 127.0.0.1 NODE_NAME: (v1:spec.nodeName) CLUSTER_PROFILE: self-managed-high-availability Mounts: /etc/cvo/updatepayloads from etc-cvo-updatepayloads (ro) /etc/ssl/certs from etc-ssl-certs (ro) /etc/tls/service-ca from service-ca (ro) /etc/tls/serving-cert from serving-cert (ro) /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro) onditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: etc-ssl-certs: Type: HostPath (bare host directory volume) Path: /etc/ssl/certs HostPathType: etc-cvo-updatepayloads: Type: HostPath (bare host directory volume) Path: /etc/cvo/updatepayloads HostPathType: serving-cert: Type: Secret (a volume populated by a Secret) SecretName: cluster-version-operator-serving-cert Optional: false service-ca: Type: ConfigMap (a volume populated by a ConfigMap) Name: openshift-service-ca.crt Optional: false kube-api-access: Type: Projected (a volume that contains injected data from multiple sources) TokenExpirationSeconds: 3600 ConfigMapName: kube-root-ca.crt ConfigMapOptional: <nil> DownwardAPI: true QoS Class: Burstable Node-Selectors: node-role.kubernetes.io/master= Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists node.kubernetes.io/memory-pressure:NoSchedule op=Exists node.kubernetes.io/network-unavailable:NoSchedule op=Exists node.kubernetes.io/not-ready:NoSchedule op=Exists node.kubernetes.io/not-ready:NoExecute op=Exists for 120s node.kubernetes.io/unreachable:NoExecute op=Exists for 120s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning FailedScheduling 25m default-scheduler no nodes available to schedule pods Warning FailedScheduling 21m default-scheduler 0/2 nodes are available: 2 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/2 nodes are available: 2 Preemption is no t helpful for scheduling. Normal Scheduled 19m default-scheduler Successfully assigned openshift-cluster-version/cluster-version-operator-754498df8b-5gll8 to ostest-4gtwr-master-1 by ostest-4gtwr-bootstrap Warning FailedMount 17m kubelet Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[service-ca kube-api-access etc-ssl-certs etc-cvo-updatepayloads serving-cert]: timed out waiting for the condition Warning FailedMount 17m (x9 over 19m) kubelet MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found Normal Pulling 15m kubelet Pulling image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" Normal Pulled 15m kubelet Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" in 7.481824271s Normal Started 14m (x3 over 15m) kubelet Started container cluster-version-operator Normal Created 14m (x4 over 15m) kubelet Created container cluster-version-operator Normal Pulled 14m (x3 over 15m) kubelet Container image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" already present on machine Warning BackOff 4m22s (x52 over 15m) kubelet Back-off restarting failed container
Expected results:
No panic?
Additional info:
Seen in most of OCP on OSP QE CI jobs.
Attached must-gather-install.tar.gz
- is cloned by
-
OCPBUGS-3770 cvo pod crashloop during bootstrap: featuregates: connection refused
- Closed
- is depended on by
-
OCPBUGS-3770 cvo pod crashloop during bootstrap: featuregates: connection refused
- Closed
- links to