Bug
Resolution: Done
Major
4.12
Quality / Stability / Reliability
False
Moderate
Rejected
OTA 227
1
Description of problem:
A cluster-version-operator pod crashloop during the bootstrap process can lengthen bootstrap enough for the installer to time out and fail.
The cluster-version-operator pod restarts continuously due to a Go panic. The bootstrap process fails with a timeout, although it does complete correctly given more time, once the cluster-version-operator pod runs properly.
$ oc -n openshift-cluster-version logs -p cluster-version-operator-754498df8b-5gll8
I0919 10:25:05.790124 1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
F0919 10:25:05.791580 1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc00017d5e0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc000089140, 0x1, ...})
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc00012ac80?, {0x1b96f60?, 0x6?, 0x6?})
/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc00012ac80, {0xc0002fea20, 0x6, 0x6})
/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-09-18-234318
How reproducible:
Most of the time, with any network type and installation type (IPI, UPI and proxy).
Steps to Reproduce:
1. Install OCP 4.12 IPI: $ openshift-install create cluster
2. Wait until bootstrap is completed
Actual results:
[...]
level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
NAMESPACE                              NAME                                                     READY   STATUS             RESTARTS        AGE
openshift-cluster-version              cluster-version-operator-754498df8b-5gll8                0/1     CrashLoopBackOff   7 (3m21s ago)   24m
openshift-image-registry               image-registry-94fd8b75c-djbxb                           0/1     Pending            0               6m44s
openshift-image-registry               image-registry-94fd8b75c-ft66c                           0/1     Pending            0               6m44s
openshift-ingress                      router-default-64fbb749b4-cmqgw                          0/1     Pending            0               13m
openshift-ingress                      router-default-64fbb749b4-mhtqx                          0/1     Pending            0               13m
openshift-monitoring                   prometheus-operator-admission-webhook-6d8cb95cf7-6jn5q   0/1     Pending            0               14m
openshift-monitoring                   prometheus-operator-admission-webhook-6d8cb95cf7-r6nnk   0/1     Pending            0               14m
openshift-network-diagnostics          network-check-source-8758bd6fc-vzf5k                     0/1     Pending            0               18m
openshift-operator-lifecycle-manager   collect-profiles-27726375-hlq89                          0/1     Pending            0               21m
$ oc -n openshift-cluster-version describe pod cluster-version-operator-754498df8b-5gll8
Name: cluster-version-operator-754498df8b-5gll8
Namespace: openshift-cluster-version
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: ostest-4gtwr-master-1/10.196.0.68
Start Time: Mon, 19 Sep 2022 10:17:41 +0000
Labels: k8s-app=cluster-version-operator
pod-template-hash=754498df8b
Annotations: openshift.io/scc: hostaccess
Status: Running
IP: 10.196.0.68
IPs:
IP: 10.196.0.68
Controlled By: ReplicaSet/cluster-version-operator-754498df8b
Containers:
cluster-version-operator:
Container ID: cri-o://1e2879600c89baabaca68c1d4d0a563d4b664c507f0617988cbf9ea7437f0b27
Image: registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
Image ID: registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
Port: <none>
Host Port: <none>
Args:
start
--release-image=registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
--enable-auto-update=false
--listen=0.0.0.0:9099
--serving-cert-file=/etc/tls/serving-cert/tls.crt
--serving-key-file=/etc/tls/serving-cert/tls.key
--v=2
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: I0919 10:33:07.798614 1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
F0919 10:33:07.800115 1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
goroutine 1 [running]: [43/497]
k8s.io/klog/v2.stacks(0x1)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc000433ea0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc0002d6630, 0x1, ...})
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
k8s.io/klog/v2.(*loggingT).printf(...)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
k8s.io/klog/v2.Fatalf(...)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
main.init.3.func1(0xc0003b4f00?, {0x1b96f60?, 0x6?, 0x6?})
/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
github.com/spf13/cobra.(*Command).execute(0xc0003b4f00, {0xc000311980, 0x6, 0x6})
/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
main.main()
/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46 Exit Code: 255
Started: Mon, 19 Sep 2022 10:33:07 +0000
Finished: Mon, 19 Sep 2022 10:33:07 +0000
Ready: False
Restart Count: 7
Requests:
cpu: 20m
memory: 50Mi
Environment:
KUBERNETES_SERVICE_PORT: 6443
KUBERNETES_SERVICE_HOST: 127.0.0.1
NODE_NAME: (v1:spec.nodeName)
CLUSTER_PROFILE: self-managed-high-availability
Mounts:
/etc/cvo/updatepayloads from etc-cvo-updatepayloads (ro)
/etc/ssl/certs from etc-ssl-certs (ro)
/etc/tls/service-ca from service-ca (ro)
/etc/tls/serving-cert from serving-cert (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
etc-ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType:
etc-cvo-updatepayloads:
Type: HostPath (bare host directory volume)
Path: /etc/cvo/updatepayloads
HostPathType:
serving-cert:
Type: Secret (a volume populated by a Secret)
SecretName: cluster-version-operator-serving-cert
Optional: false
service-ca:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: openshift-service-ca.crt
Optional: false
kube-api-access:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3600
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 25m default-scheduler no nodes available to schedule pods
Warning FailedScheduling 21m default-scheduler 0/2 nodes are available: 2 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
Normal Scheduled 19m default-scheduler Successfully assigned openshift-cluster-version/cluster-version-operator-754498df8b-5gll8 to ostest-4gtwr-master-1 by ostest-4gtwr-bootstrap
Warning FailedMount 17m kubelet Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[service-ca kube-api-access etc-ssl-certs etc-cvo-updatepayloads serving-cert]:
timed out waiting for the condition
Warning FailedMount 17m (x9 over 19m) kubelet MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
Normal Pulling 15m kubelet Pulling image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69"
Normal Pulled 15m kubelet Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" in 7.481824271s
Normal Started 14m (x3 over 15m) kubelet Started container cluster-version-operator
Normal Created 14m (x4 over 15m) kubelet Created container cluster-version-operator
Normal Pulled 14m (x3 over 15m) kubelet Container image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" already present on machine
Warning BackOff 4m22s (x52 over 15m) kubelet Back-off restarting failed container
Expected results:
No panic. The cluster-version-operator should tolerate the API server being temporarily unreachable during bootstrap instead of crashlooping, so the installer does not time out.
Additional info:
Seen in most of OCP on OSP QE CI jobs.
Attached [^must-gather-install.tar.gz]
clones: OCPBUGS-1458 cvo pod crashloop during bootstrap: featuregates: connection refused (Closed)
depends on: OCPBUGS-1458 cvo pod crashloop during bootstrap: featuregates: connection refused (Closed)
links to