OpenShift Bugs / OCPBUGS-23063

Kube-apiserver is going into CLBO state while bootstrapping on ABI



      Description of problem:

      While doing an agent-based install, the cluster is taking too long (approximately 20 hours) to install.
      
      It is a 4.12.23 ABI install.
      
      Here are our observations:
      
      a) We observed that the OS is installed on all 3 servers, but only the master-0 (rendezvous host) node is accessible through SSH. The other nodes are pingable, though.
      
      b) In the failed state, the API VIP and Ingress VIP addresses were not configured on the main interface of the master-0 (rendezvous host) node.
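
      As a diagnostic sketch (the interface name and VIP values below are placeholders, not taken from this environment), the VIP assignment can be checked directly on the rendezvous host:

      # On master-0 (rendezvous host); <iface>, <api-vip> and <ingress-vip> are placeholders
      ip -4 addr show
      ip -4 addr show <iface> | grep -E '<api-vip>|<ingress-vip>'
      # On on-prem platforms the VIPs are normally held by keepalived; confirm its container is running
      sudo crictl ps -a | grep keepalived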
      
      c) We also observed in the boot-up journal log of master-0 (rendezvous host) that the bootstrap-kube-controller-manager pod failed to start with a CrashLoopBackOff error, as shown below:
      Oct 30 23:46:35 openshift-master-0 kubelet.sh[11122]: E1030 23:46:35.281509   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
      Oct 30 23:46:36 openshift-master-0 kubelet.sh[11122]: E1030 23:46:36.283747   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
      Oct 30 23:46:37 openshift-master-0 kubelet.sh[11122]: E1030 23:46:37.284910   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
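
      To see the underlying failure rather than just the back-off messages, the crashing container's own logs can be pulled on the rendezvous host. For example (a sketch, not output from this system; the container ID is whatever the first command reports):

      # On master-0 (rendezvous host)
      sudo crictl ps -a | grep kube-controller-manager    # find the failing container ID
      sudo crictl logs --tail 100 <container-id>
      journalctl -b --no-pager | grep CrashLoopBackOff | tail -n 20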
      
      d) We also observed that the following pods continuously failed to start with CrashLoopBackOff errors on master-0 (rendezvous host):
      Oct 31 01:42:32 openshift-master-0 kubelet.sh[11122]: E1031 01:42:32.210639   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-version-operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-version-operator pod=bootstrap-cluster-version-operator-openshift-master-0_openshift-cluster-version(1fe02b9e38781cc99a4ffe0e9086726b)\"" pod="openshift-cluster-version/bootstrap-cluster-version-operator-openshift-master-0" podUID=1fe02b9e38781cc99a4ffe0e9086726b
      Oct 31 01:42:33 openshift-master-0 kubelet.sh[11122]: E1031 01:42:33.211270   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-apiserver pod=bootstrap-kube-apiserver-openshift-master-0_openshift-kube-apiserver(f43d4c9183dea070316db1eeec8ee359)\"" pod="openshift-kube-apiserver/bootstrap-kube-apiserver-openshift-master-0" podUID=f43d4c9183dea070316db1eeec8ee359
      Oct 31 01:42:44 openshift-master-0 kubelet.sh[11122]: E1031 01:42:44.211319   11122 pod_workers.go:965] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\", failed to \"StartContainer\" for \"cluster-policy-controller\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-policy-controller pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"]" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
      Oct 31 01:42:45 openshift-master-0 kubelet.sh[11122]: E1031 01:42:45.211215   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-version-operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-version-operator pod=bootstrap-cluster-version-operator-openshift-master-0_openshift-cluster-version(1fe02b9e38781cc99a4ffe0e9086726b)\"" pod="openshift-cluster-version/bootstrap-cluster-version-operator-openshift-master-0" podUID=1fe02b9e38781cc99a4ffe0e9086726b
      Oct 31 01:42:56 openshift-master-0 kubelet.sh[11122]: E1031 01:42:56.211955   11122 pod_workers.go:965] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\", failed to \"StartContainer\" for \"cluster-policy-controller\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-policy-controller pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"]" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
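
      Since the bootstrap kube-apiserver is crash-looping as well, generic checks on the rendezvous host would be whether the bootstrap etcd container is running, whether anything is listening on the API and etcd ports, and what bootkube reports. These are illustrative checks only, not a confirmed root cause:

      # On master-0 (rendezvous host)
      sudo crictl ps -a | grep etcd                  # bootstrap etcd should be Running
      sudo ss -tlnp | grep -E ':6443|:2379'          # kube-apiserver / etcd listeners
      sudo crictl ps -a | grep kube-apiserver        # then: sudo crictl logs <container-id>
      journalctl -u bootkube.service --no-pager | tail -n 50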
      
      e) We have also observed previously that the cluster recovers on its own after approximately 16+ hours.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Always at the customer's end

      Steps to Reproduce:

      1. Create the agent-based ISO.
      2. Boot the systems with the ISO.
      3. Observe the installation, which eventually fails (see the example monitoring commands after this list).
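
      For step 3, progress is typically monitored from the workstation holding the install assets; a minimal example, where <assets-dir> is a placeholder for the directory used to generate the ISO:

      openshift-install agent wait-for bootstrap-complete --dir <assets-dir> --log-level=debug
      openshift-install agent wait-for install-complete --dir <assets-dir> --log-level=debug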
      

      Actual results:

      The installation takes 16+ hours to recover and complete, and even then the cluster is unstable.

      Expected results:

      The installation should not fail, and it should not take this much time.

      Additional info:

       

       
