-
Bug
-
Resolution: Duplicate
-
Normal
-
None
-
4.12
-
Quality / Stability / Reliability
Description of problem:
Requesting help with the root cause analysis (RCA) of a failed ABI installation on baremetal hardware. The installation fails with a timeout (1h). Some issues have been spotted, but we are still trying to reconstruct the full picture. It is a disconnected environment and the data is anonymized. The workaround was to install 4.14.10.
Version-Release number of selected component (if applicable):
4.12.28
How reproducible:
The installation fails intermittently: most of the time, but not always.
Steps to Reproduce:
1.
2.
3.
Actual results:
The cluster fails to install most of the time.
Expected results:
Cluster installation succeeds
Additional info:
- The installation period was from:
install_started_at: 2024-02-14 09:08:24.617000+00:00
install_completed_at: 2024-02-14 10:15:06.758000+00:00
- The bootstrap was still active when the installation failed, and masters 1 and 2 still reported installation in progress:
{'current_stage': 'Configuring', 'installation_percentage': 71, 'stage_started_at': '2024-02-14T09:48:27.164Z', 'stage_updated_at': '2024-02-14T09:48:27.164Z'}
- One of the masters was clearly behind in number of pods:
$ for i in sosreport-host0-2024-02-14-*; do echo $i; ls -l $i/var/log/pods | wc -l; done
sosreport-host0-2024-02-14-lggtczy
79
sosreport-host0-2024-02-14-nnkxviw
56
sosreport-host0-2024-02-14-zwqlnpj
4
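Note that `ls -l` on a directory prints a leading `total` line, so each count above includes that extra line. A minimal Python sketch that counts only the pod log directories is shown here. It assumes the sosreports are extracted in the current directory under the same glob as above, and the function name `count_pod_dirs` is illustrative:

```python
from pathlib import Path


def count_pod_dirs(sosreport_root: Path) -> int:
    """Count pod log directories under <sosreport>/var/log/pods."""
    pods_dir = sosreport_root / "var" / "log" / "pods"
    if not pods_dir.is_dir():
        # Missing directory (or a non-extracted archive): report zero pods.
        return 0
    return sum(1 for entry in pods_dir.iterdir() if entry.is_dir())


if __name__ == "__main__":
    # Same glob as the shell loop above; adjust to the local layout.
    for root in sorted(Path(".").glob("sosreport-host0-2024-02-14-*")):
        print(root.name, count_pod_dirs(root))
```

The relative difference between the hosts stays the same either way; only the absolute numbers shift.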
- On the master with only 4 pods, the OOM killer is invoked repeatedly:
$ grep "invoked oom-killer" journalctl_--no-pager
Feb 14 09:11:28 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:11:30 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:11:44 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:12:13 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:12:55 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:14:23 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:17:15 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:22:18 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:27:29 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:32:40 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:37:52 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
Feb 14 09:43:03 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
This doesn't happen on the other two hosts.
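The spacing of these OOM events may be worth a look. A small Python sketch that computes the gap between consecutive `invoked oom-killer` lines is given below. It assumes the journal file name from the grep above, an English locale for the month abbreviation, and the year 2024 (the journal lines omit it; the value is taken from the install timestamps). The function name `oom_gaps_seconds` is illustrative:

```python
from datetime import datetime

ASSUMED_YEAR = 2024  # assumption: journal lines carry no year


def oom_gaps_seconds(journal_lines):
    """Return the seconds between consecutive 'invoked oom-killer' lines."""
    stamps = []
    for line in journal_lines:
        if "invoked oom-killer" not in line:
            continue
        # The first three fields look like "Feb 14 09:11:28".
        month, day, clock = line.split()[:3]
        stamps.append(
            datetime.strptime(
                f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
            )
        )
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    with open("journalctl_--no-pager", errors="replace") as fh:
        print(oom_gaps_seconds(fh))
```

For the twelve lines above, the gaps are 2, 14, 29, 42, 88, 172, 303, 311, 311, 312 and 311 seconds. They grow and then level off at roughly five minutes, which would be consistent with a container restart back-off, although the journal excerpt alone does not confirm that.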
- The pod finally manages to start and bootkube.sh kicks in:
Feb 14 09:53:11 host0 bootkube.sh[60437]: Starting cluster-bootstrap...
Feb 14 09:53:11 host0 bootkube.sh[63194]: Starting temporary bootstrap control plane...
Feb 14 09:53:11 host0 bootkube.sh[63194]: Waiting up to 20m0s for the Kubernetes API
Even so, the cluster did not finish installing in time.
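To put the timestamps above on one timeline, a rough Python sketch follows. It assumes the host journal is in UTC like the installer timestamps, which the report does not state, and the variable names are illustrative:

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# From install_started_at / install_completed_at above.
install_started = datetime(2024, 2, 14, 9, 8, 24, 617000, tzinfo=UTC)
install_ended = datetime(2024, 2, 14, 10, 15, 6, 758000, tzinfo=UTC)

# Assumption: the journal line "Feb 14 09:53:11" is UTC and from 2024.
bootstrap_cp_started = datetime(2024, 2, 14, 9, 53, 11, tzinfo=UTC)

# From "Waiting up to 20m0s for the Kubernetes API".
api_wait = timedelta(minutes=20)

total = install_ended - install_started
before_bootstrap = bootstrap_cp_started - install_started
left_after_bootstrap = install_ended - bootstrap_cp_started

print("install window:            ", total)
print("elapsed before bootkube:   ", before_bootstrap)
print("remaining after bootkube:  ", left_after_bootstrap)
print("API wait deadline:         ", bootstrap_cp_started + api_wait)
```

Under that assumption, roughly 45 of the 67 minutes had already passed before the temporary bootstrap control plane started, leaving about 22 minutes for the rest of the installation.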
- duplicates: OCPBUGS-30635 [OCP 4.12] Baremetal installation failling sometimes with timeout (Closed)