-
Bug
-
Resolution: Duplicate
-
Normal
-
None
-
4.12
-
No
-
False
-
Description of problem:
Requesting help with the RCA of a failed ABI installation on baremetal hardware. The installation fails with a timeout (1h). Some issues have been spotted, but we are still trying to reconstruct the full sequence of events. It is a disconnected environment and the data is anonymized. The workaround was to install 4.14.10.
Version-Release number of selected component (if applicable):
4.12.28
How reproducible:
The installation fails intermittently: most of the time, but not always.
Steps to Reproduce:
1. 2. 3.
Actual results:
The cluster fails to install most of the time.
Expected results:
Cluster installation succeeds
Additional info:
- The installation period was:
    install_started_at: 2024-02-14 09:08:24.617000+00:00
    install_completed_at: 2024-02-14 10:15:06.758000+00:00
- The bootstrap was still active when the installation failed; masters 1 and 2 still reported installation in progress:
    {'current_stage': 'Configuring', 'installation_percentage': 71, 'stage_started_at': '2024-02-14T09:48:27.164Z', 'stage_updated_at': '2024-02-14T09:48:27.164Z'}
- One of the masters was clearly behind in the number of pods:
    $ for i in sosreport-host0-2024-02-14-*; do echo $i; ls -l $i/var/log/pods | wc -l; done
    sosreport-host0-2024-02-14-lggtczy
    79
    sosreport-host0-2024-02-14-nnkxviw
    56
    sosreport-host0-2024-02-14-zwqlnpj
    4
- On the master with only 4 pods, the OOM killer is invoked repeatedly:
    $ grep "invoked oom-killer" journalctl_--no-pager
    Feb 14 09:11:28 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:11:30 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:11:44 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:12:13 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:12:55 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:14:23 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:17:15 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:22:18 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:27:29 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:32:40 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:37:52 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
    Feb 14 09:43:03 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
  This does not happen on the other two hosts.
- The pod finally manages to start and bootkube.sh kicks in:
    Feb 14 09:53:11 host0 bootkube.sh[60437]: Starting cluster-bootstrap...
    Feb 14 09:53:11 host0 bootkube.sh[63194]: Starting temporary bootstrap control plane...
    Feb 14 09:53:11 host0 bootkube.sh[63194]: Waiting up to 20m0s for the Kubernetes API
  But the cluster still did not finish installing in time.
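To make the timing of the OOM events easier to read, the following is a minimal sketch that prints the gap between consecutive "invoked oom-killer" lines. The function name oom_gaps is made up for illustration. It assumes the default journalctl short timestamp format ("Mon DD HH:MM:SS host ...") and that all events fall on the same day, which holds for the excerpt in this report. The sample input reuses the timestamps from the journal output above, with the tail of each kernel message trimmed.

```shell
#!/usr/bin/env bash
# oom_gaps (hypothetical helper): read journal lines on stdin and print, for
# each "invoked oom-killer" line, its time and the gap to the previous one.
# Assumption: short journalctl format, all events on the same calendar day.
oom_gaps() {
  grep "invoked oom-killer" | awk '
    {
      split($3, t, ":")                      # $3 is HH:MM:SS
      now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
      if (NR > 1) { printf "%s +%ds\n", $3, now - prev }
      else        { printf "%s first\n", $3 }
      prev = now
    }'
}

# Timestamps copied from the report; message tails trimmed for brevity.
oom_gaps <<'EOF'
Feb 14 09:11:28 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:11:30 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:11:44 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:12:13 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:12:55 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:14:23 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:17:15 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:22:18 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:27:29 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:32:40 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:37:52 host0 kernel: machine-config- invoked oom-killer:
Feb 14 09:43:03 host0 kernel: machine-config- invoked oom-killer:
EOF
```

Against the real file it could be run as `oom_gaps < journalctl_--no-pager`. On the timestamps above, the gaps widen from a couple of seconds to roughly five minutes and then stay there, which suggests that whatever restarts the process is backing off between attempts. The journal excerpt alone does not say which component is doing the restarting, so this is an inference, not a confirmed cause.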
- duplicates
-
OCPBUGS-30635 [OCP 4.12] Baremetal installation failing sometimes with timeout
- Closed