OCPBUGS-30634

[OCP 4.12] Baremetal installation sometimes fails with timeout

      Description of problem:

      Requesting help with the root cause analysis (RCA) of a failed ABI installation on baremetal hardware. The installation fails with a timeout (1h). Some issues have been spotted, but we are still trying to reconstruct the full sequence of events.
      It is a disconnected environment and the data has been anonymized. The workaround was to install 4.14.10.

      Version-Release number of selected component (if applicable):

      4.12.28

      How reproducible:

      The installation fails intermittently: most of the time, but not always.

      Steps to Reproduce:

          1. 
          2.
          3.
          

      Actual results:

      The cluster fails to install most of the time.

      Expected results:

      Cluster installation succeeds

      Additional info:

      - The installation period was from:
        install_started_at: 2024-02-14 09:08:24.617000+00:00
        install_completed_at: 2024-02-14 10:15:06.758000+00:00
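
For reference, the two timestamps above are about 66 minutes apart (4002 seconds, ignoring fractions), slightly more than the 1h timeout. A minimal sketch for checking this kind of interval, assuming both timestamps fall on the same day; the helper name `elapsed_seconds` is hypothetical and not part of any OpenShift tooling:

```shell
# Hypothetical helper: seconds between two same-day HH:MM:SS timestamps.
# Fractions of a second and time zones are deliberately ignored.
elapsed_seconds() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    split(s, a, ":")   # start: hours, minutes, seconds
    split(e, b, ":")   # end:   hours, minutes, seconds
    print (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
  }'
}
```

For example, `elapsed_seconds 09:08:24 10:15:06` covers install_started_at to install_completed_at. The same helper can be pointed at any two timestamps taken from the journal excerpts in this report.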
      
      - The bootstrap node was still active when the installation failed; masters 1 and 2 still reported installation in progress:
      {'current_stage': 'Configuring', 'installation_percentage': 71, 'stage_started_at': '2024-02-14T09:48:27.164Z', 'stage_updated_at': '2024-02-14T09:48:27.164Z'}
      
      - One of the masters was clearly behind in the number of pods:
      $ for i in sosreport-host0-2024-02-14-*; do echo $i; ls -l $i/var/log/pods | wc -l; done
      sosreport-host0-2024-02-14-lggtczy
      79
      sosreport-host0-2024-02-14-nnkxviw
      56
      sosreport-host0-2024-02-14-zwqlnpj
      4
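
Note that `ls -l <dir> | wc -l` also counts the `total` header line that `ls -l` prints, so each figure above is probably one higher than the real number of entries. The relative difference between hosts is unaffected. A minimal sketch that counts only the pod directories, assuming one directory per pod under var/log/pods and a `find` that supports `-mindepth`/`-maxdepth` (GNU and BSD do); the function name `count_pod_dirs` is hypothetical:

```shell
# Hypothetical helper: count the pod log directories inside one sosreport.
# $1 is the sosreport root, e.g. sosreport-host0-2024-02-14-zwqlnpj
count_pod_dirs() {
  # Only direct child directories are counted; tr strips the padding
  # that some wc implementations add around the number.
  find "$1/var/log/pods" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}
```

It can replace the `ls -l | wc -l` part of the loop above: `for i in sosreport-host0-2024-02-14-*; do echo "$i: $(count_pod_dirs "$i")"; done`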
      
      - On the master with only 4 pods, the OOM killer is invoked repeatedly by a machine-config- process:
      
      $ grep "invoked oom-killer" journalctl_--no-pager
      Feb 14 09:11:28 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:11:30 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:11:44 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:12:13 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:12:55 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:14:23 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:17:15 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:22:18 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:27:29 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:32:40 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:37:52 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
      Feb 14 09:43:03 host0 kernel: machine-config- invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=999
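
The excerpt above contains 12 OOM-killer invocations between 09:11:28 and 09:43:03, all from a process shown as `machine-config-` (the kernel truncates process names to 15 characters, so the full name is not visible here). A minimal sketch for summarizing such events per process, assuming the syslog-style journal format shown above, where the process name is the sixth whitespace-separated field; the function name `oom_summary` is hypothetical:

```shell
# Hypothetical helper: summarize "invoked oom-killer" lines per process.
# $1 is a journal dump such as journalctl_--no-pager from the sosreport.
oom_summary() {
  awk '
    /invoked oom-killer/ {
      # Fields: month day time host "kernel:" comm "invoked" "oom-killer:" ...
      comm = $6
      ts = $1 " " $2 " " $3
      count[comm]++
      if (!(comm in first)) first[comm] = ts   # first occurrence per process
      last[comm] = ts                          # most recent occurrence
    }
    END {
      for (c in count)
        printf "%s count=%d first=\"%s\" last=\"%s\"\n", c, count[c], first[c], last[c]
    }
  ' "$1"
}
```

Running `oom_summary journalctl_--no-pager` in each sosreport directory gives a quick per-host comparison, which is how the difference with the other two hosts could be confirmed.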
      
      This doesn't happen on the other two hosts.
      
      - The pod finally manages to start and bootkube.sh kicks in:
      
      Feb 14 09:53:11 host0 bootkube.sh[60437]: Starting cluster-bootstrap...
      Feb 14 09:53:11 host0 bootkube.sh[63194]: Starting temporary bootstrap control plane...
      Feb 14 09:53:11 host0 bootkube.sh[63194]: Waiting up to 20m0s for the Kubernetes API
      
      Even so, the cluster installation did not complete in time.

              bfournie@redhat.com Robert Fournier
              rhn-support-mabajodu Mario Abajo Duran
              Manoj Hans Manoj Hans
              Mario Abajo Duran