OpenShift Bugs / OCPBUGS-34953

Log bundle analyzer gives bogus analysis when failing to download bootstrap logs


    • -
    • Low
    • No
    • False
      * Previously, the `openshift-install` CLI sometimes failed to connect to the bootstrap node when collecting bootstrap gather logs. In that case, the installation program reported a misleading error message such as `The bootstrap machine did not execute the release-image.service systemd unit`. With this release, when the bootstrap gather logs cannot be collected, the installation program instead reports `Invalid log bundle or the bootstrap machine could not be reached and bootstrap logs were not collected`, which is a more accurate error message.
      (link:https://issues.redhat.com/browse/OCPBUGS-34953[*OCPBUGS-34953*])
    • Bug Fix
    • Done

      Description of problem: When the bootstrap times out, the installer tries to download the logs from the bootstrap VM and analyzes them to explain what happened. On the OpenStack platform, we're currently failing to download the bootstrap logs (tracked in OCPBUGS-34950), which causes the analysis to always return an erroneous message:

      time="2024-06-05T08:34:45-04:00" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
      time="2024-06-05T08:34:45-04:00" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
      time="2024-06-05T08:34:45-04:00" level=error msg="The bootstrap machine did not execute the release-image.service systemd unit"
      

      The claim that the bootstrap machine did not execute the release-image.service systemd unit is wrong, as I confirmed by SSHing into the bootstrap node:

      systemctl status release-image.service
      ● release-image.service - Download the OpenShift Release Image
           Loaded: loaded (/etc/systemd/system/release-image.service; static)
           Active: active (exited) since Wed 2024-06-05 11:57:33 UTC; 1h 16min ago
          Process: 2159 ExecStart=/usr/local/bin/release-image-download.sh (code=exited, status=0/SUCCESS)
         Main PID: 2159 (code=exited, status=0/SUCCESS)
              CPU: 47.364s
      
      Jun 05 11:57:05 mandre-tnvc8bootstrap systemd[1]: Starting Download the OpenShift Release Image...
      Jun 05 11:57:06 mandre-tnvc8bootstrap podman[2184]: 2024-06-05 11:57:06.895418265 +0000 UTC m=+0.811028632 system refresh
      Jun 05 11:57:06 mandre-tnvc8bootstrap release-image-download.sh[2159]: Pulling quay.io/openshift-release-dev/ocp-release@sha256:31cdf34b1957996d5c79c48466abab2fcfb9d9843>
      Jun 05 11:57:32 mandre-tnvc8bootstrap release-image-download.sh[2269]: 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d10e64dd0ec125d
      Jun 05 11:57:32 mandre-tnvc8bootstrap podman[2269]: 2024-06-05 11:57:32.82473216 +0000 UTC m=+25.848290388 image pull 079f5c86b015ddaf9c41349ba292d7a5487be91dd48e48852d1>
      Jun 05 11:57:33 mandre-tnvc8bootstrap systemd[1]: Finished Download the OpenShift Release Image.
      

      The installer was just unable to retrieve the bootstrap logs. Earlier, buried in the installer logs, we can see:

      time="2024-06-05T08:34:42-04:00" level=info msg="Failed to gather bootstrap logs: failed to connect to the bootstrap machine: dial tcp 10.196.2.10:22: connect: connection timed out"
      

      This is what should be reported by the analyzer.
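
      To illustrate the expected behavior, here is a minimal sketch in Python. It is not the installer's actual Go code: `LogBundle`, `analyze`, the `bootstrap_units` and `gather_error` fields, and the final "was executed" string are all hypothetical names made up for this example. Only the two error messages are quoted from this issue. The idea is that an analyzer should first check whether any bootstrap logs were collected at all, and only then draw conclusions about individual systemd units.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# The two messages below are quoted from this issue.
MSG_NOT_COLLECTED = (
    "Invalid log bundle or the bootstrap machine could not be reached "
    "and bootstrap logs were not collected"
)
MSG_RELEASE_IMAGE_NOT_RUN = (
    "The bootstrap machine did not execute the release-image.service systemd unit"
)


@dataclass
class LogBundle:
    """Hypothetical view of a gathered log bundle (not the installer's type)."""

    # Unit name -> collected journal text. Empty when the bootstrap gather failed.
    bootstrap_units: Dict[str, str] = field(default_factory=dict)
    # Error recorded while gathering, if any (assumed to be kept by the caller).
    gather_error: Optional[str] = None


def analyze(bundle: LogBundle) -> str:
    """Return a one-line analysis of the bundle."""
    # No bootstrap data at all: this is missing evidence, not proof that
    # release-image.service never ran, so say so and surface the gather error.
    if not bundle.bootstrap_units:
        if bundle.gather_error:
            return f"{MSG_NOT_COLLECTED}: {bundle.gather_error}"
        return MSG_NOT_COLLECTED

    # Logs were collected, so a missing unit is now a meaningful finding.
    if "release-image.service" not in bundle.bootstrap_units:
        return MSG_RELEASE_IMAGE_NOT_RUN

    return "release-image.service was executed"


if __name__ == "__main__":
    failed_gather = LogBundle(
        gather_error=(
            "failed to connect to the bootstrap machine: "
            "dial tcp 10.196.2.10:22: connect: connection timed out"
        )
    )
    print(analyze(failed_gather))
```

      Appending the gather error to the message is a design choice of this sketch, so that the connection failure is no longer buried at info level earlier in the log.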

              rdossant Rafael Fonseca dos Santos
              maandre@redhat.com Martin André
              Gaoyun Pei Gaoyun Pei