  1. OpenShift Installer
  2. CORS-1510

Reduce the number of times the team has to triage the issue behind bootstrap failures

    • 4.7 Reduce bootstrap failure triage interactions
    • Done
    • OCPPLAN-4444 - Installer Sustainability
    • 0% To Do, 0% In Progress, 100% Done

      Goal:

      As a member of the installer team, I would like to reduce the number of times I have to triage the underlying cause of a bootstrap failure.

      Problem:

      Currently the installer's error for a failed bootstrap is very generic. People on the team have to look at the log bundle (if provided) and sort through the logs from various machines to figure out the reasons behind the bootstrap failure.

      There are some well-known failure cases: access to pull container images is misconfigured, the DNS setup is missing, the control plane machines failed to boot, and so on. Other failures are caused by operators on the bootstrap machine, for example etcd failing to start or the bootstrap control plane being broken. All of these failures look the same to the user, "bootstrap failed", and therefore become bugs for the installer team to triage.

       

      Why is this important:

      • The number of bugs that end up in the "bootstrap failed" bucket is quite large, and triaging each one takes time. Because the error is the same, users attach themselves to an existing bug with the same symptom even when the underlying cause is different, which adds noise to the bugs.

      Previous Work:

      Prioritized epics + deliverables (in scope / not in scope):

      • Differentiate between errors from failing to bring up the bootstrap control plane and errors from failing to bootstrap the real API.
      • Identify the most common failures when bringing up the bootstrap control plane.
      • Codify the failures identified above and surface them to the user as errors.
      • Identify the most common failures when completing bootstrapping to the real API.
      • Codify the failures identified above and surface them to the user as errors.
      • Identify when certain failures can be triaged to different operator teams.
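
The deliverables above could be sketched roughly as follows. This is only an illustration in Python, not the installer's actual code: the `Stage`, `Rule`, `classify` and `format_error` names, the log substrings, the owner labels, and the assignment of each cause to a phase are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Stage(Enum):
    """The two bootstrap phases the epic wants to tell apart."""
    BOOTSTRAP_CONTROL_PLANE = "bringing up the bootstrap control plane"
    REAL_API = "bootstrapping to the real API"


@dataclass(frozen=True)
class Rule:
    pattern: str  # hypothetical lowercase substring to look for in a log line
    stage: Stage  # assumed phase; the epic does not assign causes to phases
    reason: str   # message surfaced to the user
    owner: str    # hypothetical label for who should triage the failure


# Codified failures. Every pattern here is invented for illustration.
RULES = [
    Rule("failed to pull image", Stage.BOOTSTRAP_CONTROL_PLANE,
         "container images could not be pulled; check pull access", "user"),
    Rule("no such host", Stage.BOOTSTRAP_CONTROL_PLANE,
         "DNS setup is missing or incomplete", "user"),
    Rule("etcd failed to start", Stage.BOOTSTRAP_CONTROL_PLANE,
         "etcd is failing to start", "etcd"),
    Rule("control plane machine failed to boot", Stage.REAL_API,
         "control plane machines failed to boot", "installer"),
]


def classify(log_lines: Iterable[str]) -> Optional[Rule]:
    """Return the rule matched by the earliest matching log line, or None."""
    for line in log_lines:
        lowered = line.lower()
        for rule in RULES:
            if rule.pattern in lowered:
                return rule
    return None


def format_error(rule: Optional[Rule]) -> str:
    """Turn a classification into a user-facing error message."""
    if rule is None:
        # Nothing matched: fall back to the generic error.
        return "bootstrap failed: cause unknown, please attach the log bundle"
    return (f"bootstrap failed while {rule.stage.value}: "
            f"{rule.reason} (triage: {rule.owner})")


if __name__ == "__main__":
    sample = ['level=error msg="lookup api.example.test: no such host"']
    print(format_error(classify(sample)))
```

Keeping the known failures in a flat rule table means a newly identified failure is one more entry, and the owner field gives a place to record which team a failure should be routed to. Anything the table does not recognize still falls back to the generic message.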

      Estimate (XS, S, M, L, XL, XXL): L

       

              adahiyaredhat Abhinav Dahiya (Inactive)
              To Hung Sze