OpenShift Specialist Platform Team / SPLAT-1870

[pext-aws][CI] Collect bootstrap logs from SSH when API is down

    • Type: Task
    • Resolution: Unresolved
    • Priority: Normal

      As an OCP engineer, I would like to know why bootstrap failed to install on Platform External clusters on AWS when the API is down in step "platform-external-cluster-wait-for-api-bootstrap". Collecting logs through SSH would give me more information while investigating the root cause of failures.

      Acceptance criteria:

      • Attempt to collect systemd logs through SSH in step [1] when the API is down
        • Show a preview of the logs (tail?) before exiting with failure
      • Review the timeout and adjust it if needed (the jobs [2] took 20+ minutes to give feedback)
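
A rough sketch of the first criterion, to make the discussion concrete. It is not the step's actual implementation. It assumes the bootstrap node is reachable over SSH as user `core`, and that the systemd units of interest are `bootkube.service` and `release-image.service`. The function names, the 120s collection timeout and the 50-line preview are hypothetical.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumption: units worth collecting from the bootstrap node; adjust as needed.
DEFAULT_UNITS = ("bootkube.service", "release-image.service")


def build_ssh_command(host: str, key_path: str,
                      units: Sequence[str] = DEFAULT_UNITS,
                      user: str = "core",
                      connect_timeout: int = 10) -> List[str]:
    """Build the ssh argv that dumps the current boot's journal for `units`."""
    journal = ["journalctl", "--no-pager", "-b"]
    for unit in units:
        journal += ["-u", unit]
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",               # never prompt in CI
        "-o", "StrictHostKeyChecking=no",    # ephemeral CI host
        "-o", "UserKnownHostsFile=/dev/null",
        "-o", f"ConnectTimeout={connect_timeout}",
        f"{user}@{host}",
        " ".join(journal),
    ]


def tail_preview(text: str, lines: int = 50) -> str:
    """Return the last `lines` lines of `text` for printing in the job log."""
    if lines <= 0:
        return ""
    return "\n".join(text.splitlines()[-lines:])


def collect_bootstrap_logs(host: str, key_path: str, out_path: str,
                           preview_lines: int = 50,
                           run: Callable = subprocess.run) -> bool:
    """Best-effort collection: save the full journal, print a tail preview."""
    cmd = build_ssh_command(host, key_path)
    try:
        result = run(cmd, capture_output=True, text=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired) as err:
        print(f"bootstrap log collection failed: {err}")
        return False
    with open(out_path, "w") as fh:
        fh.write(result.stdout)
    print(tail_preview(result.stdout, preview_lines))
    return result.returncode == 0
```

Collection is best-effort on purpose: a failed SSH attempt should not hide the original "API down" failure, so the function reports the error and returns `False` instead of raising.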

       

      Engineering references:

      [1] step failure e2e-external-aws-ccm-platform-external-cluster-wait-for-api-bootstrap

      ~~~
      E1027 00:49:55.231610 1000 memcache.go:265] couldn't get current server API group list: Get "https://api.ci-op-stflmg87-cd183.origin-ci-int-aws.dev.rhcloud.com:6443/api?timeout=32s": dial tcp 3.23.214.125:6443: connect: connection refused
      The connection to the server api.ci-op-stflmg87-cd183.origin-ci-int-aws.dev.rhcloud.com:6443 was refused - did you specify the right host or port?
      2024-10-27 00:49:55+00:00 - API DOWN, waiting 30s...
      {"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:267","func":"sigs.k8s.io/prow/pkg/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 10m0s grace period","severity":"error","time":"2024-10-27T00:50:05Z"}
      {"component":"entrypoint","error":"process timed out","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:84","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.internalRun","level":"error","msg":"Error executing test process","severity":"error","time":"2024-10-27T00:50:05Z"}
      error: failed to execute wrapped command: exit status 127
      INFO[2024-10-27T00:50:05Z] Step e2e-external-aws-ccm-platform-external-cluster-wait-for-api-bootstrap failed after 25m10s.
      ~~~
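
In the trace above, the step keeps polling every 30s until Prow terminates it, so it exits without any diagnostics. One possible approach to the timeout criterion is to give the wait loop its own deadline, shorter than the step timeout, so the script still has time to collect logs before failing. A minimal sketch follows. The function name and parameters are hypothetical, and `probe` stands in for whatever API check the step uses.

```python
import time
from typing import Callable


def wait_for_api(probe: Callable[[], bool],
                 deadline_s: float,
                 interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or `deadline_s` seconds have elapsed.

    Returns True when the API answered, False when the deadline expired.
    The caller can then collect logs and exit with a failure on its own
    terms instead of being killed by the CI runner.
    """
    start = clock()
    while True:
        if probe():
            return True
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

The clock and sleep functions are injectable so the loop can be exercised without waiting in real time.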

      [2] References / job failures:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-external-aws-ccm/1850326952720732160

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-e2e-external-aws/1850326950212538368

      Reported to Slack channel https://redhat-internal.slack.com/archives/C064717U0D7/p1729990830715289

      Assignee: Unassigned
      Reporter: Marco Braga (rhn-support-mrbraga)