  1. Red Hat Workload Availability
  2. RHWA-100

NHC: E2E tests are unstable

    • Type: Task
    • Resolution: Done
    • Priority: Undefined
    • rhwa-25.8
    • Node Healthcheck

      In recent NHC end-to-end (e2e) test runs on OpenShift Container Platform (OCP) 4.20, tests consistently fail with `rpc error: code = Unavailable desc = error reading from server: read: connection reset by peer` and `ContainerFailed` errors.

      The failures appear to be caused by API server instability that occurs after the Node HealthCheck (NHC) tests have completed, during the subsequent steps that prepare the Machine Health Check (MHC) tests. The error surfaces in those preparation steps of the `./hack/test-e2e.sh` script.

      This behavior appears to be new, or at least more frequent, in OCP 4.18 and later.

      *Proposed Solutions / Ideas:*
      1. *Add a retry mechanism:* Implement retries in the `./hack/test-e2e.sh` script for the affected steps.
      2. *Refactor into code:* Move the problematic test preparation steps into the Go test code itself and leverage Ginkgo/Gomega's `Eventually` matcher for more robust and resilient waiting.

      This issue needs to be tracked to ensure the stability of our e2e testing and to investigate potential underlying API server behavior.

              slintes Marc Sluiter
              mshitrit@redhat.com Michael Shitrit