-
Task
-
Resolution: Done
During recent Node HealthCheck (NHC) end-to-end (e2e) test runs on OpenShift Container Platform (OCP) 4.20, we consistently observe test failures with `rpc error: code = Unavailable desc = error reading from server: read: connection reset by peer` and `ContainerFailed` errors.
The failures appear to be related to API server instability that occurs specifically after the NHC tests have completed, during the subsequent steps that prepare the Machine Health Check (MHC) tests. The error manifests around here in the `./hack/test-e2e.sh` script.
This behavior seems to be new, or at least more frequent, in OCP 4.18+.
*Proposed Solutions / Ideas:*
1. *Add a retry mechanism:* Implement retries in the `./hack/test-e2e.sh` script for the affected steps.
2. *Refactor into code:* Move the problematic test preparation steps into the Go test code itself and leverage Ginkgo/Gomega's `Eventually` matcher for more robust and resilient waiting.
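As a rough illustration of the first idea, a retry wrapper for the script could look something like the sketch below. The `retry` function, its argument order, and the attempt and delay values are hypothetical and not taken from `./hack/test-e2e.sh`; the steps that actually need wrapping still have to be identified.

```shell
#!/usr/bin/env bash

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of the existing script): runs COMMAND until it
# succeeds or MAX_ATTEMPTS is reached, sleeping DELAY_SECONDS between attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1

  # The command runs as the loop condition, so a failure here does not
  # abort the script even when `set -e` is active.
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt}/${max_attempts} failed, retrying in ${delay}s: $*" >&2
    attempt=$(( attempt + 1 ))
    sleep "${delay}"
  done
}
```

An affected step could then be wrapped as, for example, `retry 5 10 oc get nodes`, where `oc get nodes` is only a placeholder for whichever command hits the API server during MHC test preparation. A fixed delay keeps the sketch simple; a backoff may be a better fit if the API server needs longer to recover.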
This issue needs to be tracked to ensure the stability of our e2e testing and to investigate potential underlying API server behavior.