-
Bug
-
Resolution: Done-Errata
-
Critical
-
None
-
4.14.0
-
Moderate
-
No
-
3
-
Sprint 243 - OSIntegration, Sprint 244 - OSIntegration
-
2
-
Approved
-
False
-
-
N/A
-
Release Note Not Required
Description of problem
Rehearsals of the hive e2e tests against 4.14 nightlies have been failing consistently. The failing section tests hive MachinePools, which generate and scale MachineSets on the spoke (target cluster). The failure can happen at any of several points in the test where we scale up: one or more Machines hang in the Provisioned state, and the test times out after 15m waiting for the corresponding Node(s) to appear and become healthy.
I reproduced this locally and looked at the stuck instances in the AWS console. Each shows 1 of 2 status checks failing; the failing check says "Instance reachability check failed".
I'm attaching serial console logs from a bad instance as well as a good one. (These are my first ever: I don't know how to read them, or even if I captured them correctly. Please let me know if you need something else/again/different.)
Version-Release number of selected component (if applicable)
4.14 nightlies (candidate stream) for at least a couple months.
How reproducible:
Very. I won't say 100%, but it's close.
Steps to Reproduce
Via hive:
1. Provision a spoke on AWS using a 4.14 nightly release image
2. Set CLUSTER_NAME and CLUSTER_NAMESPACE env vars
3. Run go test ./test/e2e/postinstall/machinesets/...
Test will (usually) fail, complaining of timeout waiting for nodes.
Without hive (speculative):
1. Install a 4.14 cluster on AWS
2. Scale the default worker pool down to 1 replica.
3. Scale it back up to 3 replicas
4. Watch machines/nodes. One or more will get stuck.
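For step 4, a rough sketch of how stuck Machines might be picked out of the Machine list, assuming the JSON shape returned by `oc get machines -n openshift-machine-api -o json` (a `status.phase` field and an RFC 3339 `metadata.creationTimestamp`). The function name, the sample Machine names, and the reuse of the test's 15m timeout as the threshold are illustrative assumptions, not part of the hive test itself.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    """Parse a Kubernetes-style UTC timestamp such as 2023-09-01T12:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_machines(machine_list, now, timeout=timedelta(minutes=15)):
    """Return names of Machines still in phase Provisioned after `timeout`.

    `machine_list` is the parsed JSON of a Machine list. The 15-minute
    default mirrors the e2e test timeout mentioned in the description;
    using it as the "stuck" threshold is an assumption.
    """
    stuck = []
    for machine in machine_list.get("items", []):
        phase = machine.get("status", {}).get("phase")
        created = parse_ts(machine["metadata"]["creationTimestamp"])
        if phase == "Provisioned" and now - created > timeout:
            stuck.append(machine["metadata"]["name"])
    return stuck


# Hypothetical sample data: one healthy Machine, one hung in Provisioned.
sample = {
    "items": [
        {
            "metadata": {"name": "worker-a", "creationTimestamp": "2023-09-01T12:00:00Z"},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "worker-b", "creationTimestamp": "2023-09-01T12:00:00Z"},
            "status": {"phase": "Provisioned"},
        },
    ]
}
sample_now = parse_ts("2023-09-01T12:20:00Z")
print(stuck_machines(sample, sample_now))
```

Against a live cluster, the same function could be fed the parsed output of the `oc get machines` command above, with `datetime.now(timezone.utc)` as `now`.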
Actual results
Nodes don't become healthy.
Expected results
Nodes become healthy.
Additional info
I have an environment set up where I can reproduce this, usually within tens of minutes. Let me know if you want access.
- clones
-
OCPBUGS-20198 4.14/AWS: Machines using m4 instance types don't get network
- Closed
- depends on
-
OCPBUGS-20198 4.14/AWS: Machines using m4 instance types don't get network
- Closed
- duplicates
-
OCPBUGS-16724 c4.* instanceType stuck in Provisioned on AWS 4.14
- Closed
- is blocked by
-
OCPBUGS-20357 [4.14] Bootimage bump tracker
- Closed
- is duplicated by
-
OCPBUGS-19870 permafailing install on some jobs: CSR never created (possibly aws m4 instance related)
- Closed
- is related to
-
OCPBUGS-19870 permafailing install on some jobs: CSR never created (possibly aws m4 instance related)
- Closed
- relates to
-
HIVE-2232 Failed to provision hive cluster using 4.14 nightly image.
- Closed
- links to
-
RHSA-2023:5006 OpenShift Container Platform 4.14.0 security update