Story
Resolution: Unresolved
Major
CLOUD Ready for Development, CLOUD Sprint 241, CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261
User Story
As a developer of CPMS, I want to ensure unhealthy nodes can be replaced, so that we can recommend CPMS to users.
Background
QE have some manual test cases covering a couple of unhappy scenarios for the CPMS, each of which should result in automatic recovery.
I would like to see these automated as part of the periodic suite for CPMS.
The behaviour itself isn't really dependent on CPMS, but the whole workflow is.
It is primarily driven by other components and how they react, but when those components misbehave they block CPMS from operating as expected.
The two cases I would like to see added are:
- Terminate an instance on the cloud provider
  - Once the instance is terminated, the node object should get removed
  - Once the node object is removed, the machine should enter a failed state
  - Terminate the Machine
  - Eventually a new Machine comes up
  - Eventually the old Machine goes away
  - Eventually the cluster stabilises
- Terminate the kubelet on the node
  - SSH to the node and terminate kubelet
  - Eventually the node goes unready (per its Ready condition)
  - Delete the Machine object (MHC would do this in the real world)
  - Eventually a new Machine becomes ready
  - Eventually the old Machine goes away
  - Eventually the cluster stabilises
Steps
- Review the previous bug and Daniel's work to understand what got broken
- Understand how to terminate an instance in the cloud from a test
- Understand how to stop kubelet from a test (oc debug? SSH credentials in the cluster?)
- Write the tests as described above
- Ensure the tests actually pass
Stakeholders
- Cluster Infra
- TRT
- etcd
Definition of Done
- Tests are added to the existing periodic suite