Type: Story
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
- groomed

Story Points: 3
Blocked: False
Blocked Reason: None
Ready: False
Intelligence Requested:
Market:

Original story points: 3
Sprint: CLOUD Ready for Development, CLOUD Sprint 241, CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

User Story

As a developer of CPMS, I want to ensure that unhealthy nodes can be replaced, so that we can recommend CPMS to users.

Background

QE has some manual test cases covering a couple of unhappy scenarios for CPMS that should result in automatic recovery.

I would like to see these automated as part of the periodic suite for CPMS.

The behaviour itself isn't really dependent on CPMS, but the whole workflow is.
The behaviour is primarily driven by other components and how they react, but if those components misbehave, they block CPMS from operating as expected.

The two cases I would like to see added are:

Terminate an instance on the cloud provider
- Once terminated, the node object should get removed
- Once the node object is removed, the machine should enter a failed state
- Terminate the Machine
- Eventually a new Machine comes up
- Eventually the old Machine goes away
- Eventually the cluster stabilises
Terminate the kubelet on the node
- SSH to the node and terminate kubelet
- Eventually the node reports an unready condition
- Delete the Machine object (MHC would do this in the real world)
- Eventually a new Machine becomes ready
- Eventually the old Machine goes away
- Eventually the cluster stabilises
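
Both cases follow the same shape: trigger a failure, then poll until each "eventually" step holds. The sketch below illustrates that shape for the first case only. It is not the suite's implementation: `eventually`, `FakeCluster`, and `run_terminated_instance_case` are hypothetical names, and the fake cluster is an in-memory stand-in that reacts immediately, whereas a real test would query the cluster and cloud provider.

```python
import time
from typing import Callable, List


def eventually(condition: Callable[[], bool], timeout: float = 1.0,
               interval: float = 0.01) -> bool:
    """Poll condition until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster; not a real API."""

    def __init__(self) -> None:
        self.nodes = {"master-0"}
        self.machines = {"master-0": "Running"}

    def terminate_instance(self, name: str) -> None:
        # Simulates the provider removing the VM: the node object goes
        # away and the Machine is marked Failed.
        self.nodes.discard(name)
        self.machines[name] = "Failed"

    def delete_machine(self, name: str, replacement: str) -> None:
        # Simulates the replacement flow: a new Machine and node appear,
        # then the old Machine is removed.
        self.machines[replacement] = "Running"
        self.nodes.add(replacement)
        del self.machines[name]

    def is_stable(self) -> bool:
        # Stable here means every Machine is Running and has a node.
        return all(phase == "Running" and name in self.nodes
                   for name, phase in self.machines.items())


def run_terminated_instance_case(cluster: FakeCluster, old: str,
                                 new: str) -> List[str]:
    """Run the first case; return the steps that passed, in order."""
    passed: List[str] = []

    def check(label: str, condition: Callable[[], bool]) -> bool:
        ok = eventually(condition)
        if ok:
            passed.append(label)
        return ok

    cluster.terminate_instance(old)
    if not check("node removed", lambda: old not in cluster.nodes):
        return passed
    if not check("machine failed",
                 lambda: cluster.machines.get(old) == "Failed"):
        return passed
    cluster.delete_machine(old, new)
    if not check("new machine running",
                 lambda: cluster.machines.get(new) == "Running"):
        return passed
    if not check("old machine gone", lambda: old not in cluster.machines):
        return passed
    check("cluster stable", cluster.is_stable)
    return passed
```

Returning the list of passed steps, rather than a single boolean, makes it obvious which step of the workflow stalled when a run times out.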

Steps

Review the previous bug and Daniel's work to understand what got broken
Understand how to terminate an instance in the cloud from a test
Understand how to stop kubelet from a test (oc debug? SSH credentials in the cluster?)
Write the tests as described above
Ensure the tests actually pass
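
For the kubelet step, one of the options floated above is `oc debug`. The sketch below only builds the command line and does not run it; it assumes an `oc` binary on the PATH and sufficient privileges to start a debug pod on the node. The helper name `stop_kubelet_command` is hypothetical, and whether this approach works from the periodic suite is still to be confirmed (the debug session may be cut off once the kubelet stops).

```python
from typing import List


def stop_kubelet_command(node_name: str) -> List[str]:
    """Build an argv list that stops kubelet on a node via `oc debug`."""
    if not node_name:
        raise ValueError("node_name must not be empty")
    return [
        "oc", "debug", f"node/{node_name}",
        "--",
        # chroot into the host filesystem so systemctl talks to the
        # node's own systemd rather than the debug container.
        "chroot", "/host",
        "systemctl", "stop", "kubelet",
    ]
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting issues with the node name.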

Stakeholders

Cluster Infra
TRT
etcd

Definition of Done

Tests are added to the existing periodic suite

links to

openshift/cluster-control-plane-machine-set-operator#239: OCPCLOUD-2167: Test unhealthy node cases

openshift/cluster-control-plane-machine-set-operator#262: Revert #239 "OCPCLOUD-2167: Test unhealthy node cases"

openshift/cluster-control-plane-machine-set-operator#263: OCPBUGS-22864: Revert "Revert #239 "OCPCLOUD-2167: Test unhealthy node cases""

Assignee: Joel Speed

Reporter: Joel Speed

Votes: 0

Watchers: 5

Created: 2023/08/24 12:34 PM

Updated: 2025/03/06 2:43 PM
