Type: Story
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Ready: False
Epic Link: None
Story Points: 5
Target Version: None
Release Blocker: None
Sprint: None

Background

Test coverage for OCPBUGS-27300, which was fixed in 4.15.0. The bug prevented proper node draining during WMCO upgrades when pods had emptydir volumes attached.

Original Issue (Fixed):

OCPBUGS-27300: Node drain does not work correctly with local-data pods (Closed - Done-Errata)
Fix Version: 4.15.0
Customer Impact: Ford Motor Company, Aareal Bank AG
Related: OCPQE-18994 (QE test coverage task)

Problem:
During WMCO upgrades, nodes did not correctly drain pods with emptydir/local storage. WMCO was missing the DeleteEmptyDirData field in the node drain helper struct, causing:

Error: "cannot delete Pods with local storage"
All nodes cordoned simultaneously instead of rolling upgrade
Upgrade failures requiring manual intervention

Fix:
Added DeleteEmptyDirData: true to node drain helper struct in WMCO 4.15.0+

Test Objective

Validate that Windows nodes can be drained successfully when running workloads with emptydir volumes, and that draining happens in a controlled rolling fashion (not all nodes at once).

Test Design

Test Case: OCP-XXXXX - Verify node drain handles emptydir volumes during WMCO operations

Workload Setup

Deploy Windows workload with emptydir volumes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-emptydir-test
spec:
  replicas: 3  # Intended to spread across multiple Windows nodes
  selector:
    matchLabels:
      app: emptydir-test
  template:
    metadata:
      labels:
        app: emptydir-test  # Must match spec.selector.matchLabels
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: iis
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore
        volumeMounts:
        - name: temp-storage
          mountPath: C:\temp
        command:
        - powershell
        - -Command
        - |
          # Write data to emptydir to simulate local storage
          while ($true) {
            Get-Date | Out-File C:\temp\timestamp.txt -Append
            Start-Sleep -Seconds 10
          }
      volumes:
      - name: temp-storage
        emptyDir: {}

Test Steps

1. Deploy workload with emptydir volumes
- Create deployment with 3+ replicas across Windows nodes
- Verify pods are running and writing to emptydir
- Confirm pods distributed across multiple nodes
2. Trigger node drain scenario (primary: Option A)
- Option A (Preferred): Trigger WMCO upgrade by changing operator version
- Option B (Alternative): Manually cordon and drain a Windows node
- Option C (Alternative): Patch Windows node annotation to trigger WMCO reconciliation
3. Monitor drain behavior
- Watch for drain errors in WMCO logs: "cannot delete Pods with local storage"
- Monitor node drain sequence to ensure rolling drain (only one node at a time)
- Watch pod evictions and rescheduling
4. Validate drain behavior
- NO errors: "cannot delete Pods with local storage"
- Pods with emptydir volumes are evicted successfully
- Only ONE node drains at a time (rolling drain, not all nodes cordoned)
- Pods are rescheduled to other available nodes
- Workload maintains minimum availability during drain
5. Verify completion
- All Windows nodes completed drain/upgrade
- All workload pods running correctly
- No pods stuck in pending/evicting state
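
The "only one node at a time" check in the steps above is only meaningful if it is sampled repeatedly while the drain is running. A minimal sketch of such a polling loop follows. The `watchRollingDrain` function and the `sample` callback are hypothetical names; in a real test the callback would query the cluster for cordoned Windows nodes, which is replaced here by a fake sampler.

```go
package main

import (
	"fmt"
	"time"
)

// watchRollingDrain calls sample up to polls times, waiting interval
// between calls, and returns the first snapshot that contains more than
// one cordoned node. A nil result means every snapshot had at most one.
func watchRollingDrain(sample func() []string, polls int, interval time.Duration) []string {
	for i := 0; i < polls; i++ {
		if cordoned := sample(); len(cordoned) > 1 {
			return cordoned
		}
		time.Sleep(interval)
	}
	return nil
}

func main() {
	// Fake sampler standing in for a real cluster query (assumption):
	// it replays a fixed sequence of snapshots.
	snapshots := [][]string{{}, {"win-a"}, {}, {"win-b"}}
	i := 0
	sample := func() []string {
		s := snapshots[i%len(snapshots)]
		i++
		return s
	}

	violation := watchRollingDrain(sample, len(snapshots), 0)
	fmt.Println("rolling drain respected:", violation == nil)
}
```

Returning the offending snapshot rather than a boolean lets the test report which nodes were cordoned together.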

Key Assertions

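g.By("Verify no 'cannot delete Pods with local storage' errors in WMCO logs")
wmcoLogs := getWMCOLogs(oc, startTime)
o.Expect(wmcoLogs).NotTo(o.ContainSubstring("cannot delete Pods with local storage"))

g.By("Verify only one node draining at a time (not all nodes cordoned)")
// Run this check repeatedly while the drain is in progress; a single
// sample can miss a window where several nodes are cordoned.
cordonedNodes := getCordonedWindowsNodes(oc)
o.Expect(len(cordonedNodes)).To(o.BeNumerically("<=", 1),
    "Multiple nodes cordoned simultaneously - not rolling upgrade")

g.By("Verify pods with emptydir were successfully evicted and rescheduled")
pods := getPodsWithLabel(oc, "app=emptydir-test")
o.Expect(pods).To(o.HaveLen(3), "Expected 3 replicas running")
for _, pod := range pods {
    // Assumes getPodsWithLabel returns []corev1.Pod (corev1 = k8s.io/api/core/v1).
    // Status.Phase is a corev1.PodPhase, so compare against the typed constant;
    // o.Equal with a plain string would fail on the type mismatch.
    o.Expect(pod.Status.Phase).To(o.Equal(corev1.PodRunning))
}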
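
A substring check on the whole log only says that the error occurred somewhere. As a possible refinement, the sketch below returns the matching lines so a failing assertion can show them. The `drainErrorLines` name and the sample log lines are made up for illustration; the real WMCO log format is not specified in this story.

```go
package main

import (
	"fmt"
	"strings"
)

// localStorageErr is the error text quoted in this story.
const localStorageErr = "cannot delete Pods with local storage"

// drainErrorLines returns every log line that contains the local
// storage drain error, in the order they appear.
func drainErrorLines(logs string) []string {
	var hits []string
	for _, line := range strings.Split(logs, "\n") {
		if strings.Contains(line, localStorageErr) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Illustrative log text only; not real WMCO output.
	logs := "draining node win-a\n" +
		"error: cannot delete Pods with local storage: default/iis-1\n" +
		"drain complete"

	for _, line := range drainErrorLines(logs) {
		fmt.Println("drain error:", line)
	}
}
```

With Gomega, the result could then be asserted with `o.Expect(drainErrorLines(wmcoLogs)).To(o.BeEmpty())`, which prints the offending lines on failure.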

Test Variations (Optional)

Variation 1: During WMCO Upgrade

Trigger actual WMCO version upgrade
Validates full upgrade path

Variation 2: Manual Node Drain

Use oc adm drain on Windows node
Validates drain helper works correctly

Variation 3: Multiple Workload Types

StatefulSet with emptydir
DaemonSet with emptydir
Multiple emptydir volumes per pod

Why This Test Matters

Regression Prevention: Ensures the fix for OCPBUGS-27300 doesn't regress
Customer Impact: Ford Motor Company and other customers hit this issue
Critical Path: Node drain is essential for upgrades and maintenance
Real-world Scenario: Many Windows workloads use emptydir for temp files, caching, etc.

Implementation Notes

Helper Functions Needed

getWMCOLogs(oc, startTime) - Fetch WMCO logs since start time
getCordonedWindowsNodes(oc) - List cordoned Windows nodes
getPodsWithLabel(oc, label) - Get pods by label selector
triggerWMCOUpgrade(oc, version) - Trigger WMCO upgrade (if using Option A)
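
As one possible shape for the core of getCordonedWindowsNodes, the sketch below parses the JSON that `oc get nodes -l kubernetes.io/os=windows -o json` would return and keeps the nodes whose `spec.unschedulable` is true. The `cordonedNodeNames` function and `nodeList` type are hypothetical, and how the JSON is fetched through the `oc` test client is left out because that depends on the test framework.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeList captures only the fields of a Kubernetes NodeList that this
// check needs.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Unschedulable bool `json:"unschedulable"`
		} `json:"spec"`
	} `json:"items"`
}

// cordonedNodeNames returns the names of nodes marked unschedulable in
// the given NodeList JSON.
func cordonedNodeNames(nodesJSON []byte) ([]string, error) {
	var list nodeList
	if err := json.Unmarshal(nodesJSON, &list); err != nil {
		return nil, fmt.Errorf("parsing node list: %w", err)
	}
	var names []string
	for _, n := range list.Items {
		if n.Spec.Unschedulable {
			names = append(names, n.Metadata.Name)
		}
	}
	return names, nil
}

func main() {
	// Hand-written sample input; node names are made up.
	sample := []byte(`{"items":[
		{"metadata":{"name":"win-a"},"spec":{"unschedulable":true}},
		{"metadata":{"name":"win-b"},"spec":{}}
	]}`)

	names, err := cordonedNodeNames(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cordoned:", names)
}
```

Keeping the parsing separate from the cluster call makes the filter testable without a live cluster.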

Platforms to Test

AWS IPI
Azure IPI
GCP IPI
vSphere (optional)

Acceptance Criteria

[ ] Create Polarion test case (OCP-XXXXX)
[ ] Implement test automation in test/extended/winc/winc.go
[ ] Add helper functions to test/extended/winc/utils.go if needed
[ ] Test validates all key assertions
[ ] Test covers rolling drain behavior
[ ] CI passes on AWS, Azure, GCP
[ ] Test merged to master
[ ] Polarion test case marked as automated
[ ] OCPQE-18994 marked as complete

Related Issues

OCPBUGS-27300: Node drain does not work correctly with local-data pods (Closed - Fixed in 4.15.0)
OCPQE-18994: QE test coverage task (To Do)
OCPBUGS-22711: Backport to 4.14 (Closed)
OCPBUGS-18334: Duplicate issue (Closed)
