  1. OpenShift Windows Containers
  2. WINC-1620

Add test coverage: Verify node drain handles emptydir volumes during WMCO operations


    • Quality / Stability / Reliability

      Background

      This task adds test coverage for OCPBUGS-27300, which was fixed in 4.15.0. The bug prevented nodes from draining properly during WMCO upgrades when pods had emptydir volumes attached.

      Original Issue (Fixed):

      • OCPBUGS-27300: Node drain does not work correctly with local-data pods (Closed - Done-Errata)
      • Fix Version: 4.15.0
      • Customer Impact: Ford Motor Company, Aareal Bank AG
      • Related: OCPQE-18994 (QE test coverage task)

      Problem:
      During WMCO upgrades, nodes did not correctly drain pods with emptydir/local storage. WMCO was missing the DeleteEmptyDirData field in the node drain helper struct, causing:

      • Error: "cannot delete Pods with local storage"
      • All nodes cordoned simultaneously instead of one at a time in a rolling upgrade
      • Upgrade failures requiring manual intervention

      Fix:
      Added DeleteEmptyDirData: true to node drain helper struct in WMCO 4.15.0+

      Test Objective

      Validate that Windows nodes can be drained successfully when running workloads with emptydir volumes, and that draining happens in a controlled rolling fashion (not all nodes at once).

      Test Design

      Test Case: OCP-XXXXX - Verify node drain handles emptydir volumes during WMCO operations

      Workload Setup

      Deploy Windows workload with emptydir volumes:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: windows-emptydir-test
      spec:
        replicas: 3  # Spread across multiple Windows nodes
        selector:
          matchLabels:
            app: emptydir-test
        template:
          metadata:
            labels:
              app: emptydir-test  # Must match spec.selector.matchLabels
          spec:
            nodeSelector:
              kubernetes.io/os: windows
            containers:
            - name: iis
              image: mcr.microsoft.com/windows/servercore/iis:windowsservercore
              volumeMounts:
              - name: temp-storage
                mountPath: C:\temp
              command:
              - powershell
              - -Command
              - |
                # Write data to emptydir to simulate local storage
                while ($true) {
                  Get-Date | Out-File C:\temp\timestamp.txt -Append
                  Start-Sleep -Seconds 10
                }
            volumes:
            - name: temp-storage
              emptyDir: {}
      

      Test Steps

      1. Deploy workload with emptydir volumes
        • Create deployment with 3+ replicas across Windows nodes
        • Verify pods are running and writing to emptydir
        • Confirm pods distributed across multiple nodes
      2. Trigger node drain scenario (primary: Option A)
        • Option A (Preferred): Trigger WMCO upgrade by changing operator version
        • Option B (Alternative): Manually cordon and drain a Windows node
        • Option C (Alternative): Patch Windows node annotation to trigger WMCO reconciliation
      3. Monitor drain behavior:
        • Watch for drain errors in WMCO logs: "cannot delete Pods with local storage"
        • Monitor node drain sequence to ensure rolling drain (only one node at a time)
        • Watch pod evictions and rescheduling
      4. Validate drain behavior:
        • NO errors: "cannot delete Pods with local storage"
        • Pods with emptydir volumes are evicted successfully
        • Only ONE node drains at a time (rolling drain, not all nodes cordoned)
        • Pods are rescheduled to other available nodes
        • Workload maintains minimum availability during drain
      5. Verify completion:
        • All Windows nodes completed drain/upgrade
        • All workload pods running correctly
        • No pods stuck in pending/evicting state
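Steps 3 and 4 call for watching the drain sequence over time, since a single point-in-time check can miss a moment when several nodes are cordoned at once. One possible approach, sketched here with only the standard library, is to sample the cordoned-node list repeatedly and record the highest count seen. The `observeCordoned` function and the simulated timeline are assumptions for illustration; in the real test the `poll` callback would wrap a cluster query.

```go
package main

import (
	"fmt"
	"time"
)

// observeCordoned calls poll once per interval, for the given number of
// samples, and returns the largest number of cordoned nodes seen in any
// single sample.
func observeCordoned(poll func() []string, samples int, interval time.Duration) int {
	maxSeen := 0
	for i := 0; i < samples; i++ {
		if n := len(poll()); n > maxSeen {
			maxSeen = n
		}
		time.Sleep(interval)
	}
	return maxSeen
}

func main() {
	// Simulated timeline of a rolling drain: at most one node cordoned per sample.
	timeline := [][]string{{}, {"win-a"}, {"win-a"}, {}, {"win-b"}, {}}
	i := 0
	poll := func() []string {
		s := timeline[i%len(timeline)]
		i++
		return s
	}
	maxSeen := observeCordoned(poll, len(timeline), time.Millisecond)
	fmt.Println("max nodes cordoned at once:", maxSeen)
}
```

A result greater than 1 would indicate that the drain was not rolling. The sampling interval has to be short relative to how long a node stays cordoned, otherwise overlaps can slip between samples.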

      Key Assertions

      g.By("Verify no 'cannot delete Pods with local storage' errors in WMCO logs")
      wmcoLogs := getWMCOLogs(oc, startTime)
      o.Expect(wmcoLogs).NotTo(o.ContainSubstring("cannot delete Pods with local storage"))
      
      g.By("Verify only one node draining at a time (not all nodes cordoned)")
      cordonedNodes := getCordonedWindowsNodes(oc)
      o.Expect(len(cordonedNodes)).To(o.BeNumerically("<=", 1), 
          "Multiple nodes cordoned simultaneously - not rolling upgrade")
      
      g.By("Verify pods with emptydir were successfully evicted and rescheduled")
      pods := getPodsWithLabel(oc, "app=emptydir-test")
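      o.Expect(pods).To(o.HaveLen(3), "Expected 3 replicas running")
      for _, pod := range pods {
          // Phase is a typed corev1.PodPhase, so compare against the typed constant
          // (assumes getPodsWithLabel returns []corev1.Pod).
          o.Expect(pod.Status.Phase).To(o.Equal(corev1.PodRunning))
      }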
      

      Test Variations (Optional)

      Variation 1: During WMCO Upgrade

      • Trigger actual WMCO version upgrade
      • Validates full upgrade path

      Variation 2: Manual Node Drain

      • Use oc adm drain on Windows node
      • Validates drain helper works correctly

      Variation 3: Multiple Workload Types

      • StatefulSet with emptydir
      • DaemonSet with emptydir
      • Multiple emptydir volumes per pod

      Why This Test Matters

      1. Regression Prevention: Ensures the fix for OCPBUGS-27300 doesn't regress
      2. Customer Impact: Ford Motor Company and other customers hit this issue
      3. Critical Path: Node drain is essential for upgrades and maintenance
      4. Real-world Scenario: Many Windows workloads use emptydir for temp files, caching, etc.

      Implementation Notes

      Helper Functions Needed

      • getWMCOLogs(oc, startTime) - Fetch WMCO logs since start time
      • getCordonedWindowsNodes(oc) - List cordoned Windows nodes
      • getPodsWithLabel(oc, label) - Get pods by label selector
      • triggerWMCOUpgrade(oc, version) - Trigger WMCO upgrade (if using Option A)
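As a starting point for getCordonedWindowsNodes, the filtering logic could be written against node-list JSON, for example the output of a command such as `oc get nodes -o json`. The sketch below uses only the standard library and assumes the standard NodeList fields `metadata.labels`, `metadata.name` and `spec.unschedulable`; the `cordonedWindowsNodes` name and the sample JSON are hypothetical, and wiring the function to the `oc` client is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeList captures only the NodeList fields needed for this check.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name   string            `json:"name"`
			Labels map[string]string `json:"labels"`
		} `json:"metadata"`
		Spec struct {
			Unschedulable bool `json:"unschedulable"`
		} `json:"spec"`
	} `json:"items"`
}

// cordonedWindowsNodes parses NodeList JSON and returns the names of Windows
// nodes that are marked unschedulable (cordoned).
func cordonedWindowsNodes(raw []byte) ([]string, error) {
	var list nodeList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("parsing node list: %w", err)
	}
	var names []string
	for _, n := range list.Items {
		if n.Metadata.Labels["kubernetes.io/os"] == "windows" && n.Spec.Unschedulable {
			names = append(names, n.Metadata.Name)
		}
	}
	return names, nil
}

func main() {
	// Hand-written sample input, trimmed to the relevant fields.
	raw := []byte(`{"items":[
	  {"metadata":{"name":"win-a","labels":{"kubernetes.io/os":"windows"}},"spec":{"unschedulable":true}},
	  {"metadata":{"name":"win-b","labels":{"kubernetes.io/os":"windows"}},"spec":{}},
	  {"metadata":{"name":"linux-a","labels":{"kubernetes.io/os":"linux"}},"spec":{"unschedulable":true}}
	]}`)

	names, err := cordonedWindowsNodes(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cordoned Windows nodes:", names)
}
```

Keeping the parsing separate from the cluster call makes the helper easy to unit test with canned JSON, independent of a live cluster.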

      Platforms to Test

      • AWS IPI
      • Azure IPI
      • GCP IPI
      • vSphere (optional)

      Acceptance Criteria

      • [ ] Create Polarion test case (OCP-XXXXX)
      • [ ] Implement test automation in test/extended/winc/winc.go
      • [ ] Add helper functions to test/extended/winc/utils.go if needed
      • [ ] Test validates all key assertions
      • [ ] Test covers rolling drain behavior
      • [ ] CI passes on AWS, Azure, GCP
      • [ ] Test merged to master
      • [ ] Polarion test case marked as automated
      • [ ] OCPQE-18994 marked as complete

      Related Issues

      • OCPBUGS-27300: Node drain does not work correctly with local-data pods (Closed - Fixed in 4.15.0)
      • OCPQE-18994: QE test coverage task (To Do)
      • OCPBUGS-22711: Backport to 4.14 (Closed)
      • OCPBUGS-18334: Duplicate issue (Closed)

              rrasouli Aharon Rasouli