
Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.13.z
Affects Version/s: 4.12.z
Component/s: Windows Containers
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
0
Severity:
None
Regression:
No

Target Backport Versions:

4.13.z, 4.12.z, 4.14.z, 4.15.z
Target Version:

4.13.z
Release Blocker:
None
Sprint:
WINC - Sprint 248, WINC - Sprint 249
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

Under certain conditions and after more than an hour of runtime, workloads on Windows Server 2019 could experience packet loss when communicating with other containers in the cluster. The cause was determined to be a routing issue in Windows Server 2019. The fix enables direct server return (DSR) routing within kube-proxy. DSR is a form of asymmetric network load distribution in which request and response traffic take different network paths, which circumvents the bug in Windows Server 2019. DSR routing has better community test coverage and will be used going forward.
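
For illustration, DSR in upstream kube-proxy on Windows is controlled by the `WinDSR` feature gate together with the `enableDSR` setting of the `winkernel` section. The fragment below is a minimal sketch of a `KubeProxyConfiguration` with DSR turned on. It is an assumption about what an equivalent configuration looks like, not the change made in the linked pull request, and it omits every other setting a real Windows node needs (network name, cluster CIDR, and so on).

```yaml
# Sketch only: upstream kube-proxy settings that turn on DSR for Windows.
# Not taken from the WMCO change; other required settings are omitted.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: kernelspace      # Windows kernel-space proxier
featureGates:
  WinDSR: true         # DSR feature gate; must be enabled explicitly where it defaults to off
winkernel:
  enableDSR: true      # apply DSR policies to service endpoints
```

The same settings can be passed on the kube-proxy command line as `--feature-gates=WinDSR=true` and `--enable-dsr=true`.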


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue ~~OCPBUGS-28226~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-27503~~. The following is the description of the original issue:

This is a clone of issue ~~OCPBUGS-26761~~. The following is the description of the original issue:
—
Description of problem:

Workloads on Windows Server 2019 are experiencing packet loss. When a pod health check tries to reach a Windows webserver pod IP, the destination IP of the reply packets is changed to a different IP (a pod IP), so the packets never reach the host layer.

SourceVIP on the HNS Load Balancer is being set to the host IP. This should not be the case.

Version-Release number of selected component (if applicable):

OCP 4.12

How reproducible:

Occurs only on Windows 2019 after several hours of runtime.

Steps to Reproduce:

    1. Add a Windows 2019 node to OCP cluster
    2. Deploy 5-10 Windows pods behind a cluster IP

kind: Deployment
apiVersion: apps/v1
metadata:
  name: win-webserverlog-2019
  labels:
    app: win-webserver-log-2019
spec:
  replicas: 5
  selector:
    matchLabels:
      app: win-webserver-log-2019
  template:
    metadata:
      name: win-webserverlog-2019
      labels:
        app: win-webserver-log-2019
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      restartPolicy: Always
      runtimeClassName: windows-2019
      containers:
        - resources: {}
          readinessProbe:
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 2
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
          terminationMessagePath: /dev/termination-log
          name: windowswebserverlog
          livenessProbe:
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 2
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: foo
              mountPath: 'C:\Temp\pod'
          terminationMessagePolicy: File
          image: 'mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019'
      volumes:
        - name: foo
          hostPath:
            path: 'C:\Temp\pod'
            type: ''
      dnsPolicy: ClusterFirst
      tolerations:
        - key: os
          value: Windows
---
kind: Service
apiVersion: v1
metadata:
  name: windows-service-2019
spec:
  ipFamilies:
    - IPv4
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  internalTrafficPolicy: Cluster
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: win-webserver-log-2019
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: windows-2019  # must match runtimeClassName in the Deployment above
handler: 'runhcs-wcow-process'
scheduling:
  nodeSelector:
    kubernetes.io/os: 'windows'
    kubernetes.io/arch: 'amd64'
    node.kubernetes.io/windows-build: '10.0.17763'  # Windows Server 2019 build
  tolerations:
  - effect: NoSchedule
    key: os
    operator: Equal
    value: "windows"

    3. With no changes, the pods restart on their own within 1-2 hours due to health probe timeouts
    4. The more pods there are, the greater the chance that pods restart quickly

Actual results:

All pods, or a large share of them, restart on a node due to kubelet health check timeouts.

Expected results:

Health check packets from the kubelet should not be dropped, and pods should not restart unnecessarily.

Additional info:

blocks

OCPBUGS-28254 Packet loss on Windows 2019 nodes

Closed

clones

OCPBUGS-28226 Packet loss on Windows 2019 nodes

Closed

is blocked by

OCPBUGS-28226 Packet loss on Windows 2019 nodes

Closed

is cloned by

OCPBUGS-28254 Packet loss on Windows 2019 nodes

Closed

links to

openshift/windows-machine-config-operator#2042: [release-4.13] OCPBUGS-28253: Use DSR load balancing in kube-proxy

RHSA-2023:125566 Red Hat OpenShift for Windows Containers 8.1.2 security update

mentioned on

Merge request - Updated US source to: ff5cd4c Merge pull request #2042 from sebsoto/dsrCherrypick413
