Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.15.0
Affects Version/s: 4.12.z
Component/s: Windows Containers
Labels:
None

Regression:
No
Story Points:
0
Sprint:
WINC - Sprint 248
sprint_count:
1
Blocked:
False
Blocked Reason:

None
Release Note Text:

Under certain conditions, after more than an hour of runtime, workloads on Windows Server 2019 could experience packet loss when communicating with other containers in the cluster. The cause was determined to be a routing issue in Windows Server 2019. The fix enables direct server return (DSR) routing in kube-proxy. DSR is a form of asymmetric network load distribution in which request and response traffic take different network paths, which circumvents the bug in Windows Server 2019. DSR routing has better community test coverage and will be used going forward.
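
For illustration only, DSR can be enabled in upstream kube-proxy on Windows with a configuration fragment like the one below. This is a minimal sketch that assumes the upstream `KubeProxyConfiguration` file format. It is not taken from the WMCO change linked in this issue, which may set the equivalent options differently (for example through command-line flags). Depending on the Kubernetes version, the `WinDSR` feature gate may already be on by default.

```yaml
# Hypothetical kube-proxy configuration for a Windows node (sketch, not the WMCO fix itself)
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: kernelspace        # Windows kernel-space (HNS-based) proxy mode
featureGates:
  WinDSR: true           # feature gate guarding DSR support on Windows
winkernel:
  enableDSR: true        # request direct server return for HNS load balancers
```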
Release Note Type:
Bug Fix
Release Note Status:
In Progress
Target Version:

4.15.0
Target Backport Versions:

4.13.z, 4.12.z, 4.14.z, 4.15.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-26761~~. The following is the description of the original issue:
—
Description of problem:

Workloads on Windows Server 2019 are experiencing packet loss. When a pod health check tries to reach a Windows webserver pod IP, the destination IP of the reply packets is changed to some other IP (a pod IP), so the packets never reach the host layer.

SourceVIP on the HNS Load Balancer is being set to the host IP. This should not be the case.

Version-Release number of selected component (if applicable):

OCP 4.12

How reproducible:

Occurs only on Windows 2019 after several hours of runtime.

Steps to Reproduce:

    1. Add a Windows 2019 node to the OCP cluster
    2. Deploy 5-10 Windows pods behind a cluster IP

kind: Deployment
apiVersion: apps/v1
metadata:
  name: win-webserverlog-2019
  labels:
    app: win-webserver-log-2019
spec:
  replicas: 5
  selector:
    matchLabels:
      app: win-webserver-log-2019
  template:
    metadata:
      name: win-webserverlog-2019
      labels:
        app: win-webserver-log-2019
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      restartPolicy: Always
      runtimeClassName: windows-2019
      containers:
        - resources: {}
          readinessProbe:
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 20
            timeoutSeconds: 2
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
          terminationMessagePath: /dev/termination-log
          name: windowswebserverlog
          livenessProbe:
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 2
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 10
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: foo
              mountPath: 'C:\Temp\pod'
          terminationMessagePolicy: File
          image: 'mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019'
      volumes:
        - name: foo
          hostPath:
            path: 'C:\Temp\pod'
            type: ''
      dnsPolicy: ClusterFirst
      tolerations:
        - key: os
          value: Windows
---
kind: Service
apiVersion: v1
metadata:
  name: windows-service-2019
spec:
  ipFamilies:
    - IPv4
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
  internalTrafficPolicy: Cluster
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: win-webserver-log-2019
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: windows-2019
handler: 'runhcs-wcow-process'
scheduling:
  nodeSelector:
    kubernetes.io/os: 'windows'
    kubernetes.io/arch: 'amd64'
    node.kubernetes.io/windows-build: '10.0.17763'
  tolerations:
  - effect: NoSchedule
    key: os
    operator: Equal
    value: "Windows"

    3. With no changes, pods restart on their own within 1-2 hours due to health probe timeouts
    4. The more pods there are, the greater the chance of pods restarting quickly

Actual results:

All pods, or a good number of pods, on a node restart due to Kube health check timeouts

Expected results:

Health check packets from the kubelet should not be dropped, and pods should not restart unnecessarily

Additional info:

blocks

OCPBUGS-28226 Packet loss on Windows 2019 nodes

Closed

clones

OCPBUGS-26761 Packet loss on Windows 2019 nodes

Closed

is blocked by

OCPBUGS-26761 Packet loss on Windows 2019 nodes

Closed

is cloned by

OCPBUGS-28226 Packet loss on Windows 2019 nodes

Closed

links to

openshift/windows-machine-config-operator#2024: [release-4.15] OCPBUGS-27503: Use DSR load balancing in kube-proxy

RHSA-2023:120235 Red Hat OpenShift for Windows Containers 10.15.0 security release

mentioned on

Merge request - Updated US source to: 230b9df Merge pull request #2038 from mtnbikenc/pr-times

Merge request - Updated US source to: 9581032 Merge pull request #2024 from openshift-cherrypick-robot/cherry-pick-2006-to-release-4.15


Assignee: Sebastian Soto

Reporter: OpenShift Prow Bot

QA Contact: Aharon Rasouli

Votes: 0

Watchers: 7

Created: 2024/01/22 4:50 PM

Updated: 2024/03/25 6:00 PM

Resolved: 2024/02/27 3:16 PM
