OpenShift Bugs / OCPBUGS-27503

Packet loss on Windows 2019 nodes

    • Bug
    • Resolution: Done-Errata
    • Critical
    • 4.15.0
    • 4.12.z
    • Windows Containers
    • None
    • No
    • 0
    • WINC - Sprint 248
    • 1
    • False
    • None
    • Under certain conditions, and after more than an hour of runtime, workloads on Windows Server 2019 could experience packet loss when communicating with other containers in the cluster. The cause was determined to be routing issues in Windows Server 2019. The fix enables direct server return (DSR) routing in kube-proxy. DSR is a form of asymmetric network load distribution in which request and response traffic take different network paths, which circumvents the bug in Windows Server 2019. DSR routing also has better community testing and will be used going forward.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-26761. The following is the description of the original issue:

      Description of problem:

      Workloads on Windows Server 2019 are experiencing packet loss. When a pod health check tries to reach a Windows web server pod IP, the destination IP of the reply packets is changed to some other IP (a pod IP), so the packets never reach the host layer.
      
      SourceVIP on the HNS Load Balancer is being set to the host IP. This should not be the case.
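
To make the address rewriting easier to follow, here is a minimal sketch of the general difference between a proxied return path and a direct server return (DSR) path. It is a toy model of load balancing in general, not of HNS or kube-proxy internals, and every address and function name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


# Hypothetical addresses, chosen only for illustration.
CLIENT = "10.132.0.5"        # pod or host that opens the connection
SERVICE_VIP = "172.30.0.10"  # cluster IP of the service
SOURCE_VIP = "10.132.1.2"    # address the load balancer uses as source when proxying
BACKEND = "10.132.1.7"       # Windows pod behind the service


def forward_proxied(request: Packet) -> Packet:
    """Non-DSR: both destination and source of the request are rewritten."""
    return Packet(src=SOURCE_VIP, dst=BACKEND)


def reply_proxied(forwarded: Packet) -> list:
    """The reply returns to the load balancer, which must translate it back."""
    to_load_balancer = Packet(src=forwarded.dst, dst=forwarded.src)
    to_client = Packet(src=SERVICE_VIP, dst=CLIENT)
    return [to_load_balancer, to_client]


def forward_dsr(request: Packet) -> Packet:
    """DSR: only the destination is rewritten; the client source is preserved."""
    return Packet(src=request.src, dst=BACKEND)


def reply_dsr(forwarded: Packet) -> list:
    """The reply goes straight to the preserved client address."""
    return [Packet(src=SERVICE_VIP, dst=forwarded.src)]


request = Packet(src=CLIENT, dst=SERVICE_VIP)
proxied_hops = reply_proxied(forward_proxied(request))
dsr_hops = reply_dsr(forward_dsr(request))
print(len(proxied_hops), len(dsr_hops))
```

In this model the proxied reply needs a reverse-translation hop at the load balancer before it reaches the client, while the DSR reply does not. This is only meant to show why request and response can take different paths under DSR.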
          

      Version-Release number of selected component (if applicable):

      OCP 4.12
          

      How reproducible:

      Occurs only on Windows 2019 after several hours of runtime.
          

      Steps to Reproduce:

          1. Add a Windows 2019 node to OCP cluster
          2. Deploy 5-10 Windows pods behind a cluster IP
      
      kind: Deployment
      apiVersion: apps/v1
      metadata:
        name: win-webserverlog-2019
        labels:
          app: win-webserver-log-2019
      spec:
        replicas: 5
        selector:
          matchLabels:
            app: win-webserver-log-2019
        template:
          metadata:
            name: win-webserverlog-2019
            labels:
              app: win-webserver-log-2019
          spec:
            nodeSelector:
              kubernetes.io/os: windows
            restartPolicy: Always
            runtimeClassName: windows-2019
            containers:
              - resources: {}
                readinessProbe:
                  httpGet:
                    path: /
                    port: 80
                    scheme: HTTP
                  initialDelaySeconds: 20
                  timeoutSeconds: 2
                  periodSeconds: 10
                  successThreshold: 1
                  failureThreshold: 10
                terminationMessagePath: /dev/termination-log
                name: windowswebserverlog
                livenessProbe:
                  httpGet:
                    path: /
                    port: 80
                    scheme: HTTP
                  initialDelaySeconds: 10
                  timeoutSeconds: 2
                  periodSeconds: 10
                  successThreshold: 1
                  failureThreshold: 10
                imagePullPolicy: IfNotPresent
                volumeMounts:
                  - name: foo
                    mountPath: 'C:\Temp\pod'
                terminationMessagePolicy: File
                image: 'mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019'
            volumes:
              - name: foo
                hostPath:
                  path: 'C:\Temp\pod'
                  type: ''
            dnsPolicy: ClusterFirst
            tolerations:
              - key: os
                value: Windows
      ---
      kind: Service
      apiVersion: v1
      metadata:
        name: windows-service-2019
      spec:
        ipFamilies:
          - IPv4
        ports:
          - protocol: TCP
            port: 8080
            targetPort: 80
        internalTrafficPolicy: Cluster
        type: ClusterIP
        ipFamilyPolicy: SingleStack
        sessionAffinity: None
        selector:
          app: win-webserver-log-2019
      ---
      apiVersion: node.k8s.io/v1
      kind: RuntimeClass
      metadata:
        name: windows-2019
      handler: 'runhcs-wcow-process'
      scheduling:
        nodeSelector:
          kubernetes.io/os: 'windows'
          kubernetes.io/arch: 'amd64'
          node.kubernetes.io/windows-build: '10.0.17763'
        tolerations:
        - effect: NoSchedule
          key: os
          operator: Equal
          value: "Windows"
      
          3. With no further changes, the pods restart on their own within 1-2 hours due to health probe timeouts
          4. The more pods there are, the greater the chance of pods restarting quickly
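
For context on how long probes must keep failing before a restart, here is a rough calculation based on the liveness probe values in the Deployment above. It assumes probes start every `periodSeconds` and that each failure is only known once the probe times out. It ignores kubelet scheduling jitter, and the helper name is hypothetical.

```python
def seconds_until_restart(period_seconds: int, failure_threshold: int, timeout_seconds: int) -> int:
    """Approximate time from the first failing probe to the restart decision.

    The last of `failure_threshold` consecutive probes starts
    (failure_threshold - 1) periods after the first one and is counted
    as failed once it times out.
    """
    return (failure_threshold - 1) * period_seconds + timeout_seconds


# Values copied from the livenessProbe in the reproducer Deployment.
liveness = {"period_seconds": 10, "failure_threshold": 10, "timeout_seconds": 2}
print(seconds_until_restart(**liveness))  # 92
```

Under these assumptions, a container is restarted only after roughly a minute and a half of uninterrupted probe failures, so the restarts point to sustained packet loss rather than a single dropped probe.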
          

      Actual results:

      All of the pods, or a large number of them, restart on a node due to Kubernetes health check timeouts
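
One way to quantify the restarts is to sum `restartCount` per pod from the JSON that `oc get pods -l app=win-webserver-log-2019 -o json` returns. The sketch below uses a hand-written sample in place of real cluster output; the pod names and counts in it are made up for illustration.

```python
def restart_counts(pod_list: dict) -> dict:
    """Map each pod name to the total restartCount of its containers."""
    counts = {}
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        # containerStatuses can be absent while a pod is still pending.
        statuses = pod.get("status", {}).get("containerStatuses", [])
        counts[name] = sum(s.get("restartCount", 0) for s in statuses)
    return counts


# Hypothetical sample shaped like a Kubernetes PodList; not real output.
sample = {
    "items": [
        {
            "metadata": {"name": "win-webserverlog-2019-aaaaa"},
            "status": {"containerStatuses": [{"name": "windowswebserverlog", "restartCount": 3}]},
        },
        {
            "metadata": {"name": "win-webserverlog-2019-bbbbb"},
            "status": {},
        },
    ]
}
print(restart_counts(sample))
```

Against a real cluster, the same function could be fed `json.load(sys.stdin)` with the `oc` output piped in.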

      Expected results:

      Health check packets from the kubelet should not be dropped, and should not cause unnecessary pod restarts

      Additional info:

          

            rh-ee-ssoto Sebastian Soto
            openshift-crt-jira-prow OpenShift Prow Bot
            Aharon Rasouli Aharon Rasouli