  1. OpenShift Bugs
  2. OCPBUGS-4133

LoadBalancer service with externalTrafficPolicy="Cluster" for Windows workloads is intermittently unavailable on GCP and Azure


Details

    • Bug
    • Resolution: Done
    • Critical
    • 4.13.0
    • 4.12
    • Windows Containers
    • None
    • 3
    • WINC - Sprint 232
    • 1
    • Rejected
    • False

    Description

      Description of problem:

      When creating services in an OVN-HybridOverlay cluster with Windows workers, the external IP is intermittently unreachable whenever the exposed deployment has more than one pod:
      
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 
      NAME            TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
      win-webserver   LoadBalancer   172.30.38.192   34.136.170.199   80:30246/TCP   41m
      
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get deploy -n winc-38186
      NAME            READY   UP-TO-DATE   AVAILABLE   AGE
      win-webserver   6/6     6            6           42m
      
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get pods -n winc-38186 
      NAME                             READY   STATUS    RESTARTS   AGE
      win-webserver-597fb4c9cc-8ccwg   1/1     Running   0          6s
      win-webserver-597fb4c9cc-f54x5   1/1     Running   0          6s
      win-webserver-597fb4c9cc-jppxb   1/1     Running   0          97s
      win-webserver-597fb4c9cc-twn9b   1/1     Running   0          6s
      win-webserver-597fb4c9cc-x5rfr   1/1     Running   0          6s
      win-webserver-597fb4c9cc-z8sfv   1/1     Running   0          6s
      
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ curl 34.136.170.199
      curl: (7) Failed to connect to 34.136.170.199 port 80: Connection timed out
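
To put a number on the intermittent failures seen in the transcript above, the manual curl calls can be wrapped in a small loop. This is a minimal sketch, not part of the original report: the function name `probe_lb`, the default of 20 attempts, and the 5-second timeout are all assumptions chosen for illustration.

```shell
# Hypothetical helper: probe an endpoint repeatedly and report how many
# requests failed. Usage: probe_lb <ip-or-url> [attempts] [timeout-seconds]
probe_lb() {
  target=$1
  attempts=${2:-20}   # assumed default, not from the report
  timeout=${3:-5}     # assumed default, not from the report
  failed=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --max-time bounds each request, so an unanswered connection is
    # counted as a failure after $timeout seconds.
    if ! curl --silent --output /dev/null --max-time "$timeout" "$target"; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "failed=$failed total=$attempts"
}
```

For example, `probe_lb 34.136.170.199 50` would issue 50 requests against the external IP shown above and print a single summary line. Note that curl only returns a non-zero status here for transport-level errors such as timeouts, which is the failure mode reported in this bug.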
      
      Looking at the LoadBalancer service, we can see that externalTrafficPolicy is set to "Cluster":
      
      [cloud-user@preserve-jfrancoa openshift-tests-private]$ oc get svc -n winc-38186 win-webserver -o yaml
      apiVersion: v1
      kind: Service
      metadata:
        creationTimestamp: "2022-11-25T13:29:00Z"
        finalizers:
        - service.kubernetes.io/load-balancer-cleanup
        labels:
          app: win-webserver
        name: win-webserver
        namespace: winc-38186
        resourceVersion: "169364"
        uid: 4a229123-ee88-47b6-99ce-814522803ad8
      spec:
        allocateLoadBalancerNodePorts: true
        clusterIP: 172.30.38.192
        clusterIPs:
        - 172.30.38.192
        externalTrafficPolicy: Cluster
        internalTrafficPolicy: Cluster
        ipFamilies:
        - IPv4
        ipFamilyPolicy: SingleStack
        ports:
        - nodePort: 30246
          port: 80
          protocol: TCP
          targetPort: 80
        selector:
          app: win-webserver
        sessionAffinity: None
        type: LoadBalancer
      status:
        loadBalancer:
          ingress:
          - ip: 34.136.170.199
      
      
      Recreating the Service with externalTrafficPolicy set to Local seems to solve the issue:

      $ oc describe svc win-webserver -n winc-38186
      Name:                     win-webserver
      Namespace:                winc-38186
      Labels:                   app=win-webserver
      Annotations:              <none>
      Selector:                 app=win-webserver
      Type:                     LoadBalancer
      IP Family Policy:         SingleStack
      IP Families:              IPv4
      IP:                       172.30.38.192
      IPs:                      172.30.38.192
      LoadBalancer Ingress:     34.136.170.199
      Port:                     <unset>  80/TCP
      TargetPort:               80/TCP
      NodePort:                 <unset>  30246/TCP
      Endpoints:                10.132.0.18:80,10.132.0.19:80,10.132.0.20:80 + 3 more...
      Session Affinity:         None
      External Traffic Policy:  Cluster
      Events:
        Type    Reason                 Age                 From                Message
        ----    ------                 ----                ----                -------
        Normal  ExternalTrafficPolicy  66m                 service-controller  Cluster -> Local
        Normal  EnsuringLoadBalancer   63m (x3 over 113m)  service-controller  Ensuring load balancer
        Normal  ExternalTrafficPolicy  63m                 service-controller  Local -> Cluster
        Normal  EnsuredLoadBalancer    62m (x3 over 113m)  service-controller  Ensured load balancer 
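
The report changes the policy by recreating the Service. As an alternative, the field can likely be inspected and switched in place with `oc get` and `oc patch`. The sketch below is an assumption about how one might do that, not the procedure used in this report; the function names are hypothetical.

```shell
# Hypothetical helper: print the current externalTrafficPolicy of a Service.
# Usage: get_traffic_policy <service> <namespace>
get_traffic_policy() {
  oc get svc "$1" -n "$2" -o jsonpath='{.spec.externalTrafficPolicy}'
}

# Hypothetical helper: switch a Service to externalTrafficPolicy=Local
# in place (assumed workaround; the report recreated the Service instead).
# Usage: set_traffic_policy_local <service> <namespace>
set_traffic_policy_local() {
  oc patch svc "$1" -n "$2" --type merge \
    -p '{"spec":{"externalTrafficPolicy":"Local"}}'
}
```

For example, `get_traffic_policy win-webserver winc-38186` should print `Cluster` for the failing Service. With `Local`, external traffic is only delivered to endpoints on the node that received it, so the trade-off is potentially uneven load distribution across pods.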
      
      $ oc get svc -n winc-test
      NAME              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)          AGE
      linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87   8080:30715/TCP   152m
      win-check         LoadBalancer   172.30.50.151   35.194.12.34   80:31725/TCP     4m33s
      win-webserver     LoadBalancer   172.30.15.95    35.226.129.1   80:30409/TCP     152m
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      <html><body><H1>Windows Container Web Server</H1></body></html>
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.194.12.34
      
      Meanwhile, the other service, which still has externalTrafficPolicy set to "Cluster", keeps failing:
      
      [cloud-user@preserve-jfrancoa tmp]$ curl 35.226.129.1
      curl: (7) Failed to connect to 35.226.129.1 port 80: Connection timed out
      

       

      Version-Release number of selected component (if applicable):

      $ oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-0.nightly-2022-11-24-203151   True        False         7h2m    Cluster version is 4.12.0-0.nightly-2022-11-24-203151
      
      
      $ oc get network cluster -o yaml
      apiVersion: config.openshift.io/v1
      kind: Network
      metadata:
        creationTimestamp: "2022-11-25T06:56:50Z"
        generation: 2
        name: cluster
        resourceVersion: "2952"
        uid: e9ad729c-36a4-4e71-9a24-740352b11234
      spec:
        clusterNetwork:
        - cidr: 10.128.0.0/14
          hostPrefix: 23
        externalIP:
          policy: {}
        networkType: OVNKubernetes
        serviceNetwork:
        - 172.30.0.0/16
      status:
        clusterNetwork:
        - cidr: 10.128.0.0/14
          hostPrefix: 23
        clusterNetworkMTU: 1360
        networkType: OVNKubernetes
        serviceNetwork:
        - 172.30.0.0/16
      

      How reproducible:

      Always. Sometimes it takes more curl calls to the external IP, but a request always ends up timing out eventually.
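
Since the number of calls needed before a timeout varies between runs, it may help to record how many requests succeed before the first failure. The following is a minimal sketch under assumed defaults (a cap of 100 attempts and a 5-second per-request timeout); the function name `first_failure` is hypothetical.

```shell
# Hypothetical helper: send requests until one fails and print the
# 1-based index of the first failed request, or "none" if every
# request up to the cap succeeded.
# Usage: first_failure <ip-or-url> [max-attempts]
first_failure() {
  target=$1
  max=${2:-100}   # assumed cap, not from the report
  n=1
  while [ "$n" -le "$max" ]; do
    # A request that gets no answer within 5 seconds counts as the failure.
    if ! curl --silent --output /dev/null --max-time 5 "$target"; then
      echo "$n"
      return 0
    fi
    n=$((n + 1))
  done
  echo "none"
}
```

Running it several times against the same external IP gives a rough picture of how quickly the problem shows up on a given cluster.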

      Steps to Reproduce:

      1. Deploy a cluster with Windows workers and OVN hybrid overlay on GCP. The following Jenkins job can be used for it: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/158926/
      2. Create a deployment and a service, for example:
      apiVersion: v1
      kind: Service
      metadata:
        labels:
          app: win-check
        name: win-check
        namespace: winc-test
      spec:
        #externalTrafficPolicy: Local
        ports:
        - port: 80
          targetPort: 80
        selector:
          app: win-check
        type: LoadBalancer
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels:
          app: win-check
        name: win-check
        namespace: winc-test
      spec:
        replicas: 6
        selector:
          matchLabels:
            app: win-check
        template:
          metadata:
            labels:
              app: win-check
            name: win-check
          spec:
            containers:
            - command:
              - pwsh.exe
              - -command
              - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/');
                $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening)
                { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows
                Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content);
                $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer,
                0, $buffer.Length); $response.Close(); };
              image: mcr.microsoft.com/powershell:lts-nanoserver-ltsc2022
              name: win-check
              securityContext:
                runAsNonRoot: false
                windowsOptions:
                  runAsUserName: ContainerAdministrator
            nodeSelector:
              kubernetes.io/os: windows
            tolerations:
            - key: os
              value: Windows
      3. Get the external IP for the service:
      $ oc get svc -n winc-test                                                   
      NAME              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE                                            
      linux-webserver   LoadBalancer   172.30.175.95   34.136.11.87     8080:30715/TCP   94m                                            
      win-check         LoadBalancer   172.30.82.251   35.239.175.209   80:30530/TCP     29s                                            
      win-webserver     LoadBalancer   172.30.15.95    35.226.129.1     80:30409/TCP     94m
      
      4. Try to curl the external IP:
      $ curl 35.239.175.209
      curl: (7) Failed to connect to 35.239.175.209 port 80: Connection timed out
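
Steps 3 and 4 above can be scripted so that the external IP is read from the Service status instead of being copied by hand. This is a sketch only: the function names, the default of 30 polls, and the `LB_POLL_INTERVAL` variable are assumptions introduced for illustration.

```shell
# Hypothetical helper: print the first load balancer ingress IP of a
# Service (prints nothing if no IP has been assigned yet).
# Usage: lb_ip <service> <namespace>
lb_ip() {
  oc get svc "$1" -n "$2" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Hypothetical helper: poll until the Service has an external IP, then
# print it. Returns non-zero if no IP appears within the given tries.
# Usage: wait_for_lb_ip <service> <namespace> [tries]
wait_for_lb_ip() {
  tries=${3:-30}   # assumed default, not from the report
  while [ "$tries" -gt 0 ]; do
    ip=$(lb_ip "$1" "$2")
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    tries=$((tries - 1))
    # LB_POLL_INTERVAL is an assumed knob; defaults to 10 seconds.
    sleep "${LB_POLL_INTERVAL:-10}"
  done
  return 1
}
```

For example, `curl --max-time 10 "$(wait_for_lb_ip win-check winc-test)"` would wait for the address and then issue the same request as step 4, failing after 10 seconds instead of waiting for the default connection timeout.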
      

       

      Actual results:

      The load balancer IP is intermittently unreachable, which impacts the availability of the service.

      Expected results:

      The Load Balancer IP is available at all times

      Additional info:

       


            People

              rh-ee-mankulka Mansi Kulkarni
              rhn-engineering-jfrancoa Jose Luis Franco Arza (Inactive)
              Anurag Saxena Anurag Saxena