  1. OpenShift Bugs
  2. OCPBUGS-4092

Load balancer shows connectivity outage during Windows nodes upgrade


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 4.12.0
    • 4.11.z
    • Windows Containers
    • None
    • None
    • 0
    • WINC - Sprint 228
    • 1
    • Rejected
    • False

      This is a clone of issue OCPBUGS-3506. The following is the description of the original issue:

      Description of problem:

      A workload's load balancer with an external IP shows a connectivity outage during a Windows node upgrade when the windows/servercore image is used. Reconciliation creates a new node, so by the time another node is drained ahead of its own reconciliation, the new node no longer holds the container image. If downloading the image takes longer than reconciling the node, no workload is available to handle the load balancer's requests, which results in a service disruption.

      Version-Release number of selected component (if applicable):

      4.11

      How reproducible:

      Sometimes

      Steps to Reproduce:

      Create a script that continuously queries the load balancer endpoint by external IP or DNS name:
      ```
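      cat probeLB.sh
      #!/bin/bash
      # Poll the load balancer every 2 seconds and print a timestamp per probe.
      # No `set -e` here: a failed curl must not end the loop, because failed
      # requests are exactly what this script is meant to record.
      LB_ADDR="52.189.34.88"   # external IP or DNS name of the LoadBalancer service
      while true
      do
          date
          echo "curl ${LB_ADDR}"
          curl "${LB_ADDR}" || echo "curl exited with status $?"
          echo ""
          sleep 2
      done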
      ```
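
      The loop above prints raw curl output, which makes it hard to tell afterwards exactly when the endpoint stopped answering. A minimal sketch of a per-probe helper follows, assuming a one-line `<epoch-seconds> <http_code>` log format is acceptable. The name `probe_once` and the 5-second timeout are illustrative choices, not part of the original report:

      ```shell
      #!/bin/bash
      # probe_once URL
      # Hypothetical helper: prints "<epoch-seconds> <http_code>" for one request.
      # curl writes 000 as the http_code when no HTTP response was received
      # (for example, on "connection refused").
      probe_once() {
          local url="$1" code
          # "|| true" keeps a failed request from aborting the caller.
          code=$(curl --silent --output /dev/null --max-time 5 \
                      --write-out '%{http_code}' "$url") || true
          printf '%s %s\n' "$(date +%s)" "${code:-000}"
      }
      ```

      It could stand in for the plain `curl` call inside the loop, for example `probe_once 52.189.34.88 >> probe.log`.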
      
      1. In an OCP cluster, deploy WMCO 6.0
      2. Create a Windows MachineSet with 3 replicas
      3. Wait for WMCO to configure the Windows nodes
      4. Deploy win-server workloads with at least 3 replicas
      5. Deploy a load balancer
      NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)          AGE
      service/win-webserver     LoadBalancer   172.30.105.53   52.189.34.88    80:30648/TCP     115m
      
      6. Scale down WMCO deployment to 0
      oc scale deployment.apps/windows-machine-config-operator --replicas=0 -n openshift-windows-machine-config-operator
      
      7. Trigger a Windows node upgrade by changing the version annotation on all Windows nodes.
      oc annotate node <windows-node-1> --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      oc annotate node <windows-node-2> --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      oc annotate node <windows-node-3> --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
      
      8. In a separate terminal, run the script that queries the load balancer endpoint (probeLB.sh)
      100    63  100    63    0     0    777      0 --:--:-- --:--:-- --:--:--   777
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      
      9. Scale up WMCO deployment to 1
      oc scale deployment.apps/windows-machine-config-operator --replicas=1 -n openshift-windows-machine-config-operator
      
      
      10. Watch the script for the load balancer endpoint
      100    63  100    63    0     0    741      0 --:--:-- --:--:-- --:--:--   741
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
        0     0    0     0    0     0      0      0 --:--:--  0:00:12 --:--:--     0curl: (7) Failed to connect to 52.189.34.88 port 80: Connection refused
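
      Step 7 repeats the same `oc annotate` command once per node. A sketch that applies it to every Windows node in one pass is shown here, assuming the Windows nodes carry the well-known `kubernetes.io/os=windows` label. The function name is hypothetical; the annotation key and value are the ones used in step 7:

      ```shell
      #!/bin/bash
      # invalidate_windows_node_versions
      # Hypothetical helper for step 7: overwrites the WMCO version annotation
      # on every node selected by the kubernetes.io/os=windows label.
      invalidate_windows_node_versions() {
          local node
          # "-o name" prints one "node/<name>" entry per line.
          for node in $(oc get nodes -l kubernetes.io/os=windows -o name); do
              oc annotate "$node" --overwrite \
                  windowsmachineconfig.openshift.io/version=invalidVersion
          done
      }
      ```

      As in the original steps, it would be run while the WMCO deployment is scaled to 0, so that no reconciliation starts until step 9.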
      
      

      Actual results:

      Load balancer connectivity is lost while the Windows nodes are in the Ready state. The load balancer starts responding again after some time.
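
      To put a number on the outage, the probe results can be summarized after the fact. The sketch below assumes a hypothetical log with one `<epoch-seconds> <http_code>` line per probe, which is not the format probeLB.sh prints. It reports the longest stretch between the first failed probe and the next successful one:

      ```shell
      #!/bin/bash
      # longest_outage
      # Hypothetical helper: reads "<epoch-seconds> <http_code>" lines on stdin
      # and prints "<seconds> <start_epoch>" for the longest run of non-200
      # probes. Prints "0 0" when every probe returned 200.
      longest_outage() {
          awk '
              BEGIN { best = 0; at = 0; down = 0 }
              function close_run(end_ts) {
                  if (end_ts - start > best) { best = end_ts - start; at = start }
                  down = 0
              }
              $2 != "200" {
                  if (!down) { down = 1; start = $1 }
                  last = $1
                  next
              }
              down { close_run($1) }
              END {
                  # An outage still open at the end of the log is measured
                  # up to its last failed probe.
                  if (down) close_run(last)
                  print best, at
              }
          '
      }
      ```

      For example, `longest_outage < probe.log` would print the duration in seconds followed by the epoch time at which that outage began.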

      Expected results:

      The Windows workload keeps running on the available Windows nodes with no service disruption.

      Additional info:

      Follow-up to https://bugzilla.redhat.com/show_bug.cgi?id=2103631

        1. 35707_AWS_412.log
          178 kB
          Jose Luis Franco Arza

              rh-ee-ssoto Sebastian Soto
              openshift-crt-jira-prow OpenShift Prow Bot
              Aharon Rasouli Aharon Rasouli
              Votes: 0
              Watchers: 5
