-
Bug
-
Resolution: Done
-
Major
-
4.11.z
-
None
-
Quality / Stability / Reliability
-
False
-
-
3
-
Rejected
-
WINC - Sprint 228
-
1
Description of problem:
A workload's load balancer with an external IP shows a connectivity outage during a Windows node upgrade when the workload uses the windows/servercore image. During reconciliation a new node is created, so when the next node is drained ahead of its own reconciliation, the freshly created node no longer has the container image cached. If pulling the image takes longer than reconciling a node, no workload pod is available to serve the load balancer's requests, and the service is disrupted.
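One way to observe the condition described above is to check which Windows nodes currently report the workload image in their node status. The following is a minimal sketch, not part of WMCO: the helper names `node_has_image` and `report_image_cache` are hypothetical, it assumes Windows nodes carry the standard `kubernetes.io/os=windows` label, and it assumes the kubelet lists the image in `.status.images` (the kubelet may truncate that list, so "missing" is only a hint).

```shell
#!/bin/bash
# node_has_image NODE PATTERN: succeed if NODE's status lists an image whose
# name contains PATTERN. Hypothetical helper, not part of WMCO.
node_has_image() {
    oc get node "$1" -o jsonpath='{.status.images[*].names[*]}' \
        | tr ' ' '\n' \
        | grep -q -- "$2"
}

# report_image_cache PATTERN: print "<node> cached" or "<node> missing"
# for every node labeled as a Windows node.
report_image_cache() {
    local node
    for node in $(oc get nodes -l kubernetes.io/os=windows -o name); do
        if node_has_image "${node#node/}" "$1"; then
            echo "${node#node/} cached"
        else
            echo "${node#node/} missing"
        fi
    done
}
```

Running `report_image_cache windows/servercore` while the upgrade is in progress should show whether a drained workload can only land on nodes that still need to pull the image.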
Version-Release number of selected component (if applicable):
4.11
How reproducible:
Sometimes
Steps to Reproduce:
Create a script that continuously queries the load balancer's external IP or DNS name:
```
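cat probeLB.sh
#!/bin/bash
# Query the load balancer every 2 seconds, printing a timestamp per attempt.
# Note: because of `set -e`, the loop exits on the first failed curl.
set -e
while true
do
    date
    echo "curl 52.189.34.88"
    curl 52.189.34.88
    echo ""
    sleep 2
done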
```
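Because the script above stops at the first failed request, it cannot show how long the outage lasts. The following is a hedged sketch of a variant that keeps probing and prints a timestamp only when the endpoint changes state. The function names `classify` and `probe_once` and the `LB_ENDPOINT` variable are assumptions introduced for illustration; they are not part of the original report.

```shell
#!/bin/bash
# Hypothetical variant of probeLB.sh: keeps probing through failures and
# prints the time of every state change (UP -> DOWN, DOWN -> UP).

# classify CODE: map curl's %{http_code} output to UP or DOWN.
# curl reports 000 when no HTTP response was received.
classify() {
    case "$1" in
        2??|3??) echo "UP" ;;
        *)       echo "DOWN" ;;
    esac
}

# probe_once URL: print UP or DOWN for a single request.
probe_once() {
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || true
    classify "$code"
}

# The loop runs only when LB_ENDPOINT is set, e.g. LB_ENDPOINT=52.189.34.88
if [ -n "${LB_ENDPOINT:-}" ]; then
    last=""
    while true; do
        state=$(probe_once "$LB_ENDPOINT")
        if [ "$state" != "$last" ]; then
            echo "$(date -u +%FT%TZ) $state"
            last=$state
        fi
        sleep 2
    done
fi
```

Saved under a name of your choice and started with `LB_ENDPOINT=52.189.34.88`, the gap between a `DOWN` line and the next `UP` line approximates the length of the disruption, to within the 2-second probe interval plus the 5-second request timeout.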
1. In an OCP cluster, deploy WMCO 6.0
2. Create a Windows MachineSet with 3 replicas
3. Wait for WMCO to configure the Windows nodes
4. Deploy win-server workloads with at least 3 replicas
5. Deploy load balancer
NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)        AGE
service/win-webserver   LoadBalancer   172.30.105.53   52.189.34.88   80:30648/TCP   115m
6. Scale down WMCO deployment to 0
oc scale deployment.apps/windows-machine-config-operator --replicas=0 -n openshift-windows-machine-config-operator
7. Trigger the Windows node upgrade by changing the version annotation on all Windows nodes:
oc annotate node <windows-node-1> --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
oc annotate node <windows-node-2> --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
oc annotate node <windows-node-3> --overwrite windowsmachineconfig.openshift.io/version=invalidVersion
8. In a separate terminal, trigger the script to query a load balancer endpoint (probeLB.sh)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    63  100    63    0     0    777      0 --:--:-- --:--:-- --:--:--   777
9. Scale up WMCO deployment to 1
oc scale deployment.apps/windows-machine-config-operator --replicas=1 -n openshift-windows-machine-config-operator
10. Watch the script for the load balancer endpoint
100    63  100    63    0     0    741      0 --:--:-- --:--:-- --:--:--   741
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:12 --:--:--     0
curl: (7) Failed to connect to 52.189.34.88 port 80: Connection refused
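Step 7 above repeats the same `oc annotate` command for each node. A minimal sketch of the same step as a loop follows; the function name `annotate_windows_nodes` is hypothetical, and selecting nodes by the standard `kubernetes.io/os=windows` label is an assumption that holds only if no other Windows nodes should be left untouched.

```shell
#!/bin/bash
# annotate_windows_nodes VERSION: overwrite the WMCO version annotation on
# every node labeled kubernetes.io/os=windows. Hypothetical helper.
annotate_windows_nodes() {
    local node
    for node in $(oc get nodes -l kubernetes.io/os=windows -o name); do
        oc annotate "$node" --overwrite \
            "windowsmachineconfig.openshift.io/version=$1"
    done
}
```

Calling `annotate_windows_nodes invalidVersion` while the WMCO deployment is scaled to 0 reproduces step 7 without listing node names by hand.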
Actual results:
Load balancer connectivity is lost while the Windows nodes are in Ready state. The load balancer starts responding again after some time.
Expected results:
Windows workloads keep running on the available Windows nodes without any service disruption.
Additional info:
Follow-up to https://bugzilla.redhat.com/show_bug.cgi?id=2103631
- blocks: OCPBUGS-4092 Load balancer shows connectivity outage during Windows nodes upgrade (Closed)
- is cloned by: OCPBUGS-4092 Load balancer shows connectivity outage during Windows nodes upgrade (Closed)
- is duplicated by: WINC-929 Machine Nodes follow same upgrade path as BYOH nodes (Closed)
- links to