  OpenShift Bugs / OCPBUGS-7359

[Azure] Replace master failed as new master did not add into lb backend


Details

    • Critical
    • CLOUD Sprint 232, CLOUD Sprint 233, CLOUD Sprint 234
    • 3
    • Approved
    • False
    • This is a regression in behaviour from 4.12
    • The Kubernetes 1.26 release introduced changes to the node infrastructure, such as removing an unhealthy node with a `NotReady` status from the public load balancer to prevent the node from receiving routing traffic. These changes can impact a {product-title} node that runs inside a cluster on Microsoft Azure, because removing a node from a public load balancer can result in the node losing internet connectivity. This issue might render the node unable to regain a `Ready` status and establish an outbound connection. The {product-title} {product-version} release fixes this issue so that a node marked with a `NotReady` status is detected by the `kube-proxy`-based health probes, which are now the default for services of `type=LoadBalancer` with `externalTrafficPolicy=Cluster`, without the node being detached from the public load balancer. This means that a node can retain an outbound internet connection throughout these phases. (link:https://issues.redhat.com/browse/OCPBUGS-7359[*OCPBUGS-7359*]) See the probe example after this list.
    • Bug Fix
    • Done
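
      For context, the probe behavior the fix relies on can be sketched with two commands. This is illustrative only: `my-svc` is a hypothetical `type=LoadBalancer` service and `<node-ip>` is a placeholder; port `10256` is the usual kube-proxy health port, which `ovnkube-node` serves on OVN-Kubernetes clusters.

      $ # Check that a service uses externalTrafficPolicy=Cluster (the default):
      $ oc get svc my-svc -o jsonpath='{.spec.type} {.spec.externalTrafficPolicy}{"\n"}'
      LoadBalancer Cluster

      $ # With kube-proxy-based probes, the load balancer can mark a NotReady node
      $ # unhealthy via this node-wide endpoint instead of detaching the node from
      $ # the backend pool:
      $ curl -s http://<node-ip>:10256/healthz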

    Description

      Description of problem:

      On Azure, after deleting a master, the old machine is stuck in the Deleting phase and some pods in the cluster are in ImagePullBackOff. Checking from the Azure console shows that the new master was not added to the load balancer backend pool, which appears to leave the machine with no internet connection.
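
      The backend pool can also be checked from the CLI instead of the Azure console. A minimal sketch, assuming a logged-in Azure CLI; the resource group is a placeholder, and the load balancer name is the public one reported for the machine below:

      $ # List the backend address pools of the cluster's public load balancer and
      $ # check whether the replacement master's IP configuration appears in them:
      $ az network lb address-pool list \
          --resource-group <cluster-resource-group> \
          --lb-name zhsunaz2132-5ctmh \
          --output table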

      Version-Release number of selected component (if applicable):

      4.13.0-0.nightly-2023-02-12-024338

      How reproducible:

      Always

      Steps to Reproduce:

      1. Set up a cluster on Azure with networkType OVNKubernetes
      2. Delete a master machine
      3. Check the machines and pods (example commands below)
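
      A sketch of steps 2 and 3 with `oc`, using one of the master machine names from the output below (any control plane machine works):

      $ # Step 2: delete one control plane machine; the machine API should create a
      $ # replacement and then remove the old machine.
      $ oc delete machine zhsunaz2132-5ctmh-master-0 -n openshift-machine-api

      $ # Step 3: check the machines, nodes, and pods for the failure symptoms:
      $ oc get machine -n openshift-machine-api
      $ oc get node
      $ oc get po --all-namespaces | grep ImagePullBackOff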
      

      Actual results:

      The old machine is stuck in Deleting, and some pods are in ImagePullBackOff.
       $ oc get machine    
      NAME                                    PHASE      TYPE              REGION   ZONE   AGE
      zhsunaz2132-5ctmh-master-0              Deleting   Standard_D8s_v3   westus          160m
      zhsunaz2132-5ctmh-master-1              Running    Standard_D8s_v3   westus          160m
      zhsunaz2132-5ctmh-master-2              Running    Standard_D8s_v3   westus          160m
      zhsunaz2132-5ctmh-master-flqqr-0        Running    Standard_D8s_v3   westus          105m
      zhsunaz2132-5ctmh-worker-westus-dhwfz   Running    Standard_D4s_v3   westus          152m
      zhsunaz2132-5ctmh-worker-westus-dw895   Running    Standard_D4s_v3   westus          152m
      zhsunaz2132-5ctmh-worker-westus-xlsgm   Running    Standard_D4s_v3   westus          152m
      
      $ oc describe machine zhsunaz2132-5ctmh-master-flqqr-0  -n openshift-machine-api |grep -i "Load Balancer"
            Internal Load Balancer:  zhsunaz2132-5ctmh-internal
            Public Load Balancer:      zhsunaz2132-5ctmh
      
      $ oc get node            
      NAME                                    STATUS     ROLES                  AGE    VERSION
      zhsunaz2132-5ctmh-master-0              Ready      control-plane,master   165m   v1.26.0+149fe52
      zhsunaz2132-5ctmh-master-1              Ready      control-plane,master   165m   v1.26.0+149fe52
      zhsunaz2132-5ctmh-master-2              Ready      control-plane,master   165m   v1.26.0+149fe52
      zhsunaz2132-5ctmh-master-flqqr-0        NotReady   control-plane,master   109m   v1.26.0+149fe52
      zhsunaz2132-5ctmh-worker-westus-dhwfz   Ready      worker                 152m   v1.26.0+149fe52
      zhsunaz2132-5ctmh-worker-westus-dw895   Ready      worker                 152m   v1.26.0+149fe52
      zhsunaz2132-5ctmh-worker-westus-xlsgm   Ready      worker                 152m   v1.26.0+149fe52
      $ oc describe node zhsunaz2132-5ctmh-master-flqqr-0
        Warning  ErrorReconcilingNode       3m5s (x181 over 108m)  controlplane         [k8s.ovn.org/node-chassis-id annotation not found for node zhsunaz2132-5ctmh-master-flqqr-0, macAddress annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0" , k8s.ovn.org/l3-gateway-config annotation not found for node "zhsunaz2132-5ctmh-master-flqqr-0"]
      
      $ oc get po --all-namespaces | grep ImagePullBackOff
      openshift-cluster-csi-drivers                      azure-disk-csi-driver-node-l8ng4                                  0/3     Init:ImagePullBackOff   0              113m
      openshift-cluster-csi-drivers                      azure-file-csi-driver-node-99k82                                  0/3     Init:ImagePullBackOff   0              113m
      openshift-cluster-node-tuning-operator             tuned-bvvh7                                                       0/1     ImagePullBackOff        0              113m
      openshift-dns                                      node-resolver-2p4zq                                               0/1     ImagePullBackOff        0              113m
      openshift-image-registry                           node-ca-vxv87                                                     0/1     ImagePullBackOff        0              113m
      openshift-machine-config-operator                  machine-config-daemon-crt5w                                       1/2     ImagePullBackOff        0              113m
      openshift-monitoring                               node-exporter-mmjsm                                               0/2     Init:ImagePullBackOff   0              113m
      openshift-multus                                   multus-4cg87                                                      0/1     ImagePullBackOff        0              113m
      openshift-multus                                   multus-additional-cni-plugins-mc6vx                               0/1     Init:ImagePullBackOff   0              113m
      openshift-ovn-kubernetes                           ovnkube-master-qjjsv                                              0/6     ImagePullBackOff        0              113m
      openshift-ovn-kubernetes                           ovnkube-node-k8w6j                                                0/6     ImagePullBackOff        0              113m

      Expected results:

      Replacing the master succeeds.

      Additional info:

      Tested payload 4.13.0-0.nightly-2023-02-03-145213 with the same result.
      Earlier testing on 4.13.0-0.nightly-2023-01-27-165107 worked as expected.


    People

      Damiano Donati (ddonati@redhat.com)
      Zhaohua Sun (rhn-support-zhsun)
      Darragh Fitzmaurice
      Riccardo Ravaioli