Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 4.16.0
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:SCALE

Severity: Important
Regression: No
Sprint: SDN Sprint 250
sprint_count: 1
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:
PX Technical Impact Notes:
05/22: Resolved with OCPBUG-28745 according to the TAM. Can be closed.
PX Impact Range:
PX Review Complete:
PX Technical Impact:

Original: Description of problem:

When scaling the cluster up to 400 workers with:

oc scale machineset zhsunaz2-42pvb-worker-eastus1 --replicas 200
oc scale machineset zhsunaz2-42pvb-worker-eastus2 --replicas 100
oc scale machineset zhsunaz2-42pvb-worker-eastus3 --replicas 100

the ovnkube-node pods fail to become ready, with these logs:

I0123 09:47:19.213970 134093 obj_retry.go:607] Update event received for *factory.egressNode zhsunaz2-42pvb-worker-eastus2-fg8j4
I0123 09:47:19.213998 134093 obj_retry.go:555] Update event received for resource *factory.egressFwNode, old object is equal to new: true
I0123 09:47:19.221642 134093 ovs.go:167] Exec(7410): stdout: ""
I0123 09:47:19.221731 134093 ovs.go:168] Exec(7410): stderr: ""
I0123 09:47:19.221761 134093 default_node_network_controller.go:645] Upgrade Hack: checkOVNSBNodeLRSR for node - 10.130.136.0/23 : match match="reg7 == 0 && ip4.dst == 10.130.136.0/23" : stdout - : stderr - : err <nil>
F0123 09:47:19.221788 134093 default_node_network_controller.go:955] Upgrade hack: Timed out waiting for the remote ovnkube-controller to be ready even after 5 minutes, err : context deadline exceeded, upgrade hack: unable to find LRSR for node zhsunaz2-42pvb-worker-eastus2-qmxgw
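
When many ovnkube-node pods are restarting, it can help to pull only the fatal entries out of each container log. The following is a minimal sketch, assuming the logs use the standard klog header (severity letter, MMDD, time, process/thread id, `file:line]`); the function name `fatal_lines` and the `SAMPLE` text are illustrative and not part of ovn-kubernetes.

```python
import re

# klog header: severity letter, MMDD, time, process/thread id, file:line, "] message".
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)


def fatal_lines(log_text):
    """Return (time, source, message) for every fatal (F) klog line."""
    found = []
    for line in log_text.splitlines():
        m = KLOG_RE.match(line.strip())
        if m and m.group("sev") == "F":
            found.append((m.group("time"), m.group("src"), m.group("msg")))
    return found


# Two lines copied from the log excerpt in this report.
SAMPLE = """\
I0123 09:47:19.221731 134093 ovs.go:168] Exec(7410): stderr: ""
F0123 09:47:19.221788 134093 default_node_network_controller.go:955] Upgrade hack: Timed out waiting for the remote ovnkube-controller to be ready even after 5 minutes, err : context deadline exceeded, upgrade hack: unable to find LRSR for node zhsunaz2-42pvb-worker-eastus2-qmxgw
"""

if __name__ == "__main__":
    for when, src, msg in fatal_lines(SAMPLE):
        print(when, src, msg)
```

Run against the excerpt above, this keeps only the `default_node_network_controller.go:955` timeout, which is the line that terminates the container.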

ovnkube-node-xmgd2  8/9  Running           23 (4m25s ago)  158m  10.0.128.56   zhsunaz2-42pvb-worker-eastus1-gpzs6  <none>  <none>
ovnkube-node-xmvgj  8/9  Running           23 (5m45s ago)  154m  10.0.129.74   zhsunaz2-42pvb-worker-eastus3-8dxzl  <none>  <none>
ovnkube-node-z4hl8  8/9  Running           23 (2m15s ago)  155m  10.0.129.87   zhsunaz2-42pvb-worker-eastus3-4rc8k  <none>  <none>
ovnkube-node-z4w9t  8/9  Running           23 (3m7s ago)   157m  10.0.128.107  zhsunaz2-42pvb-worker-eastus1-qfx4x  <none>  <none>
ovnkube-node-z8gl6  8/9  Running           23 (5m51s ago)  158m  10.0.128.58   zhsunaz2-42pvb-worker-eastus1-zgdrd  <none>  <none>
ovnkube-node-z9xxv  8/9  Running           23 (3m12s ago)  156m  10.0.128.115  zhsunaz2-42pvb-worker-eastus1-j7c9f  <none>  <none>
ovnkube-node-zbr5q  8/9  CrashLoopBackOff  23 (2m15s ago)  156m  10.0.128.224  zhsunaz2-42pvb-worker-eastus2-chfl4  <none>  <none>
ovnkube-node-zcpmd  8/9  CrashLoopBackOff  23 (73s ago)    161m  10.0.128.26   zhsunaz2-42pvb-worker-eastus1-8d247  <none>  <none>
ovnkube-node-zg2xn  8/9  Running           23 (4m14s ago)  156m  10.0.128.232  zhsunaz2-42pvb-worker-eastus2-xsrnt  <none>  <none>
ovnkube-node-zjkpj  8/9  Running           23 (2m28s ago)  155m  10.0.129.40   zhsunaz2-42pvb-worker-eastus2-mgxl7  <none>  <none>
ovnkube-node-zmlqb  8/9  Error             23 (6m40s ago)  160m  10.0.128.65   zhsunaz2-42pvb-worker-eastus1-vx5t7  <none>  <none>
ovnkube-node-zrcmw  8/9  CrashLoopBackOff  22 (49s ago)    150m  10.0.129.115  zhsunaz2-42pvb-worker-eastus3-zvzzx  <none>  <none>
ovnkube-node-zsj67  8/9  Running           23 (92s ago)    154m  10.0.129.77   zhsunaz2-42pvb-worker-eastus3-xznxk  <none>  <none>
ovnkube-node-zskbl  8/9  Running           24 (3m19s ago)  152m  10.0.129.124  zhsunaz2-42pvb-worker-eastus3-29rdd  <none>  <none>
ovnkube-node-zwt2z  8/9  Running           23 (3m15s ago)  157m  10.0.128.114  zhsunaz2-42pvb-worker-eastus1-ldvjb  <none>  <none>

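
With roughly 400 nodes the pod listing is too long to read by eye. The sketch below tallies pods per status and reports the highest restart count. It assumes whitespace-separated `oc get pods -o wide` rows in the order NAME, READY, STATUS, RESTARTS; the helper name `summarize` and the three sample rows (taken from the listing above) are illustrative only.

```python
from collections import Counter


def summarize(listing):
    """Count pods per STATUS and return the highest restart count seen.

    Rows whose RESTARTS field is not a number (for example a header row)
    are skipped.
    """
    statuses = Counter()
    max_restarts = 0
    for row in listing.splitlines():
        fields = row.split()
        if len(fields) < 4 or not fields[3].isdigit():
            continue
        statuses[fields[2]] += 1
        max_restarts = max(max_restarts, int(fields[3]))
    return statuses, max_restarts


POD_SAMPLE = """\
ovnkube-node-zbr5q 8/9 CrashLoopBackOff 23 (2m15s ago) 156m 10.0.128.224 zhsunaz2-42pvb-worker-eastus2-chfl4 <none> <none>
ovnkube-node-zmlqb 8/9 Error 23 (6m40s ago) 160m 10.0.128.65 zhsunaz2-42pvb-worker-eastus1-vx5t7 <none> <none>
ovnkube-node-zskbl 8/9 Running 24 (3m19s ago) 152m 10.0.129.124 zhsunaz2-42pvb-worker-eastus3-29rdd <none> <none>
"""

if __name__ == "__main__":
    counts, worst = summarize(POD_SAMPLE)
    for status, n in sorted(counts.items()):
        print(f"{status}: {n}")
    print("max restarts:", worst)
```

Note that a pod shown as `Running` with 8/9 containers ready and 20+ restarts is still failing; the status column alone understates the problem, which is why the restart count is reported alongside it.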
Version-Release number of selected component (if applicable):

4.16.0-0.nightly-2024-01-21-154905

How reproducible:

100%

Steps to Reproduce:

    1. Set up a cluster with OVN-Kubernetes on Azure.
    2. Scale up to 400 workers through the machinesets.

Actual results:

Expected results:

Additional info: