Bug
Resolution: Unresolved
Normal
4.19
Quality / Stability / Reliability
Rejected
NI&D Sprint 272, NI&D Sprint 274, NI&D Sprint 278
Description of problem:
I deployed an OCP 4.19 cluster on bare metal with 22 worker nodes and 2 infra nodes using 4.19.0-ec.3, then applied the OVN-Kubernetes BGP image built from PR build 4.19,openshift/ovn-kubernetes#2239.
After 4 to 5 hours, some cluster operators become degraded:
[root@e33-h03-000-r650 debug_oc]# oc get co | grep "False True"
authentication 4.19.0-ec.3 False False True 37h OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.vkommadieip29.rdu2.scalelab.redhat.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
console 4.19.0-ec.3 False False True 37h RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com): Get "https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ingress 4.19.0-ec.3 True False True 3d13h The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing. Last 1 error messages:...
insights 4.19.0-ec.3 False False True 33h Failed to upload data: unable to build request to connect to Insights server: Post "https://console.redhat.com/api/ingress/v1/upload": dial tcp 23.40.100.203:443: i/o timeout
kube-controller-manager 4.19.0-ec.3 True False True 3d13h GarbageCollectorDegraded: error fetching rules: client_error: client error: 401
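For reference, the same list of unavailable or degraded operators can be pulled without depending on the column layout; a minimal sketch, assuming jq is available on the host:
# List cluster operators that are not Available or are Degraded
oc get clusteroperators -o json | jq -r '.items[]
  | select(any(.status.conditions[]; (.type=="Available" and .status=="False")
                                  or (.type=="Degraded"  and .status=="True")))
  | .metadata.name'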
The ingress operator logs the following error:
2025-03-24T12:12:20.904051049Z 2025-03-24T12:12:20.903Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
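The timeouts can be reproduced by hand against the routes named in the conditions above; a minimal sketch, run from a host that resolves the cluster's *.apps domain:
# Probe the canary, console, and oauth routes directly; --max-time bounds the request
# similarly to the operators' client timeouts, -k skips certificate verification
curl -k -s -o /dev/null --max-time 10 -w 'canary:  %{http_code}\n' https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com
curl -k -s -o /dev/null --max-time 10 -w 'console: %{http_code}\n' https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com
curl -k -s -o /dev/null --max-time 10 -w 'oauth:   %{http_code}\n' https://oauth-openshift.apps.vkommadieip29.rdu2.scalelab.redhat.com/healthz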
This does not happen when the BGP patch is not applied. It is blocking BGP testing on the bare-metal deployment because Prometheus is down.
Version-Release number of selected component (if applicable):
4.19.0-ec.3
OVNK image from build 4.19,openshift/ovn-kubernetes#2239
How reproducible:
Always
Steps to Reproduce:
1. Deploy 4.19.0-ec.3 on bare metal with 24 worker nodes.
2. oc patch featuregate cluster --type=merge -p='{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
3. oc patch Network.operator.openshift.io cluster --type=merge -p='{"spec":{"additionalRoutingCapabilities":{"providers":["FRR"]},"defaultNetwork":{"ovnKubernetesConfig":{"routeAdvertisements":"Enabled"}}}}'
4. Label 2 nodes as infra and move the ingress, registry, and Prometheus workloads to the infra nodes.
5. oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version
oc -n openshift-network-operator set env deployment.apps/network-operator OVN_IMAGE=quay.io/vkommadi/bgppr2239ovnk:latest
6. git clone -b ovnk-bgp https://github.com/jcaamano/frr-k8s
cd frr-k8s/hack/demo/
./demo.sh
7. oc apply -f ~/frr-k8s/hack/demo/configs/receive_all.yaml
8. cat ~/ra.yaml
apiVersion: k8s.ovn.org/v1
kind: RouteAdvertisements
metadata:
  name: default
spec:
  networkSelector:
    matchLabels:
      k8s.ovn.org/default-network: ""
  advertisements:
    - "PodNetwork"
    - "EgressIP"
oc apply -f ~/ra.yaml
9. Wait 5 to 6 hours; some operators become degraded because of health-check failures (mainly ingress, Prometheus, authentication, and console). A sketch of spot checks for the setup from steps 2-8 follows this list.
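For reference, a minimal sketch of spot checks for the configuration applied in steps 2-8; the openshift-frr-k8s namespace, the ovnkube-node daemonset name, and the <worker-node> placeholder are assumptions for illustration:
# Feature set and route advertisement knobs from steps 2 and 3
oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'
oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.routeAdvertisements}{"\n"}'

# Overridden OVN image from step 5 (daemonset name assumed to be ovnkube-node)
oc get daemonset ovnkube-node -n openshift-ovn-kubernetes -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'

# RouteAdvertisements CR and frr-k8s pods from steps 3, 7, and 8 (namespace assumed)
oc get routeadvertisements default -o yaml
oc get pods -n openshift-frr-k8s

# Routes learned over BGP on a worker (<worker-node> is a placeholder)
oc debug node/<worker-node> -- chroot /host ip route show proto bgp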
Actual results:
Operator health checks fail because route access fails when BGP is enabled. We cannot run scale tests because Prometheus is down.
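To narrow down when the failures begin, a simple watch can be left running on the bastion during the 5 to 6 hour window; a minimal sketch (the interval and operator list are arbitrary):
# Log route reachability and operator status once a minute
while true; do
    date
    curl -k -s -o /dev/null --max-time 10 -w 'canary: %{http_code}\n' https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com
    oc get clusteroperators authentication console ingress monitoring --no-headers
    sleep 60
done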
Expected results:
Operator health checks should not fail.
Additional info:
Affected Platforms:
This is an internal RedHat testing failure.
- must-gather is at https://storage.scalelab.redhat.com/anilvenkata/bgp/must-gather.local.5453415228024572335.tar.gz ; I can provide a live environment if the engineer wants one.