Bug
Resolution: Unresolved
Normal
4.19
Quality / Stability / Reliability
Rejected
NI&D Sprint 272, NI&D Sprint 274, NI&D Sprint 278
Description of problem:
I deployed an OCP 4.19 cluster on bare metal with 22 worker nodes and 2 infra nodes using 4.19.0-ec.3, then applied the OVN-Kubernetes BGP image built from PR build 4.19,openshift/ovn-kubernetes#2239.
After 4 to 5 hours, some cluster operators become degraded:
[root@e33-h03-000-r650 debug_oc]# oc get co | grep "False True"
authentication 4.19.0-ec.3 False False True 37h OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.vkommadieip29.rdu2.scalelab.redhat.com/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
console 4.19.0-ec.3 False False True 37h RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com): Get "https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ingress 4.19.0-ec.3 True False True 3d13h The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing. Last 1 error messages:...
insights 4.19.0-ec.3 False False True 33h Failed to upload data: unable to build request to connect to Insights server: Post "https://console.redhat.com/api/ingress/v1/upload": dial tcp 23.40.100.203:443: i/o timeout
kube-controller-manager 4.19.0-ec.3 True False True 3d13h GarbageCollectorDegraded: error fetching rules: client_error: client error: 401
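For reference, the same list of unavailable or degraded operators can be pulled without depending on the column layout; a minimal sketch, assuming jq is available on the host:
# List cluster operators that are not Available or are Degraded
oc get clusteroperators -o json | jq -r '.items[]
  | select(any(.status.conditions[]; (.type=="Available" and .status=="False")
                                  or (.type=="Degraded"  and .status=="True")))
  | .metadata.name'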
The ingress operator logs the following error:
2025-03-24T12:12:20.904051049Z 2025-03-24T12:12:20.903Z ERROR operator.canary_controller wait/backoff.go:226 error performing canary route check {"error": "error sending canary HTTP Request: Timeout: Get \"https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
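The timeouts can be reproduced by hand against the routes named in the conditions above; a minimal sketch, run from a host that resolves the cluster's *.apps domain:
# Probe the canary, console, and oauth routes directly; --max-time bounds the request
# similarly to the operators' client timeouts, -k skips certificate verification
curl -k -s -o /dev/null --max-time 10 -w 'canary:  %{http_code}\n' https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com
curl -k -s -o /dev/null --max-time 10 -w 'console: %{http_code}\n' https://console-openshift-console.apps.vkommadieip29.rdu2.scalelab.redhat.com
curl -k -s -o /dev/null --max-time 10 -w 'oauth:   %{http_code}\n' https://oauth-openshift.apps.vkommadieip29.rdu2.scalelab.redhat.com/healthz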
This does not happen when the BGP patch is not applied. It is blocking BGP testing on the bare-metal deployment because Prometheus is down.
Version-Release number of selected component (if applicable):
4.19.0-ec.3
OVNK image from build 4.19,openshift/ovn-kubernetes#2239
How reproducible:
Always
Steps to Reproduce:
1. Deploy 4.19.0-ec.3 on bare metal with 24 worker nodes.
2. oc patch featuregate cluster --type=merge -p='{"spec":{"featureSet":"TechPreviewNoUpgrade"}}'
3. oc patch Network.operator.openshift.io cluster --type=merge -p='{"spec":{"additionalRoutingCapabilities":{"providers":["FRR"]},"defaultNetwork":{"ovnKubernetesConfig":{"routeAdvertisements":"Enabled"}}}}'
4. Label 2 nodes as infra and move the ingress, registry, and Prometheus workloads to the infra nodes.
5. oc scale --replicas=0 deploy/cluster-version-operator -n openshift-cluster-version
oc -n openshift-network-operator set env deployment.apps/network-operator OVN_IMAGE=quay.io/vkommadi/bgppr2239ovnk:latest
6. git clone -b ovnk-bgp https://github.com/jcaamano/frr-k8s
cd frr-k8s/hack/demo/
./demo.sh
7. oc apply -f ~/frr-k8s/hack/demo/configs/receive_all.yaml
8. cat ~/ra.yaml
apiVersion: k8s.ovn.org/v1
kind: RouteAdvertisements
metadata:
  name: default
spec:
  networkSelector:
    matchLabels:
      k8s.ovn.org/default-network: ""
  advertisements:
    - "PodNetwork"
    - "EgressIP"
oc apply -f ~/ra.yaml
9. Wait 5 to 6 hours; some operators become degraded because of health-check failures (mainly ingress, Prometheus, authentication, and console). A sketch of spot checks for the setup from steps 2-8 follows this list.
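For reference, a minimal sketch of spot checks for the configuration applied in steps 2-8; the openshift-frr-k8s namespace, the ovnkube-node daemonset name, and the <worker-node> placeholder are assumptions for illustration:
# Feature set and route advertisement knobs from steps 2 and 3
oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'
oc get network.operator cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.routeAdvertisements}{"\n"}'

# Overridden OVN image from step 5 (daemonset name assumed to be ovnkube-node)
oc get daemonset ovnkube-node -n openshift-ovn-kubernetes -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'

# RouteAdvertisements CR and frr-k8s pods from steps 3, 7, and 8 (namespace assumed)
oc get routeadvertisements default -o yaml
oc get pods -n openshift-frr-k8s

# Routes learned over BGP on a worker (<worker-node> is a placeholder)
oc debug node/<worker-node> -- chroot /host ip route show proto bgp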
Actual results:
Operator health checks fail because route access fails when BGP is enabled. We cannot run scale tests because Prometheus is down.
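To narrow down when the failures begin, a simple watch can be left running on the bastion during the 5 to 6 hour window; a minimal sketch (the interval and operator list are arbitrary):
# Log route reachability and operator status once a minute
while true; do
    date
    curl -k -s -o /dev/null --max-time 10 -w 'canary: %{http_code}\n' https://canary-openshift-ingress-canary.apps.vkommadieip29.rdu2.scalelab.redhat.com
    oc get clusteroperators authentication console ingress monitoring --no-headers
    sleep 60
done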
Expected results:
Operator health checks should not fail.
Additional info:
Affected Platforms:
This is an internal RedHat testing failure.
- must-gather is at https://storage.scalelab.redhat.com/anilvenkata/bgp/must-gather.local.5453415228024572335.tar.gz ; I can provide a live environment if the engineer wants one.