Bug
Resolution: Unresolved
4.12.z
Description of problem:
During rollback from OVNKubernetes to OpenShiftSDN, after the network type is changed in Network.config.openshift.io, the openshift-sdn pods fail with CIDR conflict errors and the rollback does not complete.
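Note for reviewers: the conflict in the logs below appears to be between the SDN cluster network (10.128.0.0/14) and per-node /23 subnets that were assigned while the cluster was still running OVNKubernetes. A minimal diagnostic sketch, assuming access to the live cluster (the node name is a placeholder, and the hostsubnets CRD is only present on SDN clusters):
oc get Network.config.openshift.io cluster -o yaml                # configured clusterNetwork / networkType
oc get hostsubnets.network.openshift.io                           # SDN host subnet allocations, if any
oc get nodes -o yaml | grep 'k8s.ovn.org/node-subnets'            # leftover OVN node-subnet annotations
oc debug node/<node-name> -- chroot /host ip addr show            # leftover OVN interfaces (e.g. ovn-k8s-mp0) still holding a 10.12x.y.z/23 address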
Logs:
misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.12.71   False   False   True   39m   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.misalunk-migration37.devcluster.openshift.com/healthz": EOF
baremetal   4.12.71   True   False   False   72m
cloud-controller-manager   4.12.71   True   False   False   75m
cloud-credential   4.12.71   True   False   False   76m
cluster-autoscaler   4.12.71   True   False   False   72m
config-operator   4.12.71   True   False   False   73m
console   4.12.71   False   False   False   39m   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.misalunk-migration37.devcluster.openshift.com): Get "https://console-openshift-console.apps.misalunk-migration37.devcluster.openshift.com": EOF
control-plane-machine-set   4.12.71   True   False   False   71m
csi-snapshot-controller   4.12.71   True   False   False   72m
dns   4.12.71   True   False   False   72m
etcd   4.12.71   True   False   False   71m
image-registry   4.12.71   True   False   False   65m
ingress   4.12.71   True   False   True   64m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights   4.12.71   True   False   False   66m
kube-apiserver   4.12.71   True   False   False   60m
kube-controller-manager   4.12.71   True   False   False   69m
kube-scheduler   4.12.71   True   False   False   69m
kube-storage-version-migrator   4.12.71   True   False   False   73m
machine-api   4.12.71   True   False   False   66m
machine-approver   4.12.71   True   False   False   72m
machine-config   4.12.71   True   False   True   64m   Failed to resync 4.12.71 because: Required MachineConfigPool 'master' is paused and can not sync until it is unpaused
marketplace   4.12.71   True   False   False   72m
monitoring   4.12.71   True   False   False   64m
network   4.12.71   True   True   True   75m   DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2025-02-05T00:42:50Z
node-tuning   4.12.71   True   False   False   72m
openshift-apiserver   4.12.71   True   False   False   60m
openshift-controller-manager   4.12.71   True   False   False   68m
openshift-samples   4.12.71   True   False   False   65m
operator-lifecycle-manager   4.12.71   True   False   False   72m
operator-lifecycle-manager-catalog   4.12.71   True   False   False   72m
operator-lifecycle-manager-packageserver   4.12.71   True   False   False   66m
service-ca   4.12.71   True   False   False   73m
storage   4.12.71   True   False   False   72m
misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc get pods -n openshift-sdn
NAME   READY   STATUS   RESTARTS   AGE
sdn-controller-gtc5q   1/2   CrashLoopBackOff   12 (59s ago)   39m
sdn-controller-qcsmq   2/2   Running   8 (17m ago)   39m
sdn-controller-wrkmn   2/2   Running   12 (3m39s ago)   39m
sdn-hsck5   1/2   Running   9 (6m37s ago)   39m
sdn-l9pwp   1/2   Error   9 (7m6s ago)   39m
sdn-lflwp   1/2   Running   9 (6m50s ago)   39m
sdn-qz2fg   1/2   Running   9 (6m54s ago)   39m
sdn-s76c6   1/2   Running   9 (6m52s ago)   39m
sdn-xxjz4   1/2   Running   9 (6m53s ago)   39m
misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc logs sdn-controller-gtc5q -n openshift-sdn
Defaulted container "sdn-controller" out of: sdn-controller, kube-rbac-proxy
I0205 01:14:15.286796       1 server.go:27] Starting HTTP metrics server
I0205 01:14:15.286891       1 leaderelection.go:248] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
I0205 01:21:45.043815       1 leaderelection.go:258] successfully acquired lease openshift-sdn/openshift-network-controller
I0205 01:21:45.043914       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-sdn", Name:"openshift-network-controller", UID:"8f066780-17f1-41c1-9cf0-f902f68e3f9c", APIVersion:"v1", ResourceVersion:"49665", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-130-36 became leader
I0205 01:21:45.043935       1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-sdn", Name:"openshift-network-controller", UID:"c35cdcff-6ad5-454b-b97f-a5b765813da5", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"49666", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-130-36 became leader
I0205 01:21:45.044229       1 master.go:56] Initializing SDN master
F0205 01:21:45.049989       1 network_controller.go:54] Error starting OpenShift Network Controller: cluster IP: 10.128.0.0 conflicts with host network: 10.129.0.0/23

misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc logs sdn-l9pwp -n openshift-sdn
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
I0205 01:20:47.315954   79409 cmd.go:128] Reading proxy configuration from /config/kube-proxy-config.yaml
I0205 01:20:47.316570   79409 feature_gate.go:245] feature gates: &{map[]}
I0205 01:20:47.316608   79409 cmd.go:232] Watching config file /config/kube-proxy-config.yaml for changes
I0205 01:20:47.316635   79409 cmd.go:232] Watching config file /config/..2025_02_05_00_42_50.793092302/kube-proxy-config.yaml for changes
E0205 01:20:47.340084   79409 node.go:220] Local networks conflict with SDN; this will eventually cause problems: cluster IP: 10.128.0.0 conflicts with host network: 10.130.0.0/23
I0205 01:20:47.340146   79409 node.go:153] Initializing SDN node "ip-10-0-161-148.ec2.internal" (10.0.161.148) of type "redhat/openshift-ovs-networkpolicy"
I0205 01:20:47.340342   79409 cmd.go:174] Starting node networking (4.12.0-202412170201.p0.g9706f96.assembly.stream.el8-9706f96)
I0205 01:20:47.340352   79409 node.go:315] Starting openshift-sdn network plugin
W0205 01:20:47.345039   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:48.348359   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:49.851506   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:52.105557   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:55.485830   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:00.556531   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:08.155902   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:19.553740   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:36.653848   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:22:02.292709   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:22:40.744522   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
F0205 01:22:40.744544   79409 cmd.go:118] Failed to start sdn: failed to get subnet for this host: ip-10-0-161-148.ec2.internal, error: timed out waiting for the condition
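Note: in the oc get co output above, machine-config is degraded because the 'master' MachineConfigPool is still paused. For reference only (not taken from the failing run), and assuming the standard rollback flow where the pools are paused before the networkType change and unpaused after all nodes reboot, the pool state can be checked and unpaused with:
oc get mcp
oc patch mcp master --type merge --patch '{"spec":{"paused":false}}'
oc patch mcp worker --type merge --patch '{"spec":{"paused":false}}'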
Version-Release number of selected component (if applicable): 4.12
How reproducible: Always
Steps to Reproduce:
1. Run all 6 steps mentioned in the document (a rough command-level sketch of the usual rollback commands follows below).
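For reviewers without the referenced document at hand: the commands below are a rough manual sketch of the usual offline rollback steps, based on the generic OpenShiftSDN rollback procedure. They are an assumption, not a copy of the document's steps or of the Ansible playbook used here.
# 1. Pause the MachineConfigPools
oc patch mcp master --type merge --patch '{"spec":{"paused":true}}'
oc patch mcp worker --type merge --patch '{"spec":{"paused":true}}'
# 2. Set the migration field on the operator network config
oc patch Network.operator.openshift.io cluster --type merge --patch '{"spec":{"migration":{"networkType":"OpenShiftSDN"}}}'
# 3. Switch the cluster network type
oc patch Network.config.openshift.io cluster --type merge --patch '{"spec":{"networkType":"OpenShiftSDN"}}'
# 4. Reboot all nodes so the new CNI configuration takes effect
# 5. Unpause the MachineConfigPools once all nodes are back and Ready
oc patch mcp master --type merge --patch '{"spec":{"paused":false}}'
oc patch mcp worker --type merge --patch '{"spec":{"paused":false}}'
# 6. Clear the migration field and remove leftover OVN-Kubernetes resources
oc patch Network.operator.openshift.io cluster --type merge --patch '{"spec":{"migration":null}}'
oc delete namespace openshift-ovn-kubernetes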
Actual results:
The rollback does not complete: the openshift-sdn controller and node pods crash-loop with "cluster IP: 10.128.0.0 conflicts with host network" errors, nodes never get an allocated subnet, and the network, ingress, authentication and machine-config cluster operators remain degraded.
Expected results:
The rollback to OpenShiftSDN completes, the openshift-sdn pods start without CIDR conflict errors, and all cluster operators return to Available=True.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
- internal CI failure
- customer issue / SD
- internal Red Hat testing failure
If it is an internal Red Hat testing failure:
- Please share a kubeconfig or credentials to a live cluster for the assignee to debug/troubleshoot, along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).
If it is a CI failure:
- Did it happen in different CI lanes? If so, please provide links to multiple failures with the same error instance
- Did it happen in both sdn and ovn jobs? If so, please provide links to multiple failures with the same error instance
- Did it happen on other platforms (e.g. AWS, Azure, GCP, baremetal, etc.)? If so, please provide links to multiple failures with the same error instance
- When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
- If it's a connectivity issue:
- What are the srcNode, srcIP, srcNamespace and srcPodName?
- What are the dstNode, dstIP, dstNamespace and dstPodName?
- What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2node, etc.) Example commands for collecting these fields are shown below.
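One way to collect the src/dst fields above (namespace names are placeholders), for example:
oc get pods -n <srcNamespace> -o wide    # shows srcPodName, srcIP and srcNode
oc get pods -n <dstNamespace> -o wide    # shows dstPodName, dstIP and dstNode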
If it is a customer / SD issue:
- Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
- Don't presume that Engineering has access to Salesforce.
- Do presume that Engineering will access attachments through supportshell.
- Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
- Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
- If the issue is in a customer namespace then provide a namespace inspect.
- If it is a connectivity issue:
- What are the srcNode, srcNamespace, srcPodName and srcPodIP?
- What are the dstNode, dstNamespace, dstPodName and dstPodIP?
- What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2node, etc.)
- Please provide the UTC timestamp of the networking outage window from the must-gather
- Please provide tcpdump pcaps taken during the outage, filtered on the src/dst IPs provided above (example collection commands follow this checklist)
- If it is not a connectivity issue:
- Describe the steps taken so far to analyze the logs from the networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the component where the issue was seen, based on the attached must-gather. Please attach snippets of the relevant logs from around the window when the problem happened, if any.
- When showing the results from commands, include the entire command in the output.
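Example collection commands for the items above (a sketch only; the node name, namespace and IPs are placeholders):
oc adm must-gather                                         # full cluster must-gather
oc adm inspect ns/<customer-namespace>                     # namespace inspect for issues in a customer namespace
oc debug node/<node-name> -- chroot /host tcpdump -nn -i any host <srcIP> and host <dstIP> -w /var/tmp/outage.pcap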
- For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
- For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
- Do not set the priority; that is owned by Engineering and will be set when the bug is evaluated
- Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"
- For guidance on using this template, please see OCPBUGS Template Training for Networking components
Is impacted by: CORENET-658 Support for Ansible playbook for offline migration (status: In Progress)