Bug
Resolution: Unresolved
4.12.z
Description of problem:
During rollback from OVNKubernetes to OpenShiftSDN, after the network type is changed in Network.config.openshift.io, the openshift-sdn pods fail with CIDR conflict errors and the rollback does not complete.
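Note for reviewers: the conflict in the logs below appears to be between the SDN cluster network (10.128.0.0/14) and per-node /23 subnets that were assigned while the cluster was still running OVNKubernetes. A minimal diagnostic sketch, assuming access to the live cluster (the node name is a placeholder, and the hostsubnets CRD is only present on SDN clusters):
oc get Network.config.openshift.io cluster -o yaml                # configured clusterNetwork / networkType
oc get hostsubnets.network.openshift.io                           # SDN host subnet allocations, if any
oc get nodes -o yaml | grep 'k8s.ovn.org/node-subnets'            # leftover OVN node-subnet annotations
oc debug node/<node-name> -- chroot /host ip addr show            # leftover OVN interfaces (e.g. ovn-k8s-mp0) still holding a 10.12x.y.z/23 address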
Logs:
misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc get co
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.12.71   False   False   True   39m   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.misalunk-migration37.devcluster.openshift.com/healthz": EOF
baremetal   4.12.71   True   False   False   72m
cloud-controller-manager   4.12.71   True   False   False   75m
cloud-credential   4.12.71   True   False   False   76m
cluster-autoscaler   4.12.71   True   False   False   72m
config-operator   4.12.71   True   False   False   73m
console   4.12.71   False   False   False   39m   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.misalunk-migration37.devcluster.openshift.com): Get "https://console-openshift-console.apps.misalunk-migration37.devcluster.openshift.com": EOF
control-plane-machine-set   4.12.71   True   False   False   71m
csi-snapshot-controller   4.12.71   True   False   False   72m
dns   4.12.71   True   False   False   72m
etcd   4.12.71   True   False   False   71m
image-registry   4.12.71   True   False   False   65m
ingress   4.12.71   True   False   True   64m   The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
insights   4.12.71   True   False   False   66m
kube-apiserver   4.12.71   True   False   False   60m
kube-controller-manager   4.12.71   True   False   False   69m
kube-scheduler   4.12.71   True   False   False   69m
kube-storage-version-migrator   4.12.71   True   False   False   73m
machine-api   4.12.71   True   False   False   66m
machine-approver   4.12.71   True   False   False   72m
machine-config   4.12.71   True   False   True   64m   Failed to resync 4.12.71 because: Required MachineConfigPool 'master' is paused and can not sync until it is unpaused
marketplace   4.12.71   True   False   False   72m
monitoring   4.12.71   True   False   False   64m
network   4.12.71   True   True   True   75m   DaemonSet "/openshift-sdn/sdn" rollout is not making progress - last change 2025-02-05T00:42:50Z
node-tuning   4.12.71   True   False   False   72m
openshift-apiserver   4.12.71   True   False   False   60m
openshift-controller-manager   4.12.71   True   False   False   68m
openshift-samples   4.12.71   True   False   False   65m
operator-lifecycle-manager   4.12.71   True   False   False   72m
operator-lifecycle-manager-catalog   4.12.71   True   False   False   72m
operator-lifecycle-manager-packageserver   4.12.71   True   False   False   66m
service-ca   4.12.71   True   False   False   73m
storage   4.12.71   True   False   False   72m
misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc get pods -n openshift-sdn
NAME   READY   STATUS   RESTARTS   AGE
sdn-controller-gtc5q   1/2   CrashLoopBackOff   12 (59s ago)   39m
sdn-controller-qcsmq   2/2   Running   8 (17m ago)   39m
sdn-controller-wrkmn   2/2   Running   12 (3m39s ago)   39m
sdn-hsck5   1/2   Running   9 (6m37s ago)   39m
sdn-l9pwp   1/2   Error   9 (7m6s ago)   39m
sdn-lflwp   1/2   Running   9 (6m50s ago)   39m
sdn-qz2fg   1/2   Running   9 (6m54s ago)   39m
sdn-s76c6   1/2   Running   9 (6m52s ago)   39m
sdn-xxjz4   1/2   Running   9 (6m53s ago)   39m
misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc logs sdn-controller-gtc5q -n openshift-sdn
Defaulted container "sdn-controller" out of: sdn-controller, kube-rbac-proxy
I0205 01:14:15.286796       1 server.go:27] Starting HTTP metrics server
I0205 01:14:15.286891       1 leaderelection.go:248] attempting to acquire leader lease openshift-sdn/openshift-network-controller...
I0205 01:21:45.043815       1 leaderelection.go:258] successfully acquired lease openshift-sdn/openshift-network-controller
I0205 01:21:45.043914       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"openshift-sdn", Name:"openshift-network-controller", UID:"8f066780-17f1-41c1-9cf0-f902f68e3f9c", APIVersion:"v1", ResourceVersion:"49665", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-130-36 became leader
I0205 01:21:45.043935       1 event.go:285] Event(v1.ObjectReference{Kind:"Lease", Namespace:"openshift-sdn", Name:"openshift-network-controller", UID:"c35cdcff-6ad5-454b-b97f-a5b765813da5", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"49666", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-130-36 became leader
I0205 01:21:45.044229       1 master.go:56] Initializing SDN master
F0205 01:21:45.049989       1 network_controller.go:54] Error starting OpenShift Network Controller: cluster IP: 10.128.0.0 conflicts with host network: 10.129.0.0/23

misalunk@misalunk-mac ansible-sdn-to-ovn-migration % oc logs sdn-l9pwp -n openshift-sdn
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
I0205 01:20:47.315954   79409 cmd.go:128] Reading proxy configuration from /config/kube-proxy-config.yaml
I0205 01:20:47.316570   79409 feature_gate.go:245] feature gates: &{map[]}
I0205 01:20:47.316608   79409 cmd.go:232] Watching config file /config/kube-proxy-config.yaml for changes
I0205 01:20:47.316635   79409 cmd.go:232] Watching config file /config/..2025_02_05_00_42_50.793092302/kube-proxy-config.yaml for changes
E0205 01:20:47.340084   79409 node.go:220] Local networks conflict with SDN; this will eventually cause problems: cluster IP: 10.128.0.0 conflicts with host network: 10.130.0.0/23
I0205 01:20:47.340146   79409 node.go:153] Initializing SDN node "ip-10-0-161-148.ec2.internal" (10.0.161.148) of type "redhat/openshift-ovs-networkpolicy"
I0205 01:20:47.340342   79409 cmd.go:174] Starting node networking (4.12.0-202412170201.p0.g9706f96.assembly.stream.el8-9706f96)
I0205 01:20:47.340352   79409 node.go:315] Starting openshift-sdn network plugin
W0205 01:20:47.345039   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:48.348359   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:49.851506   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:52.105557   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:20:55.485830   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:00.556531   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:08.155902   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:19.553740   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:21:36.653848   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:22:02.292709   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
W0205 01:22:40.744522   79409 subnets.go:156] Could not find an allocated subnet for node: ip-10-0-161-148.ec2.internal, Waiting...
F0205 01:22:40.744544   79409 cmd.go:118] Failed to start sdn: failed to get subnet for this host: ip-10-0-161-148.ec2.internal, error: timed out waiting for the condition
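Note: in the oc get co output above, machine-config is degraded because the 'master' MachineConfigPool is still paused. For reference only (not taken from the failing run), and assuming the standard rollback flow where the pools are paused before the networkType change and unpaused after all nodes reboot, the pool state can be checked and unpaused with:
oc get mcp
oc patch mcp master --type merge --patch '{"spec":{"paused":false}}'
oc patch mcp worker --type merge --patch '{"spec":{"paused":false}}'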
Version-Release number of selected component (if applicable): 4.12
How reproducible: Always
Steps to Reproduce:
1. Run all 6 steps mentioned in the document (a rough command-level sketch of the usual rollback commands follows below).
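For reviewers without the referenced document at hand: the commands below are a rough manual sketch of the usual offline rollback steps, based on the generic OpenShiftSDN rollback procedure. They are an assumption, not a copy of the document's steps or of the Ansible playbook used here.
# 1. Pause the MachineConfigPools
oc patch mcp master --type merge --patch '{"spec":{"paused":true}}'
oc patch mcp worker --type merge --patch '{"spec":{"paused":true}}'
# 2. Set the migration field on the operator network config
oc patch Network.operator.openshift.io cluster --type merge --patch '{"spec":{"migration":{"networkType":"OpenShiftSDN"}}}'
# 3. Switch the cluster network type
oc patch Network.config.openshift.io cluster --type merge --patch '{"spec":{"networkType":"OpenShiftSDN"}}'
# 4. Reboot all nodes so the new CNI configuration takes effect
# 5. Unpause the MachineConfigPools once all nodes are back and Ready
oc patch mcp master --type merge --patch '{"spec":{"paused":false}}'
oc patch mcp worker --type merge --patch '{"spec":{"paused":false}}'
# 6. Clear the migration field and remove leftover OVN-Kubernetes resources
oc patch Network.operator.openshift.io cluster --type merge --patch '{"spec":{"migration":null}}'
oc delete namespace openshift-ovn-kubernetes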
Actual results:
The rollback does not complete: the openshift-sdn controller and node pods crash-loop with "cluster IP: 10.128.0.0 conflicts with host network" errors, nodes never get an allocated subnet, and the network, ingress, authentication and machine-config cluster operators remain degraded.
Expected results:
The rollback to OpenShiftSDN completes, the openshift-sdn pods start without CIDR conflict errors, and all cluster operators return to Available=True.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
- internal CI failure
- customer issue / SD
- internal Red Hat testing failure
If it is an internal Red Hat testing failure:
- Please share a kubeconfig or credentials to a live cluster for the assignee to debug/troubleshoot, along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).
If it is a CI failure:
- Did it happen in different CI lanes? If so, please provide links to multiple failures with the same error instance
- Did it happen in both sdn and ovn jobs? If so, please provide links to multiple failures with the same error instance
- Did it happen on other platforms (e.g. AWS, Azure, GCP, baremetal, etc.)? If so, please provide links to multiple failures with the same error instance
- When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
- If it's a connectivity issue:
- What are the srcNode, srcIP, srcNamespace and srcPodName?
- What are the dstNode, dstIP, dstNamespace and dstPodName?
- What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2node, etc.) Example commands for collecting these fields are shown below.
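One way to collect the src/dst fields above (namespace names are placeholders), for example:
oc get pods -n <srcNamespace> -o wide    # shows srcPodName, srcIP and srcNode
oc get pods -n <dstNamespace> -o wide    # shows dstPodName, dstIP and dstNode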
If it is a customer / SD issue:
- Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
- Don't presume that Engineering has access to Salesforce.
- Do presume that Engineering will access attachments through supportshell.
- Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
- Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
- If the issue is in a customer namespace then provide a namespace inspect.
- If it is a connectivity issue:
- What are the srcNode, srcNamespace, srcPodName and srcPodIP?
- What are the dstNode, dstNamespace, dstPodName and dstPodIP?
- What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2node, etc.)
- Please provide the UTC timestamp of the networking outage window from the must-gather
- Please provide tcpdump pcaps taken during the outage, filtered on the src/dst IPs provided above (example collection commands follow this checklist)
- If it is not a connectivity issue:
- Describe the steps taken so far to analyze the logs from the networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the component where the issue was seen, based on the attached must-gather. Please attach snippets of the relevant logs from around the window when the problem happened, if any.
- When showing the results from commands, include the entire command in the output.
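Example collection commands for the items above (a sketch only; the node name, namespace and IPs are placeholders):
oc adm must-gather                                         # full cluster must-gather
oc adm inspect ns/<customer-namespace>                     # namespace inspect for issues in a customer namespace
oc debug node/<node-name> -- chroot /host tcpdump -nn -i any host <srcIP> and host <dstIP> -w /var/tmp/outage.pcap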
- For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
- For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
- Do not set the priority; that is owned by Engineering and will be set when the bug is evaluated
- Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"
- For guidance on using this template, please see OCPBUGS Template Training for Networking components
Is impacted by: CORENET-658 Support for Ansible playbook for offline migration (status: In Progress)