- Bug
- Resolution: Done
- Major
- None
- 4.12
- Moderate
- None
- Rejected
- False
Description of problem:
After node reboot some pods on the rebooted node fail to start:

oc describe po -n openshift-kube-controller-manager kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com
...
Events:
  Type     Reason                  Age  From          Message
  ----     ------                  ---- ----          -------
  Warning  ErrorAddingLogicalPort  41m  controlplane  deleteLogicalPort failed for pod openshift-kube-controller-manager_kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com: cannot delete GR SNAT for pod openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com: failed to delete SNAT rule for pod on gateway router GR_master-1.kni-qe-31.lab.eng.rdu2.redhat.com: error in transact with ops [{Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {49c251e2-d559-49b1-ad66-56e2f95f3c4e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {ad95fc45-9360-4203-8ec5-95d79367dca1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {61cd5b51-4b35-4808-bb8b-fda76212340b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {7053c099-d7e1-4f55-955d-6cb36f82091e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:nat Mutator:delete Value:{GoSet:[{GoUUID:49c251e2-d559-49b1-ad66-56e2f95f3c4e} {GoUUID:ad95fc45-9360-4203-8ec5-95d79367dca1} {GoUUID:61cd5b51-4b35-4808-bb8b-fda76212340b} {GoUUID:7053c099-d7e1-4f55-955d-6cb36f82091e}]}}] Timeout:<nil> Where:[where column _uuid == {bb579280-ea53-4d11-8c4f-d9fc6702314b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete NAT row ad95fc45-9360-4203-8ec5-95d79367dca1 because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete NAT row ad95fc45-9360-4203-8ec5-95d79367dca1 because of 1 remaining reference(s)
and after some time a new error appears:

  Warning  FailedCreatePodSandBox  38m  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com_openshift-kube-controller-manager_cbc6c67a-3b3a-4441-ae7c-f433a9b56895_0(f580c2844bd2bc42ce314871f38ca69bc642b80714b2c10bc0d212cfed75bf02): error adding pod openshift-kube-controller-manager_kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com/cbc6c67a-3b3a-4441-ae7c-f433a9b56895:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com f580c2844bd2bc42ce314871f38ca69bc642b80714b2c10bc0d212cfed75bf02] [openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com f580c2844bd2bc42ce314871f38ca69bc642b80714b2c10bc0d212cfed75bf02] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded '
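For reference, the leftover NAT reference reported by the transaction can likely be confirmed directly in the OVN northbound database. This is a minimal diagnostic sketch, assuming the centralized ovnkube-master pod layout of this release (the exact ovnkube-master pod name must be taken from the cluster; container and router names are from the events above):

# find the ovnkube-master pod that holds the NB database
oc -n openshift-ovn-kubernetes get pods -o wide | grep ovnkube-master

# list the NAT rules still attached to the node's gateway router
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c nbdb -- \
  ovn-nbctl lr-nat-list GR_master-1.kni-qe-31.lab.eng.rdu2.redhat.com

# inspect the NAT row the transaction could not delete, then check which
# Logical_Router rows still reference it in their nat column
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c nbdb -- \
  ovn-nbctl list NAT ad95fc45-9360-4203-8ec5-95d79367dca1
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c nbdb -- \
  ovn-nbctl --columns=_uuid,name,nat find Logical_Router | grep -B2 ad95fc45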
Version-Release number of selected component (if applicable):
4.12.0-rc.0 (updated from 4.12.0-ec.5)
How reproducible:
So far reproduced once, on the first attempt to perform the update.
Steps to Reproduce:
1. Cordon and drain the node
2. Reboot the node
3. Check the pods scheduled on the rebooted node after the node is back online (see the command sketch below)
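A possible way to run these steps from the CLI (the node name is illustrative, taken from the pod events above; the reboot can also be done via BMC or console):

oc adm cordon master-1.kni-qe-31.lab.eng.rdu2.redhat.com
oc adm drain master-1.kni-qe-31.lab.eng.rdu2.redhat.com --ignore-daemonsets --delete-emptydir-data
# reboot the node from a debug shell
oc debug node/master-1.kni-qe-31.lab.eng.rdu2.redhat.com -- chroot /host systemctl reboot
# once the node is Ready again, uncordon it and check the pods scheduled on it
oc adm uncordon master-1.kni-qe-31.lab.eng.rdu2.redhat.com
oc get pods -A -o wide --field-selector spec.nodeName=master-1.kni-qe-31.lab.eng.rdu2.redhat.com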
Actual results:
Some pods fail to start on the rebooted node
Expected results:
All pods start on the rebooted node
Additional info:
Bare-metal dual-stack cluster with schedulable masters and 2 workers (in another network), deployed with the on-premise Assisted Installer