OpenShift Bugs / OCPBUGS-3676

After node reboot some pods fail to start - deleteLogicalPort failed for pod: cannot delete GR SNAT for pod


    • Moderate
    • Rejected
    • 11/28: Green as there's a fix targeted to 4.12; waiting on CI to pass.
    • 11/28: Added to the 4.12 gating list.

      Description of problem:

      After a node reboot, some pods on the rebooted node fail to start:
      oc describe po -n openshift-kube-controller-manager kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com
      ...
      Events:
        Type     Reason                  Age   From          Message
        ----     ------                  ----  ----          -------
        Warning  ErrorAddingLogicalPort  41m   controlplane  deleteLogicalPort failed for pod openshift-kube-controller-manager_kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com: cannot delete GR SNAT for pod openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com: failed to delete SNAT rule for pod on gateway router GR_master-1.kni-qe-31.lab.eng.rdu2.redhat.com: error in transact with ops [{Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {49c251e2-d559-49b1-ad66-56e2f95f3c4e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {ad95fc45-9360-4203-8ec5-95d79367dca1}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {61cd5b51-4b35-4808-bb8b-fda76212340b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:NAT Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {7053c099-d7e1-4f55-955d-6cb36f82091e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Router Row:map[] Rows:[] Columns:[] Mutations:[{Column:nat Mutator:delete Value:{GoSet:[{GoUUID:49c251e2-d559-49b1-ad66-56e2f95f3c4e} {GoUUID:ad95fc45-9360-4203-8ec5-95d79367dca1} {GoUUID:61cd5b51-4b35-4808-bb8b-fda76212340b} {GoUUID:7053c099-d7e1-4f55-955d-6cb36f82091e}]}}] Timeout:<nil> Where:[where column _uuid == {bb579280-ea53-4d11-8c4f-d9fc6702314b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID:} Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete NAT row ad95fc45-9360-4203-8ec5-95d79367dca1 because of 1 remaining reference(s) UUID:{GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete NAT row ad95fc45-9360-4203-8ec5-95d79367dca1 because of 1 remaining reference(s)
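
      The referential integrity violation means one of the NAT rows ovn-kubernetes tries to remove is still referenced by a Logical_Router row in the OVN northbound database. For triage, the leftover row and its remaining reference can be inspected from the NB database container. The commands below are only a sketch: the ovnkube-master pod name is a placeholder, and the nbdb container name assumes the usual openshift-ovn-kubernetes layout in 4.12.

      # Placeholder pod name: any ovnkube-master replica works, e.g. the one co-located with the affected node.
      oc -n openshift-ovn-kubernetes exec ovnkube-master-<xxxxx> -c nbdb -- \
          ovn-nbctl lr-nat-list GR_master-1.kni-qe-31.lab.eng.rdu2.redhat.com
      # Dump the NAT row that could not be deleted (UUID taken from the event above) ...
      oc -n openshift-ovn-kubernetes exec ovnkube-master-<xxxxx> -c nbdb -- \
          ovn-nbctl list NAT ad95fc45-9360-4203-8ec5-95d79367dca1
      # ... and list which logical routers still hold a reference to it in their nat column.
      oc -n openshift-ovn-kubernetes exec ovnkube-master-<xxxxx> -c nbdb -- \
          ovn-nbctl --columns=name,nat find Logical_Router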

      After some time, an additional error appears:

        Warning  FailedCreatePodSandBox  38m   kubelet       Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com_openshift-kube-controller-manager_cbc6c67a-3b3a-4441-ae7c-f433a9b56895_0(f580c2844bd2bc42ce314871f38ca69bc642b80714b2c10bc0d212cfed75bf02): error adding pod openshift-kube-controller-manager_kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com/cbc6c67a-3b3a-4441-ae7c-f433a9b56895:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com f580c2844bd2bc42ce314871f38ca69bc642b80714b2c10bc0d212cfed75bf02] [openshift-kube-controller-manager/kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com f580c2844bd2bc42ce314871f38ca69bc642b80714b2c10bc0d212cfed75bf02] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded'
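
      The sandbox failure looks like a follow-on symptom: the CNI add waits for the pod's k8s.ovn.org/pod-networks annotation, which ovnkube-master never writes while the SNAT cleanup above keeps failing. A quick way to confirm the annotation is missing (sketch, reusing the pod from the events; an empty result matches the "timed out waiting for annotations" error):

      oc -n openshift-kube-controller-manager get pod \
          kube-controller-manager-guard-master-1.kni-qe-31.lab.eng.rdu2.redhat.com \
          -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}'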

      Version-Release number of selected component (if applicable):

      4.12.0-rc.0 (updated from 4.12.0-ec.5)

      How reproducible:

      So far observed on the first attempt at performing the update.

      Steps to Reproduce:

      1. Cordon and drain the node
      2. Reboot the node
      3. Check the pods scheduled on the rebooted node once the node is back online (command sketch below)
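
      A hedged sketch of the commands behind these steps; <node> is a placeholder for the rebooted node's name:

      oc adm cordon <node>
      oc adm drain <node> --ignore-daemonsets --delete-emptydir-data --force
      # Reboot from the host, e.g. via oc debug (a direct ssh + reboot works as well).
      oc debug node/<node> -- chroot /host systemctl reboot
      # Once the node reports Ready again:
      oc adm uncordon <node>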
      

      Actual results:

      Some pods fail to start on the rebooted node.
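
      One way to spot the affected pods after the reboot (sketch; <node> is a placeholder):

      oc get pods -A -o wide --field-selector spec.nodeName=<node> | grep -Ev 'Running|Completed'
      # oc describe on any pod listed shows the deleteLogicalPort / GR SNAT events quoted above.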

      Expected results:

      All pods start successfully on the rebooted node.

      Additional info:

      Baremetal dual-stack cluster with schedulable masters and two workers (in a different network), deployed with the on-premise Assisted Installer.

              Jaime Caamaño Ruiz (jcaamano@redhat.com)
              Yurii Prokulevych (yprokule@redhat.com)
              Anurag Saxena