Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-174

[OVN] New pods unable to establish TCP connections and get constant timeouts causing application downtime

    XMLWordPrintable

Details

    • Critical
    • SDN Sprint 223, SDN Sprint 224
    • Hide

      None

      Show
      None

    Description

      +++ This bug was initially created as a clone of Bug #2118717 +++
      https://bugzilla.redhat.com/show_bug.cgi?id=2118717

      Description of problem:
      This BZ is a spin-off of BZ-2114945 so we can track possible issues with new TCP connections from pods failing to be created on the nodes leading to pods being unable to start or crash.

      Version-Release number of selected component (if applicable):
      OCP 4.10.24 with OVN-Kubernetes

      How reproducible:
      Periodically on the customer only so far.

      — Additional comment from Andre Costa on 2022-08-10 16:30:00 UTC —

      There are 3 must-gathers here that were gathered during the issues and after the restart of OVNk-masters which makes all these issues go away and pods start connections immediately.

      This must-gather was taken at 11 AM today when they received a report from one of the customers:

      https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/3f820e22-3d1e-4c01-b56d-e479a6cc5578

      Customer reported the issue again and this time we also got sosreport and inspect from the project.
      In the pod they get errors like this (the same we saw on the call last week with them where it seems no TCP connections entries are created at all. First we though it was DNS but even with IPs directly there were issues like this):
      -----------------
      mx-toni-dev toni-dev-build 0/1 Error 0 18m 10.195.80.253 demchdc5vvx <none> <none>

      [z0003rbj-z07@stuart ~]$ oc logs toni-dev-build
      time="2022-08-10T10:59:02Z" level=info msg="Start building app with registry type openshift"
      time="2022-08-10T10:59:02Z" level=info msg="Adding ssl certificate /etc/ssl/certs/ca-bundle.crt"
      time="2022-08-10T10:59:02Z" level=info msg="Certificate /etc/ssl/certs/ca-bundle.crt has been added successfully"
      time="2022-08-10T10:59:02Z" level=info msg="Updating docker config with registry credentials"
      time="2022-08-10T10:59:02Z" level=info msg="Docker config has been updated with registry credentials"
      time="2022-08-10T10:59:02Z" level=info msg="Downloading MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042"
      time="2022-08-10T10:59:32Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/eba21059-8896-4cb7-8971-4d61f6756273/71042\": proxyconnect tcp: dial tcp: i/o timeout"
      -----------------------

      This keeps happening if they continue to run the builds which they did and created the MG and sosreport:

      https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/d10d9646-5437-4806-b34f-3acab4d83266

      https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/1e9e0575-66b0-4184-a3d4-e84fa245c9a2

      And like we have seen so far restarting the ovnk-master pods makes these connections work immediately again:

      https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/e03ee9c6-2ce4-40d3-a484-84da61607de5

      https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/652ebe47-b71f-4d63-8499-cf989b3ebfc7

      — Additional comment from Andre Costa on 2022-08-10 16:30:43 UTC —

      — Additional comment from Tim Rozet on 2022-08-10 22:36:06 UTC —

      Thanks for the must gathers. From Flavio and I examining them, there is definitely a bug here in ovn-kube. The toni-dev-build pod is deleted/recreated multiple times, and during this time it moves to different nodes. However due to a bug in OVNK, this port is updated with the new ip address and information as if it was moving to the new node, but stays on the previous logical switch. So for example, this is what happens:

      1. The pod is originally assigned to node demchdc6zax. This node's cluster subnet is 10.195.79.0/24:

      2022-08-09T09:22:57.078688727+00:00 stderr F I0809 09:22:57.078632 2239319 cni.go:248] [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1] ADD finished CNI request [mx-toni-dev/toni-dev-build b1a4fb0be20ff717f85fd0fffab4fb303bbcb0f8b68aced4852fb7a2465d2df1], result "{\"interfaces\":[

      {\"name\":\"b1a4fb0be20ff71\",\"mac\":\"a6:68:38:ad:66:c8\"}

      ,

      {\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:4f:52\",\"sandbox\":\"/var/run/netns/e354d2d5-83cb-406f-a2d9-c5f3e786bae4\"}

      ],\"ips\":[

      {\"version\":\"4\",\"interface\":1,\"address\":\"10.195.79.82/24\",\"gateway\":\"10.195.79.1\"}

      ],\"dns\":{}}", err <nil

      2. Over time this pod is completed, deleted, recreated many times. Until eventually it lands on demchdc5vvx the next day:
      2022-08-10T08:43:58.428759111Z I0810 08:43:58.428719 1837017 cni.go:248] [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a] ADD finished CNI request [mx-toni-dev/toni-dev-build 5d4f195cbda5269e5451593987be9d69ea828ee549bc447a5bbe50db847c182a], result "{\"interfaces\":[

      {\"name\":\"5d4f195cbda5269\",\"mac\":\"a2:22:6c:1d:43:c1\"}

      ,

      {\"name\":\"eth0\",\"mac\":\"0a:58:0a:c3:50:1e\",\"sandbox\":\"/var/run/netns/bae8f77a-b368-4b3a-86dc-df925330fa26\"}

      ],\"ips\":[

      {\"version\":\"4\",\"interface\":1,\"address\":\"10.195.80.30/24\",\"gateway\":\"10.195.80.1\"}

      ],\"dns\":{}}", err <nil>

      3. Although it lands on a new node, in OVNK we update the old port (somehow the old port is not being removed) that is attached to the old switch:
      [root@fedora ~]# ovn-nbctl list logical_switch_port c345cc07-8a89-4e70-beff-d8d9f4dac46a
      _uuid : c345cc07-8a89-4e70-beff-d8d9f4dac46a
      addresses : ["0a:58:0a:c3:50:1e 10.195.80.30"]
      dhcpv4_options : []
      dhcpv6_options : []
      dynamic_addresses : []
      enabled : []
      external_ids :

      {namespace=mx-toni-dev, pod="true"}

      ha_chassis_group : []
      name : mx-toni-dev_toni-dev-build
      options :

      {iface-id-ver="655cc043-53d5-4550-9c0a-74095a0fbde4", requested-chassis=demchdc5vvx}

      parent_name : []
      port_security : ["0a:58:0a:c3:50:1e 10.195.80.30"]
      tag : []
      tag_request : []
      type : ""
      up : false

      [root@fedora ~]# ovn-nbctl lsp-list demchdc6zax | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
      c345cc07-8a89-4e70-beff-d8d9f4dac46a (mx-toni-dev_toni-dev-build)
      [root@fedora ~]# ovn-nbctl lsp-list demchdc5vvx | grep c345cc07-8a89-4e70-beff-d8d9f4dac46a
      [root@fedora ~]#

      This will cause the pod not to be able to send any traffic as its IP is in the wrong subnet for this switch.

      4. Additionally the default node SNAT for this pod is in the right place:
      [root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc6zax | grep 10.195.80.30
      [root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5vvx | grep 10.195.80.30
      snat 139.25.144.25 10.195.80.30

      5. But there is no egress IP reroute or SNAT entry for this pod:
      Egress IP:
      status:
      items:

      • egressIP: 139.25.144.72
        node: demchdc5z6x
        name: egress-mx-toni-dev

      [root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 10.195.80.30
      [root@fedora ~]# ovn-nbctl lr-nat-list GR_demchdc5z6x | grep 139.25.144.72
      snat 139.25.144.72 10.195.77.156
      snat 139.25.144.72 10.195.80.40
      snat 139.25.144.72 10.195.76.65
      snat 139.25.144.72 10.195.76.184
      snat 139.25.144.72 10.195.80.42
      snat 139.25.144.72 10.195.80.156

      6. We see in the ovnkube-master logs that ovnk attempts to delete this pod, but it fails because we try to delete a logical switch port that is still bound to the wrong logical switch:
      2022-08-10T09:07:32.033303027Z I0810 09:07:32.033270 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.80.30]}}] Timeout:<nil> Where:[where column _uuid ==

      {4fd75aab-222c-4f17-9407-3764f9b6760b}

      ] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[

      {GoUUID:c345cc07-8a89-4e70-beff-d8d9f4dac46a}]}}] Timeout:<nil> Where:[where column _uuid == {f5073eaa-3f72-4ec2-94c3-3744d412864a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {c345cc07-8a89-4e70-beff-d8d9f4dac46a}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]"

      2022-08-10T09:07:32.037857343Z I0810 09:07:32.037163 1 pods_retry.go:57] [655cc043-53d5-4550-9c0a-74095a0fbde4/mx-toni-dev/toni-dev-build] teardown retry failed; will try again later

      From the engineering side we need to try to reproduce this. Flavio and I think this toni-dev-build must be a stateful set running to completion. We need to investigate further to understand how we ended up trying to add the new pod without first deleting the old. Could be a bug related to completed pods logic or retry logic that was added in 4.10.

      — Additional comment from Anurag saxena on 2022-08-11 04:52:37 UTC —

      Thanks @trozet@redhat.com for investigation. @huirwang@redhat.com @jechen@redhat.com Probably we can follow these steps to internally repro the issue.

      — Additional comment from Andre Costa on 2022-08-11 15:09:25 UTC —

      Hi all,

      Many thanks for the great work here and customer is relieved that we found something

      In the meantime a new example that looks to be another caused by this bug. We have seen this live last week and today the customer saw it again on another project:

      - Similar build pod started by their operator:

      1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.185]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:1 Error: Details: UUID:

      {GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

      {GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
      1h21m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
      Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vux
      1h21m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.81.116/24] from ovn-kubernetes
      1h21m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
      1h21m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 622.636779ms
      1h21m Normal Created pod/ngat-dev-build Created container mendix-build
      1h21m Normal Started pod/ngat-dev-build Started container mendix-build
      1h20m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.81.116]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {95072525-7d91-4f1f-9ad8-38a4171b978e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:1 Error: Details: UUID:

      {GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

      {GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
      1h2m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
      Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc5vvx
      1h2m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.80.33/24] from ovn-kubernetes
      1h2m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
      1h2m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 599.594895ms
      1h2m Normal Created pod/ngat-dev-build Created container mendix-build
      1h2m Normal Started pod/ngat-dev-build Started container mendix-build
      Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
      28m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
      27m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
      27m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
      27m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 575.907145ms
      27m Normal Created pod/ngat-dev-build Created container mendix-build
      27m Normal Started pod/ngat-dev-build Started container mendix-build
      26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:1 Error: Details: UUID:

      {GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

      {GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
      26m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
      Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
      26m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
      26m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
      26m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 606.11715ms
      26m Normal Created pod/ngat-dev-build Created container mendix-build
      26m Normal Started pod/ngat-dev-build Started container mendix-build
      26m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:1 Error: Details: UUID:

      {GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

      {GoUUID:} Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)
      22m Warning ErrorAddingLogicalPort pod/ngat-dev-build failed to ensurePod mx-ngat-dev/ngat-dev-build since it is not yet scheduled
      Unknown Normal Scheduled pod/ngat-dev-build Successfully assigned mx-ngat-dev/ngat-dev-build to demchdc6z2x
      22m Normal AddedInterface pod/ngat-dev-build Add eth0 [10.195.76.114/24] from ovn-kubernetes
      22m Normal Pulling pod/ngat-dev-build Pulling image "private-cloud.registry.mendix.com/image-builder:2.2.0"
      22m Normal Pulled pod/ngat-dev-build Successfully pulled image "private-cloud.registry.mendix.com/image-builder:2.2.0" in 534.430683ms
      22m Normal Created pod/ngat-dev-build Created container mendix-build
      22m Normal Started pod/ngat-dev-build Started container mendix-build
      21m Warning ErrorAddingLogicalPort pod/ngat-dev-build deleteLogicalPort failed for pod mx-ngat-dev_ngat-dev-build: cannot delete logical switch port mx-ngat-dev_ngat-dev-build, error in transact with ops [{Op:mutate Table:Address_Set Row:map[] Rows:[] Columns:[] Mutations:[{Column:addresses Mutator:delete Value:{GoSet:[10.195.76.114]}}] Timeout:<nil> Where:[where column _uuid == {961a2f8f-14bb-44f9-a945-0dc2211324c7}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:ports Mutator:delete Value:{GoSet:[{GoUUID:d300fdb7-d337-4c64-8e31-7ff02889d9fb}]}}] Timeout:<nil> Where:[where column _uuid == {bc92e86b-a1ae-4771-8e59-0588657bf70e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:} {Op:delete Table:Logical_Switch_Port Row:map[] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {d300fdb7-d337-4c64-8e31-7ff02889d9fb}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}] results [{Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:1 Error: Details: UUID:

      {GoUUID:} Rows:[]} {Count:1 Error: Details: UUID:{GoUUID}

      Rows:[]} {Count:0 Error:referential integrity violation Details:cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s) UUID:

      {GoUUID:}

      Rows:[]}] and errors []: referential integrity violation: cannot delete Logical_Switch_Port row d300fdb7-d337-4c64-8e31-7ff02889d9fb because of 1 remaining reference(s)

      • Eventually the pod seems to start but then fails again with the same TCP timeout:

      2022-08-11T08:26:24.264204783Z time="2022-08-11T08:26:24Z" level=error msg="Failed to build mendix app, failed to create application layer failed to download MDA from https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268, Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout: Get \"https://privatecloud.mendixcloud.com/rest/mdarepository/v1/download/ef924516-7121-4372-a306-b0055c445766/71268\": proxyconnect tcp: dial tcp: i/o timeout"

      • Customer took a sosreport from the node where the pod terminated in case is worth it to check:

      https://attachments.access.redhat.com/hydra/rest/cases/03096770/attachments/bd3f86b2-6d2a-44f7-bb0f-eec85630586d?usePresignedUrl=true

      — Additional comment from Tim Rozet on 2022-08-11 21:19:12 UTC —

      We were able to reproduce the issue locally. There are two potential paths that can cause a stale logical switch port to be re-used on the wrong node:
      Scenario 1: pod is created, deleted, and recreated on another node extremely quickly. This plays out like this:
      Events:
      1. pod toni is created on node A
      2. pod toni is deleted on node A
      3. pod toni is recreated on node B

      What happens in ovnk:
      1. pod toni is created on node B
      2. pod toni is deleted on node A (fails)

      This happens because we grab the latest version of the pod to add in event 1, which by the time we grab it is actually the value from event 3. Then after processing event 1, we move to event 2 and when we delete the pod, we use what was given to us in the event, which is now incorrect to remove the pod from node A, because it actually is on node B. The fix is to ignore the value in the pod spec on deletion, and use what we store internally in our internal port cache. If there is no entry in the cache, we search OVN NBDB to find the right switch (this is an expensive operation so we want to avoid it when possible).

      Flavio is working on a fix for this.

      Scenario 2: pod is created, runs to completion, is deleted very quickly, and then is recreated on another node:
      Events:
      1. pod toni is created on node A
      2. pod toni runs to completion
      3. pod toni immediately is deleted
      4. pod toni is recreated on node B

      What happens in ovnk:
      1. pod toni is created on node A
      2. completion causes an update event, ovnk does not delete the pod (bug)
      3. deletion event is processed, but since it is a completed pod, we ignore it (since we should have already delted it in step 2)
      4. pod toni is recreated on node B, ovnk sees there is already an switch port for toni on the wrong switch, and just updates that with the new information

      This happens when a pod goes to completed, but is deleted very quickly. In step 2 during update event we try to grab the latest version of the pod, but it doesn't exist anymore since it was deleted. In this case we skip the update, instead of tearing down the pod. The delete code in step 3 assumes that if the pod is completed we must have already handled it in update, so the stale port stays around until an add later re-uses it. Patryk already has a fix for this:

      https://github.com/ovn-org/ovn-kubernetes/pull/3071

      There may still be more egress IP issues to investigate (tracked in other bugs) after fixing this, and we will look into those after fixing these fundamental pod issues.

      — Additional comment from Tim Rozet on 2022-08-11 22:19:30 UTC —

      Andre, re: comment 5. This looks like the same issue. If you have a must gather we can confirm. I think the sosreport from the worker node is not enough. The referential integrity violation occurs because we attempt to do these operations:
      1. remove the logical switch port from the logical switch
      2. delete the logical switch port

      In this case:
      1. we remove the logical switch port from the logical switch, but the switch port actually exists on a different switch (node) - this is a noop
      2. we try to delete the logical switch port- NBDB complains this is an violation and refuses to delete it, because there is another switch that holds a reference to this object

      — Additional comment from Anurag saxena on 2022-08-12 21:16:56 UTC —

      @rbrattai@redhat.com Can you help verifying this and backports? feel free to re-assign to someone else if needed while I am on PTO. Thanks!

      Attachments

        Activity

          People

            ffernand@redhat.com Flavio Fernandes
            ffernand@redhat.com Flavio Fernandes
            Ross Brattain Ross Brattain
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: