Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-11109

[4.12] Upgrade to 4.10 stalled on timeout completing syncEgressFirewall

XMLWordPrintable

    • ?
    • Critical
    • No
    • SDN Sprint 234
    • 1
    • Rejected
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      Upgrade to 4.10 is stuck looping in syncEgressFirewall
      
      We see transacting operations with context deadline exceeded.
      
      It looks to be trying to process 2.8 million records is one go.
      
      2023-02-21T19:55:06.514097513Z I0221 19:55:06.435220 1 client.go:781] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[{Op:mutate Table:Logical_Switch Row:map[] Rows:[] Columns:[] Mutations:[{Column:acls Mutator:delete Value:{GoSet:[{GoUUID:6a3ad543-a77d-4700-83b8-5ccae6b2d067} ID:1c5297ff-8588-467a-93f4-22f22d609563} {GoUUID:f6288ed3-3928-45a8-ae57-40ed94cfa249} {GoUUID:04bf90c2-fde1-4a10-baaa-6a3f1d8e2931} {GoUUID:c6609536-857c-48ae-9125-9505753180 a8} {GoUUID:c79b4398-d7cc-4dcf-8c1d-11484f318324} {GoUUID:4323ac2c-033e-43c3-885b-e951cd7a4159} {GoUUID:7b316a80-076f-4266-b7d2-bd69b1d4b874} {GoUUID:57dfecb2-2f94-4cd8-a277-8 b28205e1048} {GoUUID:2c039f15-ff11-4ceb-aa82-bcbe82fc86d1} {GoUUID:063c4121-73c3-4d53-a89d-1063e775146b} {GoUUID:25c788e3-6146-4571-98bf-61010100a22a} {GoUUID:3d3c150f-1296-4d 91-b334-506f28bff4bd}]}}] Timeout:<nil> Where:[where column _uuid == {ba9652de-5aae-4a74-a512-29f775e38c19}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]: context deadline exceeded 
      
      2023-02-21T19:55:18.739739417Z E0221 19:55:18.643127       1 master.go:1369] Failed (will retry) in syncing syncEgressFirewall: failed to remove reject acl from node logical switches: error while removing ACLS: [6a3ad543-a77d-4700-83b8-5ccae6b2d067 8e004991-0382-455f-9901-33ef724acbc2 
      
      Everything is built into one operation via:
      https://github.com/openshift/ovn-kubernetes/blob/release-4.10/go-controller/pkg/libovsdbops/switch.go#L243
      
      TrandactAndCheck is being called with a 10s timeout and this operation never completes. 
       

      Version-Release number of selected component (if applicable):

      4.10.50

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

      Upgrade completes

      Additional info:

       

            npinaeva@redhat.com Nadia Pinaeva
            rhn-support-mrobson Matt Robson
            Huiran Wang Huiran Wang
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: