  1. OpenShift Bugs
  2. OCPBUGS-32250

Router doesn't resync after contention period

    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.13, 4.12, 4.14, 4.15
    • Networking / router
    • Important
    • No
    • Rejected
    • False

      Description of problem:

      Version-Release number of selected component (if applicable):

      After a period of contention, the openshift-router neither updates the route status nor re-admits the route.
      
      The openshift-router has a default resync period of 30 minutes. When routers disagree and post conflicting status on a route, they enter a contention state and back off from writing updates, which avoids an infinite loop of status updates.
      
      There will always be one "winner" router pod and N "loser" router pods. The "winner" is the pod that successfully wrote the last status update before the routers backed off.
      
      However, I noticed that if I resolve the contention by deleting the "winner" and leaving 1 loser, the loser never updates the route status, even after the router's resync is triggered.
      
      While debugging the code, I found that the router has a plugin chain that runs in this order:
      1. HostAdmitter
      2. UniqueHost
      3. ExtendedValidator
      ...etc
      It appears that during a resync, the plugin chain gets stopped at UniqueHost.
      
      In UniqueHost, I found that we have custom route index code (hostindex.go). It appears that if the route doesn't "activate", it won't be passed on to the other plugins. I think "activate" means the route "changed", as determined by the comparison here (existing.ResourceVersion == route.ResourceVersion). The 30m resync doesn't trigger a route ResourceVersion change, so a resynced route is never treated as changed.
      
      It seems very odd that the UniqueHost plugin prevents all of the other plugins (including the ones that update the status) from doing their job because the route isn't "activated".

      How reproducible:

        100%  

      Steps to Reproduce:

      I wrote a script to test and expose this bug. It reduces the resync period to 1 minute, so a test run might take more than 1 minute. You can change the image to whichever router version you want to test:
          1. wget https://gist.githubusercontent.com/gcs278/949b1c5a5cabf7bb271c83f760ebf61a/raw/6d7516c6806b2961757d6ac3ea80204e9e8ceaca/router-contention-resync-test.sh     

      Actual results:

          Routes fail to resync status

      Expected results:

          Routes should resync status

      Additional info:

          

            Grant Spence (gspence@redhat.com)
            Hongan Li