OpenShift Bugs / OCPBUGS-65584

ovnkube-master does not detect new ovn-controller's PID - MicroShift startup takes ~7 minutes


    • Quality / Stability / Reliability
    • Important
    • uShift Sprint 279, uShift Sprint 280

      Description of problem:

      Status: a fix is being worked on in OVN-Kubernetes upstream.
      Recommended workaround for the time being: revert and lock the ovnk image for MicroShift rebases, so that ART's rebases do not overwrite it with the faulty ref.
      
      
      The most recent ovn-k image is causing problems for MicroShift.
      
      The root cause is this commit: https://github.com/openshift/ovn-kubernetes/commit/2871cdae138b10a282a3a930c5f5c516f9c523bc
      
      The linked commit changed the code to use `RunOVNControllerAppCtl`, which has one problem: the PID file is read only once, and the same PID is then reused for all 200 retries (one every 2 seconds, so almost 7 minutes pass before it gives up).
      
      For some reason, on MicroShift, ovn-controller always fails to start cleanly on first boot. After it restarts, it creates a new socket with the new PID in its filename.
      Because ovnkube-master read the PID file only once, it does not notice that the PID changed and that there is a new socket to use.
      
      Because it keeps trying a socket that still exists (ovn-controller did not delete it) but has nothing listening on the other end, it continuously gets a "Connection refused" error.
      
      MicroShift becomes healthy again after ovnkube-master runs out of retries, quits, and restarts (so in ~7 minutes).
      
      
      

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Every time    

      Steps to Reproduce:

          1. Deploy MicroShift main
          2. Observe ovn-k pods & logs
         
          

      Actual results:

          ovn-k takes a long time (about 7 minutes) to start, which prevents all other (non-hostnetwork) Pods from starting

      Expected results:

          ovnk sees that the PID was updated and switches to the correct socket

      Additional info:

      See that ovnkube-master restarted 7 minutes after ovnkube-node:
      # oc get pods -n openshift-ovn-kubernetes
      NAME                   READY   STATUS    RESTARTS      AGE
      ovnkube-master-l2d7x   4/4     Running   1 (11m ago)   18m
      ovnkube-node-r8m7x     1/1     Running   1 (18m ago)   18m
      
      
      
      ~7-minute log of ovnkube-master continuously reading the same socket file: https://drive.google.com/file/d/18MLEV2HkcJClWVn0Dq8Z5X0p5e3Hy_0R/view?usp=sharing
      
      SOS report: https://drive.google.com/file/d/1omPiTfTpZ3oCpiYN4MaPELQoc_V-RE3W/view?usp=sharing 

              pmatusza@redhat.com Patryk Matuszak
              John George