OpenShift Bugs / OCPBUGS-1620

Bug 2076297 - Router process ignores shutdown signal while starting up


    • Type: Bug
    • Priority: Major
    • Resolution: Done
    • Component: Networking / router
    • 4.9
    • Important
    • Sprint 225, Sprint 226
      Cause: The openshift-router process ignored the SIGTERM shutdown signal for a brief period while it was starting up.

      Consequence: If a SIGTERM was sent while the router process was starting up, the signal was ignored. The container therefore ignored a Kubernetes shutdown request, and shutting it down took 1 hour (terminationGracePeriodSeconds).

      Fix: Propagate the SIGTERM handler in the Go code to the cache initialization function.

      Result: The router now responds to SIGTERM signals during its initialization.
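
      A minimal sketch of the fix pattern (hypothetical, not the actual openshift-router source): the SIGTERM handler is wired into a context before any blocking startup work, and that context is passed into the cache initialization so a signal received during startup cancels the wait. The function name startCaches and the sleep are stand-ins.

      package main

      import (
          "context"
          "log"
          "os/signal"
          "syscall"
          "time"
      )

      // startCaches stands in for the router's cache initialization. Because it
      // selects on ctx.Done(), a SIGTERM delivered while caches are still syncing
      // aborts the wait instead of being ignored.
      func startCaches(ctx context.Context) error {
          synced := make(chan struct{})
          go func() {
              time.Sleep(30 * time.Second) // simulate a slow informer cache sync
              close(synced)
          }()
          select {
          case <-ctx.Done():
              return ctx.Err()
          case <-synced:
              return nil
          }
      }

      func main() {
          // Register the shutdown signal handler first, before any blocking startup work.
          ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
          defer stop()

          if err := startCaches(ctx); err != nil {
              log.Printf("startup aborted: %v", err)
              return
          }
          log.Println("caches synced; serving until SIGTERM")
          <-ctx.Done()
      }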

      Manually creating for 4.9 backport

      Description of problem:
      For a brief window while the openshift-router binary is starting up, it ignores shutdown signals (SIGTERM) and never shuts down in response to them.

      This becomes a larger issue when Kubernetes sends a graceful shutdown request while the router is starting up and then waits out terminationGracePeriodSeconds, which is set to 1 hour in the router deployment.

      This becomes even more of an issue with
      https://github.com/openshift/cluster-ingress-operator/pull/724
      which makes the ingress controller wait for all of its pods before deleting itself. If those pods are stuck in Terminating for an hour, the ingress controller is also stuck in Terminating for an hour.
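
      A minimal sketch of the failure mode (hypothetical, not the router's actual code): SIGTERM is captured early with signal.Notify, which suppresses Go's default terminate-on-SIGTERM behavior, but the signal channel is only read after the blocking cache sync. Any SIGTERM delivered during that window is therefore ignored, and the kubelet ends up waiting out terminationGracePeriodSeconds.

      package main

      import (
          "log"
          "os"
          "os/signal"
          "syscall"
          "time"
      )

      func waitForCacheSync() {
          // Stand-in for the router's blocking startup work (informer cache sync).
          time.Sleep(30 * time.Second)
      }

      func main() {
          shutdown := make(chan os.Signal, 1)
          // Capturing SIGTERM disables the default "exit on SIGTERM" behavior...
          signal.Notify(shutdown, syscall.SIGTERM)

          // ...but nothing consults the channel yet, so a SIGTERM sent during this
          // call has no effect until the sync completes.
          waitForCacheSync()

          <-shutdown // the process only reacts to SIGTERM once startup has finished
          log.Println("shutting down")
      }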

      OpenShift release version:

      Cluster Platform:

      How reproducible:
      You can start and stop the router pod quickly to get it stuck in an hour-long Terminating state.

      Steps to Reproduce (in detail):
      1. Create a YAML file with the following content:

      apiVersion: v1
      items:
      - apiVersion: operator.openshift.io/v1
        kind: IngressController
        metadata:
          name: loadbalancer
          namespace: openshift-ingress-operator
        spec:
          replicas: 1
          routeSelector:
            matchLabels:
              type: loadbalancer
          endpointPublishingStrategy:
            type: LoadBalancerService
          nodePlacement:
            nodeSelector:
              matchLabels:
                node-role.kubernetes.io/worker: ""
        status: {}
      kind: List
      metadata:
        resourceVersion: ""
        selfLink: ""

      2. Run the following command:

      oc apply -f <YAML_FILE>.yaml && while ! oc get pod -n openshift-ingress | grep -q router-loadbalancer; do echo "Waiting"; done; oc delete pod -n openshift-ingress $(oc get pod -n openshift-ingress --no-headers | grep router-loadbalancer | awk '{print $1}');

      It is considered a failure if the command hangs for more than 45 seconds. You can Ctrl-C after it deletes the pod and run "oc get pods -n openshift-ingress" to see that the pod is stuck in a Terminating state with an AGE longer than 45 seconds.

      The pod will take 1 hour to terminate, but you can always clean up by force deleting it.

      Actual results:
      Pod takes 1 hour to be deleted.

      Expected results:
      Pod should be deleted in about 45 seconds.

      Impact of the problem:
      Router pods hang in Terminating for 1 hour, which affects the user experience.

      Additional info:
       

              Grant Spence (gspence@redhat.com)
              Shudi Li