Red Hat OpenStack Services on OpenShift / OSPRH-8117

Bad timing of OVNDbCluster termination during db initialization may leave an empty db file on disc and block service startup

    • Bug
    • Resolution: Done-Errata
    • Blocker
    • rhos-18.0.0
    • None
    • ovn-operator
    • None
    • Important

      OSPCIX-342 revealed that the service fails to start if the db file is present on disk but empty. The empty file is left behind when the previous start of the pod is interrupted in the middle of db file initialization. In this case, the interruption was triggered by a configuration change.

      The empty file is left behind because SIGTERM is not propagated to ovsdb-tool, which creates the file. Instead, SIGTERM is delivered only to the shell script, which exits before ovsdb-tool has finished. The tool is then killed with SIGKILL, leaving the file in an inconsistent state.
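
The signal handling described above can be illustrated with a minimal shell sketch. It is not the operator's actual start script; the function name `run_forwarding_term` is hypothetical. It runs a command in the background, forwards SIGTERM to it, and waits until the command has really exited before returning.

```shell
# Illustrative sketch only: run_forwarding_term is a hypothetical helper,
# not part of ovn-operator or Open vSwitch.
run_forwarding_term() {
    # Run the command in the background: a shell only runs trap handlers
    # once the current foreground command has returned.
    "$@" &
    local child=$!
    local rc=0

    # On SIGTERM, pass the signal on to the child instead of exiting at once.
    # The PID is expanded now, so the handler does not depend on variable scope.
    trap "kill -TERM $child 2>/dev/null" TERM

    # wait returns early when a trapped signal arrives, so keep waiting
    # until the child is gone (kill -0 only checks that the PID exists).
    while :; do
        wait "$child"
        rc=$?
        kill -0 "$child" 2>/dev/null || break
    done

    trap - TERM
    return "$rc"
}
```

A start script could wrap the db initialization step in such a helper, so that a termination request reaches the tool and the script does not exit until the tool has finished or stopped.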

       

      There are several issues to resolve here:

       

      • When an empty db file is present, we should detect it and remove it before proceeding with configuration. (This could also be handled in ovs-lib.in in OVS, but that would require patching the Open vSwitch package.)
      • We run dumb-init with --single-child, which does not send SIGTERM to children of children, nor do we have a SIGTERM handler in the shell start script. The fix should ensure that SIGTERM reaches ovsdb-tool; removing --single-child achieves this. This will be addressed in https://issues.redhat.com/browse/OSPRH-8212
      • Removing --single-child should help but is not enough, because dumb-init does not wait for children of children to exit (only for the main child). The script should therefore wait for ovsdb-tool to complete. This will be addressed in https://issues.redhat.com/browse/OSPRH-8212
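
The first item, detecting and removing an empty db file, might look roughly like the sketch below. It assumes that "empty" means a zero-length regular file; the helper name `remove_empty_db` is hypothetical, and the real path of the db file is not taken from this issue.

```shell
# Illustrative sketch only: remove_empty_db is a hypothetical helper.
# It deletes a zero-length database file left behind by an interrupted
# initialization, so that the file can be created again from scratch.
remove_empty_db() {
    db_file=$1
    # -f: the path is a regular file; ! -s: its size is zero bytes.
    if [ -f "$db_file" ] && [ ! -s "$db_file" ]; then
        echo "Removing empty database file: $db_file" >&2
        rm -f -- "$db_file"
    fi
}
```

The start script would call this with the path of the db file before the configuration step runs. A file that has content, or a path that does not exist, is left untouched.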

       

      A long discussion of this case is here: https://redhat-internal.slack.com/archives/C046JULBVJ7/p1719396554432979

      Some background on how child processes are handled in containers (written for Docker, but it should apply elsewhere): https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

              ihrachys Ihar Hrachyshka
              ihrachys Ihar Hrachyshka
              Maor Blaustein Maor Blaustein
              rhos-dfg-networking-squad-neutron