  1. OpenShift Bugs
  2. OCPBUGS-30807

microshift-etcd unclean shutdown can cause start failures


    • No
    • 2
    • uShift Sprint 250
    • 1
    • False


    • Previously, when `microshift-etcd` exited unexpectedly, MicroShift restarted itself in order to restart `microshift-etcd`, but a lingering unit fragment remained. Every subsequent attempt to start `microshift-etcd` failed, making the system unusable. The `--collect` flag was added to the `systemd-run` invocation used to start `microshift-etcd`. With this flag, systemd cleans up the unit fragment even if the unit failed. The system now recovers and restarts.
    • Bug Fix
    • Done

      Description of problem:

      microshift-etcd died, for unclear reasons, before systemd could move it into its cgroup.
      Perhaps the transient unit, which is supposed to either be active or not exist at all, kept existing in a failed state, so stopMicroShiftEtcdScopeIfExists()[0] didn't do its job.

       [0] https://github.com/openshift/microshift/blob/release-4.15/pkg/controllers/etcd.go#L150 


      Version-Release number of selected component (if applicable):


      How reproducible:

      Sometimes happens in 4.15 nightly.

      Steps to Reproduce:


      Actual results:

      MicroShift is unable to start.

      Expected results:

      MicroShift should be able to recover from failed unit states.

      Additional info:

      Mar 11 22:58:00 el92-src-osconfig-host1 microshift[27914]: etcd W0311 22:58:00.924933   27914 etcd.go:121] microshift-etcd process terminated prematurely, restarting MicroShift
      Mar 11 22:58:00 el92-src-osconfig-host1 systemd[1]: microshift-etcd.scope: Couldn't move process 27945 to requested cgroup '/system.slice/microshift-etcd.scope': No such process
      Mar 11 22:58:00 el92-src-osconfig-host1 systemd[1]: microshift-etcd.scope: Failed to add PIDs to scope's control group: No such process
      Mar 11 22:58:00 el92-src-osconfig-host1 systemd[1]: microshift-etcd.scope: Failed with result 'resources'.
      Mar 11 22:58:00 el92-src-osconfig-host1 systemd[1]: Failed to start /usr/bin/microshift-etcd run.
      Mar 11 22:58:03 el92-src-osconfig-host1 microshift[28086]: etcd I0311 22:58:03.959900   28086 manager.go:120] "SERVICE STARTING" service="etcd"
      Mar 11 22:58:03 el92-src-osconfig-host1 microshift[28086]: etcd I0311 22:58:03.969509   28086 etcd.go:98] starting etcd via systemd-run with args [--uid=root --scope --unit microshift-etcd --property Before=microshift.service --property BindsTo=microshift.service /usr/bin/microshift-etcd run]
      Mar 11 22:58:03 el92-src-osconfig-host1 microshift[28116]: Failed to start transient scope unit: Unit microshift-etcd.scope was already loaded or has a fragment file.
      Mar 11 22:58:03 el92-src-osconfig-host1 microshift[28086]: etcd W0311 22:58:03.978156   28086 etcd.go:115] etcd failed waiting on process to finish: exit status 1
      Mar 11 22:58:03 el92-src-osconfig-host1 microshift[28086]: etcd I0311 22:58:03.978244   28086 etcd.go:117] etcd process quit: exit status 1
      Mar 11 22:58:03 el92-src-osconfig-host1 microshift[28086]: etcd W0311 22:58:03.978301   28086 etcd.go:121] microshift-etcd process terminated prematurely, restarting MicroShift
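
      For illustration, the argument list logged at etcd.go:98 above could be extended with `--collect`, which asks systemd to garbage-collect the transient unit even when it ends in a failed state. This is a minimal sketch, not the MicroShift source: the function name `etcdSystemdRunArgs` and the position of the flag within the list are assumptions; the other arguments are copied from the log.

```go
package main

import (
	"fmt"
	"strings"
)

// etcdSystemdRunArgs rebuilds the systemd-run arguments seen in the log.
// The name is hypothetical and not taken from the MicroShift code base.
// When collect is true, --collect is added so that a failed transient
// scope does not stay loaded and block the next start attempt.
func etcdSystemdRunArgs(collect bool) []string {
	args := []string{"--uid=root", "--scope", "--unit", "microshift-etcd"}
	if collect {
		// Assumed placement: any position before the command is valid.
		args = append(args, "--collect")
	}
	return append(args,
		"--property", "Before=microshift.service",
		"--property", "BindsTo=microshift.service",
		"/usr/bin/microshift-etcd", "run",
	)
}

func main() {
	// Print the command line only; nothing is executed here.
	fmt.Println("systemd-run " + strings.Join(etcdSystemdRunArgs(true), " "))
}
```

      Without the flag, the failed microshift-etcd.scope stays loaded, which matches the "was already loaded or has a fragment file" error at 22:58:03.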


            pmatusza@redhat.com Patryk Matuszak
            eslutsky Evgeny Slutsky
            Douglas Hensel Douglas Hensel
            Shauna Diaz Shauna Diaz
