-
Feature
-
Resolution: Done
-
Major
-
None
Feature Overview (aka. Goal Summary)
MicroShift can recover from manual backups if startup fails, e.g. due to a corrupt etcd database or human error.
Goals (aka. expected user outcomes)
The goal of this feature is to provide an additional layer of protection and robustness for edge devices, especially against sudden loss of power or user errors. Users/admins can create manual backups using `microshift backup` (e.g. with a daily cron job). They can then point MicroShift to the folder containing those backups. If startup fails, MicroShift will restore the backups one at a time (newest first) and try to start with each.
Requirements (aka. Acceptance Criteria):
- Provide a way to configure "autoRecover"
- AutoRecover must ensure the backup matches the version/ostree commit (as we do with backup/restore for updates)
- Integrate with the existing backup/restore logic used during updates
- The failing config needs to be backed up, for later post-mortem analysis on why it failed.
- All steps/attempts need to be logged verbosely and explicitly, so that the sequence of events can be reconstructed easily.
- Try all available backups from newest first to oldest last.
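The requirements above can be sketched as a small selection-and-retry loop. This is a minimal illustration, not MicroShift's actual implementation: the `Backup` type, its fields, the `saveFailed` and `tryStart` callbacks, and the sample data are all hypothetical names introduced here. It assumes each backup carries a creation time and a version identifier that can be compared against the running deployment.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sort"
	"time"
)

// Backup is a hypothetical description of one manual backup on disk.
type Backup struct {
	Name    string
	Created time.Time
	Version string // version/ostree commit the backup was taken on (assumed field)
}

var errStartFailed = errors.New("startup failed")

// selectCandidates keeps only backups that match the running version
// and orders them newest first.
func selectCandidates(all []Backup, version string) []Backup {
	var out []Backup
	for _, b := range all {
		if b.Version == version {
			out = append(out, b)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Created.After(out[j].Created) })
	return out
}

// autoRecover preserves the failed state once, then tries every matching
// backup from newest to oldest. tryStart stands in for "restore + start".
func autoRecover(all []Backup, version string, saveFailed func() error, tryStart func(Backup) error) (Backup, error) {
	if err := saveFailed(); err != nil {
		return Backup{}, fmt.Errorf("preserving failed data: %w", err)
	}
	candidates := selectCandidates(all, version)
	for i, b := range candidates {
		log.Printf("attempt %d/%d: restoring %s", i+1, len(candidates), b.Name)
		if err := tryStart(b); err != nil {
			log.Printf("backup %s did not start: %v", b.Name, err)
			continue
		}
		log.Printf("recovered from %s", b.Name)
		return b, nil
	}
	return Backup{}, errors.New("no usable backup found")
}

// sampleBackups returns made-up data for the demo below.
func sampleBackups() []Backup {
	day := func(d int) time.Time { return time.Date(2024, time.January, d, 0, 0, 0, 0, time.UTC) }
	return []Backup{
		{Name: "backup-1", Created: day(1), Version: "v2"},
		{Name: "backup-2", Created: day(2), Version: "v2"},
		{Name: "backup-3", Created: day(3), Version: "v2"},
		{Name: "backup-4", Created: day(4), Version: "v1"}, // wrong version, skipped
	}
}

func main() {
	saveFailed := func() error {
		log.Print("preserving failed data for post-mortem analysis")
		return nil
	}
	// Pretend the newest matching backup is itself unusable.
	tryStart := func(b Backup) error {
		if b.Name == "backup-3" {
			return errStartFailed
		}
		return nil
	}
	b, err := autoRecover(sampleBackups(), "v2", saveFailed, tryStart)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("recovered from", b.Name)
}
```

Filtering by version before sorting keeps mismatched backups out of the retry loop entirely, and logging every attempt makes the sequence of events reconstructable afterwards.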
Questions to Answer (Optional):
- How to integrate with greenboot / ostree rollbacks? A: Probably not an issue, as rollbacks / restores are triggered only during an active upgrade (when a new commit is staged).
- Do we try only the latest backup? Or, if there are multiple suitable ones, proceed with older ones? A: We should keep a list of backups that is worked through from newest to oldest.
- How to avoid any potential conflicts with the automatic backups created for updates/rollbacks? A: Probably not an issue, see Q#1.
Out of Scope
- Creating the backups - only the user/customer knows when there is a good time for this, as MicroShift needs to be stopped for the backup.
- Controlling how many backups are kept on disk - that is the duty of the user/customer.
- Deciding when to trigger a restore of the backup. The user needs to provide a script for that (it could be reused from greenboot, but might also be something else). It is the responsibility of the customer to make that decision.
Background
While there are already many layers of protection (xfs, the etcd bbolt backend incl. robustness tests), Murphy's law says something will still go wrong at some point.
For an example, see https://issues.redhat.com/browse/OCPBUGS-28380
Customer Considerations
Probably want to review design with key customers.
Documentation Considerations
Documentation in the backup/restore section needs to be augmented with a new chapter on "auto-recovery".
Interoperability Considerations
none