Overview:
A high level summary that describes the Epic in a clear, concise way. Complete during New status.
Part of ROX-32124.
One of the biggest issues with Central's availability is that, for larger deployments, database migrations lead to unacceptably long API downtime. This blocks customers' CI and/or prevents them from putting ACS scans into critical build paths.
This Epic should address that issue by making sure old Central pods can keep running while a new version rolls out and performs database migrations.
Requirements:
A list of specific needs or objectives that an epic must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Technical Scope:
High-level list of items that are in scope; usually completed by a staff engineer or a lead from the Feature Delivery Team. Initial completion during Refinement status.
- Central's deployment strategy needs to change from Recreate to RollingUpdate
- The DB migration process needs to account for other processes starting migrations at the same time, and must prevent conflicts between them
- Background workers need to be aware that multiple Central instances can be running during the upgrade period, so that they don't conflict or do the same work twice
- DB migration and review guidelines should be updated
  - Verify we don't exclusively lock tables
  - Verify the migrations are implemented in a way that they can handle new data being added or existing data being modified or deleted during migration
- There have to be CI tests to ensure:
  - DB schema is backwards compatible (we already have this)
  - Old Central can keep running while new Central is in migration without major conflicts or introducing data inconsistencies
- Addressing the backfill / index creation migrations issue
  - To prevent a Central RollingUpdate from being perceived as a stuck rollout, migrations should finish in a reasonable time (under 10 minutes, possibly even less)
  - Migrations that backfill data on large tables or create indices can take a long time, even up to hours
  - We have the same issue today with the Recreate rollout, which is why this is listed as optional
Out of Scope:
High-level list of items that are out of scope. Initial completion during Refinement status.
- Having multiple central pods outside the upgrade process
Outstanding Questions (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>