• Epic
    • Resolution: Unresolved
    • Major
    • None
    • None
    • Central
    • Central HA Phase 1: Minimal downtime upgrades
    • Product / Portfolio Work
    • False
    • False
    • Not Selected
    • To Do
    • ROX-32124 - [Discovery] High Availability Scanning Architecture Research
    • 100% To Do, 0% In Progress, 0% Done
    • 2026-02-24: Addressing some design review comments while waiting for the final technical review meeting on Feb 26th
    • 2026-02-17: Proposal for changes of the feature ROX-32124 presented and signed off at ACS ENG / PM sync. Technical design for minimal downtime upgrades under review: https://docs.google.com/document/d/1kRyn96HcL7O6Eje8ptYikKjnUfM3AxL3nxWhU-Q-Fj0

      Overview:

      A high-level summary that describes the Epic in a clear, concise way. Complete during New status.

      Part of ROX-32124.

      One of the biggest issues with central's availability is that, for larger deployments, database migrations lead to unacceptably long downtime of the API. This blocks customers' CI and/or prevents them from putting ACS scans into critical build paths.

      This Epic should address that issue by making sure old central pods can keep running while a new version is rolling out and running its database migrations.

      Requirements:

      A list of specific needs or objectives that an epic must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.

      •  

      Technical Scope:

      High-level list of items that are in scope; usually completed by a staff engineer or a lead from the Feature Delivery Team. Initial completion during Refinement status.

      • Central's deployment strategy needs to change from Recreate to RollingUpdate
      • The DB migration process needs to be aware that other processes could start migrations at the same time, and it must prevent conflicts between them
      • Background workers need to be aware that multiple centrals can be running for the upgrade period, so that they don't conflict or do the same work twice
      • DB migration and review guidelines should be updated
        • Verify we don't exclusively lock tables
        • Verify the migrations are implemented in a way that they can handle new data being added or existing data being modified or deleted during migration
      • There have to be CI tests to ensure:
        • DB schema is backwards compatible (we already have this)
        • Old central can keep running while new central is in migration without major conflicts or introducing data inconsistencies
      • Addressing the backfill / index creation migrations issue
        • To prevent a central RollingUpdate from being perceived as a stuck rollout, migrations should finish in a reasonable time (under 10 minutes, maybe even less)
        • Migrations that backfill data on large tables or create indices can take a long time, even up to hours
        • We have the same issue today with the Recreate rollout, which is why this item is listed as optional
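
      The requirement that concurrent migration runners must not conflict can be illustrated with a small Go sketch. It is not central's actual migration code: Store, Migration, RunMigrations and simulateRollout are hypothetical names, and an in-memory mutex stands in for a database-level lock (in PostgreSQL, a session-level advisory lock taken with pg_advisory_lock could play that role). The idea shown is that each runner takes the lock first and only then reads the schema version, so a runner that had to wait skips migrations that are already applied.

```go
package main

import (
	"fmt"
	"sync"
)

// Migration is a hypothetical unit of schema change; Apply stands in for real SQL.
type Migration struct {
	Version int
	Apply   func() error
}

// Store is a hypothetical in-memory stand-in for the database. It holds the
// current schema version and a mutex that plays the role of a database-level
// migration lock.
type Store struct {
	migrationLock sync.Mutex
	version       int
}

// RunMigrations applies every migration newer than the stored schema version.
// Migrations are assumed to be sorted by ascending Version. The version is
// compared only while the lock is held, so a runner that waited behind another
// one sees the updated version and skips the finished work.
func (s *Store) RunMigrations(migrations []Migration) (int, error) {
	s.migrationLock.Lock()
	defer s.migrationLock.Unlock()

	applied := 0
	for _, m := range migrations {
		if m.Version <= s.version {
			continue // already applied by an earlier runner
		}
		if err := m.Apply(); err != nil {
			return applied, fmt.Errorf("migration %d: %w", m.Version, err)
		}
		s.version = m.Version
		applied++
	}
	return applied, nil
}

// simulateRollout starts n concurrent runners against one store and returns
// the total number of applied migrations and the final schema version.
func simulateRollout(n int) (totalApplied int, finalVersion int) {
	store := &Store{}
	migrations := []Migration{
		{Version: 1, Apply: func() error { return nil }},
		{Version: 2, Apply: func() error { return nil }},
	}

	var mu sync.Mutex // guards totalApplied
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			applied, err := store.RunMigrations(migrations)
			if err != nil {
				return
			}
			mu.Lock()
			totalApplied += applied
			mu.Unlock()
		}()
	}
	wg.Wait()
	return totalApplied, store.version
}

func main() {
	total, version := simulateRollout(3)
	fmt.Printf("applied %d migrations, schema version %d\n", total, version)
}
```

      With three concurrent runners, each of the two migrations is applied exactly once. A real implementation would also have to decide how long a waiting runner blocks and how it reports progress, since long waits feed into the stuck-rollout concern above.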

      Out of Scope:

      High-level list of items that are out of scope. Initial completion during Refinement status.

      • Having multiple central pods outside the upgrade process

      Outstanding Questions (Optional):

      Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.

              rh-ee-jmalsam Johannes Malsam
              rh-ee-jmalsam Johannes Malsam
              ACS Cloud Service