
    • Debezium hardening
    • Quality / Stability / Reliability
    • Status: In Progress
    • 20% To Do, 20% In Progress, 60% Done

      User Story

      As the Management Fabric Engineer and service owner, I want to ensure our Debezium-dependent services and related infrastructure are properly configured and monitored to avoid customer-impacting outages. Replication issues, including dropped replication slots, require re-syncing data; in the meantime, access decisions are based on stale data, which can lead to privilege escalation or invalid denials.

      Acceptance Criteria

      • Databases that are consumed by Debezium are hardened with optimal settings for replication
      • Debezium services (Kafka Connect) are actively monitored in AppSRE Observability Stack, with Alerts configured to ensure paging of Engineers
      • Debezium dependent services (Consumers, Connectors) are actively monitored in AppSRE Observability Stack, with Alerts configured to ensure paging of Engineers

      Potential places to start:

      Remove WAL size limit

      The primary reason for losing a slot is that its lag grows too large. To prevent this, we can remove the WAL size limit, allowing the WAL to grow without bound. This introduces its own problems, but we can address those in turn.
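      A minimal sketch of what this could look like on PostgreSQL 13+, assuming the limit in question is max_slot_wal_keep_size (a slot is invalidated once its retained WAL exceeds this setting; -1 disables the limit):

      ```sql
      -- Sketch only: lift the per-slot WAL retention cap so lagging
      -- replication slots are never invalidated (PostgreSQL 13+).
      -- -1 means "no limit"; the WAL can then grow without bound.
      ALTER SYSTEM SET max_slot_wal_keep_size = -1;
      SELECT pg_reload_conf();

      -- Verify the setting took effect.
      SHOW max_slot_wal_keep_size;
      ```

      On managed databases (e.g. RDS), the equivalent change would go through the parameter group rather than ALTER SYSTEM.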

      Monitor end to end replication

      To ensure replication is responsive, it must be monitored end to end.
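      On the database side, slot lag can be observed directly; a sketch of the relevant query:

      ```sql
      -- Bytes of WAL each replication slot is holding back: a proxy
      -- for consumer lag as seen from the source database.
      SELECT slot_name,
             active,
             pg_size_pretty(
               pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
             ) AS retained_wal
      FROM pg_replication_slots;
      ```

      This covers only the database half; for true end-to-end coverage, the Kafka Connect / Debezium side would also need to be scraped (Debezium exposes streaming metrics such as MilliSecondsBehindSource via JMX).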

      Auto-scale DB storage

      When the WAL can grow unbounded, it can consume all available storage and prevent both read and write requests to the DB. In other words, it can cause a full outage. To avoid this, we can auto-scale the DB storage.

      Monitor replication lag and/or storage size

      If we auto-scale the DB storage, the next problem is unbounded cost due to unbounded WAL growth. To solve this we can monitor for replication lag, which we already do. This way we know the WAL is growing well before it causes excessive costs or even requires scaling the storage, and can intervene accordingly. We can additionally monitor storage size, or cost directly.
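      Storage growth can be watched from the database itself; a sketch of the relevant queries (pg_ls_waldir requires PostgreSQL 10+):

      ```sql
      -- Total size of the WAL directory: the quantity that grows when
      -- a slot falls behind and retention is unlimited.
      SELECT pg_size_pretty(sum(size)) AS wal_size
      FROM pg_ls_waldir();

      -- Size of the current database, for comparison against
      -- provisioned storage.
      SELECT pg_size_pretty(pg_database_size(current_database()));
      ```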

      Customer facing status for access replication

      Since replication can degrade, it would be good to let customers know when it is severely degraded, so they have that context when configuring access policies (e.g. an entry on status.redhat.com for this, named in a way that is clear to customers).

      Alternatives

      Automatically disable writes at a certain replication lag or amount of free database disk space
      This would require RBAC to monitor the database for lag/storage and automatically reject write requests once a threshold is crossed. This acts as a hard cap on cost growth, at the cost of degraded service.
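      A sketch of the gate check the service could run before accepting writes (the threshold and slot name below are placeholders, not decided values):

      ```sql
      -- Reject writes when retained WAL for the Debezium slot exceeds
      -- a threshold (here 10 GiB, purely illustrative).
      SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
               > 10 * 1024^3 AS writes_should_be_disabled
      FROM pg_replication_slots
      WHERE slot_name = 'debezium';  -- placeholder slot name
      ```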

      Default Done Criteria

      • All existing/affected SOPs have been updated.
      • New SOPs have been written.
      • The feature has both unit and end to end tests passing in all test
        pipelines and through upgrades.
      • If the feature requires QE involvement, QE has signed off.
      • Any updates are replicated to the FedRAMP environment where applicable.

      References

      Doc: https://docs.google.com/document/d/1pS0usgMrNR5gFNrJeiFUOkbRVdb-7uFeAdKnOsS0FqU/edit?tab=t.0

              Assignee: Unassigned
              Reporter: rh-ee-zhzeng (Jay Zeng)
              Votes: 0
              Watchers: 3