Type: Epic
Resolution: Unresolved
Priority: Major
Summary: Debezium hardening
Category: Quality / Stability / Reliability
Status: In Progress
Progress: 20% To Do, 20% In Progress, 60% Done
User Story
As the Management Fabric Engineer and service owner, I want to ensure our Debezium-dependent services and related infrastructure are properly configured and monitored to avoid customer-impacting outages. Replication issues, including replication slot drops, require re-syncing data, and access data will be incorrect in the meantime, which can lead to privilege escalation or invalid denials.
Acceptance Criteria
- Databases consumed by Debezium are hardened with optimal replication settings
- Debezium services (Kafka Connect) are actively monitored in the AppSRE Observability Stack, with alerts configured to page engineers
- Debezium-dependent services (consumers, connectors) are actively monitored in the AppSRE Observability Stack, with alerts configured to page engineers
Potential places to start:
Remove WAL size limit
The primary reason for losing a slot is its lag growing too large. To eliminate that failure mode, we need to remove the WAL size limit, which allows the WAL to grow without bound. That introduces its own problems, but we can address them in turn.
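On a PostgreSQL source (which the reference to the WAL implies), the cap that invalidates lagging slots is max_slot_wal_keep_size, and setting it to -1 removes the limit. A minimal sketch of a check for that setting, assuming PostgreSQL 13+ and the psycopg2 driver, with a placeholder DSN:

```python
# Minimal check that the source database no longer caps the WAL retained by
# replication slots. Assumes PostgreSQL 13+ and psycopg2; the DSN is a placeholder.
import psycopg2

def wal_limit_is_removed(dsn: str) -> bool:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW max_slot_wal_keep_size;")
        (value,) = cur.fetchone()
        # "-1" means slots may retain WAL without limit, so a slot is never
        # invalidated just because its lag grew too large.
        return value == "-1"

if __name__ == "__main__":
    print(wal_limit_is_removed("dbname=example host=localhost user=postgres"))
```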
Monitor end-to-end replication
To ensure replication stays responsive, it must be monitored end to end, from a write on the source database to its arrival in the consuming service.
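One way to do this is a heartbeat round trip: Debezium connectors already support a periodic heartbeat (heartbeat.interval.ms, plus heartbeat.action.query on the PostgreSQL connector), and the delay until that record reaches the consumer is the end-to-end lag. A rough sketch of such a probe, assuming PostgreSQL on both ends and psycopg2; the table names and DSNs are hypothetical, and a real check would publish the delay as a metric rather than return it:

```python
# End-to-end replication probe sketch: write a marker row on the source side and
# time how long it takes to appear on the consumer side. Table names are hypothetical.
import time
import uuid

import psycopg2

def measure_replication_delay(source_dsn: str, sink_dsn: str, timeout_s: float = 60.0) -> float:
    marker = str(uuid.uuid4())
    with psycopg2.connect(source_dsn) as src, src.cursor() as cur:
        cur.execute(
            "INSERT INTO debezium_heartbeat (id, sent_at) VALUES (%s, now())",
            (marker,),
        )
    started = time.monotonic()
    with psycopg2.connect(sink_dsn) as sink, sink.cursor() as cur:
        while time.monotonic() - started < timeout_s:
            cur.execute("SELECT 1 FROM replicated_heartbeat WHERE id = %s", (marker,))
            if cur.fetchone():
                return time.monotonic() - started
            time.sleep(1.0)
    raise TimeoutError(f"heartbeat {marker} was not replicated within {timeout_s}s")
```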
Auto-scale DB storage
When the WAL can grow unbounded, it can consume all available storage and block both read and write requests to the DB; in other words, it can cause a full outage. To avoid this, we can auto-scale the DB storage.
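If the databases run on Amazon RDS (an assumption, not stated in this ticket), storage autoscaling is a built-in feature enabled by setting a maximum allocated storage on the instance. A boto3 sketch with placeholder values:

```python
# Sketch of enabling RDS storage autoscaling, assuming the database is an RDS
# instance managed with boto3; the identifier and ceiling are placeholders.
import boto3

def enable_storage_autoscaling(instance_id: str, max_allocated_storage_gib: int) -> None:
    rds = boto3.client("rds")
    # RDS grows allocated storage automatically up to MaxAllocatedStorage when free
    # space runs low, so unbounded WAL growth cannot fill the disk and cause a full outage.
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        MaxAllocatedStorage=max_allocated_storage_gib,
        ApplyImmediately=True,
    )

if __name__ == "__main__":
    enable_storage_autoscaling("example-db-instance", 1000)
```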
Monitor replication lag and/or storage size
If we auto-scale the DB storage, the next problem is unbounded cost due to unbounded WAL growth. To solve this, we can monitor replication lag, which we already do. That way we know the WAL is growing well before it causes excessive cost or even requires scaling the storage, and we can intervene accordingly. We can additionally monitor storage size, or cost directly.
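A hedged sketch of the underlying lag query, assuming a PostgreSQL source and direct SQL access; in practice these numbers would come from the existing observability stack and feed alert rules, and the threshold below is only a placeholder:

```python
# Report replication slots whose retained WAL exceeds a (placeholder) threshold.
# pg_replication_slots.restart_lsn marks the oldest WAL a slot still needs.
import psycopg2

LAG_THRESHOLD_BYTES = 5 * 1024**3  # hypothetical 5 GiB alerting threshold

def slots_over_threshold(dsn: str) -> list[tuple[str, int]]:
    query = """
        SELECT slot_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
          FROM pg_replication_slots
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        return [
            (name, int(lag))
            for name, lag in cur.fetchall()
            if lag is not None and lag > LAG_THRESHOLD_BYTES
        ]
```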
Customer-facing status for access replication
Since replication problems are possible, it would be good to let customers know when replication is severely degraded, so they have that context when configuring access policies (e.g. an entry on status.redhat.com for this, named in a way that is clear to customers).
Alternatives
Automatically disable writes at a certain replication lag or free database disk space
This would require RBAC to monitor the database's lag and free storage, and to automatically reject write requests once a certain threshold is crossed. This is effectively a hard gate on cost growth, at the cost of degraded service.
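For illustration only, a framework-agnostic sketch of such a write guard; the threshold is hypothetical, and how the current lag value is obtained and cached, and how the guard is wired into RBAC's request handling, is not specified here:

```python
# Reject writes when replication lag crosses a hard threshold (placeholder value).
from http import HTTPStatus

MAX_WRITE_LAG_BYTES = 10 * 1024**3  # hypothetical hard cap before writes are refused

def guard_write(current_lag_bytes: int) -> tuple[int, str] | None:
    """Return an error response tuple if lag is too high, otherwise None (allow the write)."""
    if current_lag_bytes > MAX_WRITE_LAG_BYTES:
        return (
            HTTPStatus.SERVICE_UNAVAILABLE,
            "Access replication is severely degraded; write requests are temporarily rejected.",
        )
    return None
```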
Default Done Criteria
- All existing/affected SOPs have been updated.
- New SOPs have been written.
- The feature has both unit and end-to-end tests passing in all test pipelines and through upgrades.
- If the feature requires QE involvement, QE has signed off.
- Any updates are replicated to the FedRAMP environment where applicable.
References
Doc: https://docs.google.com/document/d/1pS0usgMrNR5gFNrJeiFUOkbRVdb-7uFeAdKnOsS0FqU/edit?tab=t.0
- is related to: RHCLOUD-39616 Internal Data Consistency Hardening (Closed)