Type: Epic
Resolution: Unresolved
Priority: Major
Summary: Debezium hardening
Category: Quality / Stability / Reliability
Status: In Progress
Progress: 20% To Do, 20% In Progress, 60% Done
User Story
As the Management Fabric Engineer and service owner, I want to ensure our Debezium-dependent services and related infrastructure are properly configured and monitored to avoid customer-impacting outages. Replication issues, including replication slot drops, require re-syncing data, and access data will be incorrect in the meantime, which can lead to privilege escalation or invalid denials.
Acceptance Criteria
- Databases consumed by Debezium are hardened with optimal replication settings
- Debezium services (Kafka Connect) are actively monitored in the AppSRE Observability Stack, with alerts configured to page engineers
- Debezium-dependent services (consumers, connectors) are actively monitored in the AppSRE Observability Stack, with alerts configured to page engineers
Potential places to start:
Remove WAL size limit
The primary reason for losing a slot is its lag growing too large. To eliminate that failure mode, we need to remove the WAL size limit, which allows the WAL to grow without bound. That introduces its own problems, but we can address them in turn.
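On a PostgreSQL source (which the reference to the WAL implies), the cap that invalidates lagging slots is max_slot_wal_keep_size, and setting it to -1 removes the limit. A minimal sketch of a check for that setting, assuming PostgreSQL 13+ and the psycopg2 driver, with a placeholder DSN:

```python
# Minimal check that the source database no longer caps the WAL retained by
# replication slots. Assumes PostgreSQL 13+ and psycopg2; the DSN is a placeholder.
import psycopg2

def wal_limit_is_removed(dsn: str) -> bool:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW max_slot_wal_keep_size;")
        (value,) = cur.fetchone()
        # "-1" means slots may retain WAL without limit, so a slot is never
        # invalidated just because its lag grew too large.
        return value == "-1"

if __name__ == "__main__":
    print(wal_limit_is_removed("dbname=example host=localhost user=postgres"))
```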
Monitor end-to-end replication
To ensure replication stays responsive, it must be monitored end to end, from a write on the source database to its arrival in the consuming service.
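One way to do this is a heartbeat round trip: Debezium connectors already support a periodic heartbeat (heartbeat.interval.ms, plus heartbeat.action.query on the PostgreSQL connector), and the delay until that record reaches the consumer is the end-to-end lag. A rough sketch of such a probe, assuming PostgreSQL on both ends and psycopg2; the table names and DSNs are hypothetical, and a real check would publish the delay as a metric rather than return it:

```python
# End-to-end replication probe sketch: write a marker row on the source side and
# time how long it takes to appear on the consumer side. Table names are hypothetical.
import time
import uuid

import psycopg2

def measure_replication_delay(source_dsn: str, sink_dsn: str, timeout_s: float = 60.0) -> float:
    marker = str(uuid.uuid4())
    with psycopg2.connect(source_dsn) as src, src.cursor() as cur:
        cur.execute(
            "INSERT INTO debezium_heartbeat (id, sent_at) VALUES (%s, now())",
            (marker,),
        )
    started = time.monotonic()
    with psycopg2.connect(sink_dsn) as sink, sink.cursor() as cur:
        while time.monotonic() - started < timeout_s:
            cur.execute("SELECT 1 FROM replicated_heartbeat WHERE id = %s", (marker,))
            if cur.fetchone():
                return time.monotonic() - started
            time.sleep(1.0)
    raise TimeoutError(f"heartbeat {marker} was not replicated within {timeout_s}s")
```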
Auto-scale DB storage
When the WAL can grow unbounded, it can consume all available storage and block both read and write requests to the DB; in other words, it can cause a full outage. To avoid this, we can auto-scale the DB storage.
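If the databases run on Amazon RDS (an assumption, not stated in this ticket), storage autoscaling is a built-in feature enabled by setting a maximum allocated storage on the instance. A boto3 sketch with placeholder values:

```python
# Sketch of enabling RDS storage autoscaling, assuming the database is an RDS
# instance managed with boto3; the identifier and ceiling are placeholders.
import boto3

def enable_storage_autoscaling(instance_id: str, max_allocated_storage_gib: int) -> None:
    rds = boto3.client("rds")
    # RDS grows allocated storage automatically up to MaxAllocatedStorage when free
    # space runs low, so unbounded WAL growth cannot fill the disk and cause a full outage.
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        MaxAllocatedStorage=max_allocated_storage_gib,
        ApplyImmediately=True,
    )

if __name__ == "__main__":
    enable_storage_autoscaling("example-db-instance", 1000)
```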
Monitor replication lag and/or storage size
If we auto-scale the DB storage, the next problem is unbounded cost due to unbounded WAL growth. To solve this, we can monitor replication lag, which we already do. That way we know the WAL is growing well before it causes excessive cost or even requires scaling the storage, and we can intervene accordingly. We can additionally monitor storage size, or cost directly.
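A hedged sketch of the underlying lag query, assuming a PostgreSQL source and direct SQL access; in practice these numbers would come from the existing observability stack and feed alert rules, and the threshold below is only a placeholder:

```python
# Report replication slots whose retained WAL exceeds a (placeholder) threshold.
# pg_replication_slots.restart_lsn marks the oldest WAL a slot still needs.
import psycopg2

LAG_THRESHOLD_BYTES = 5 * 1024**3  # hypothetical 5 GiB alerting threshold

def slots_over_threshold(dsn: str) -> list[tuple[str, int]]:
    query = """
        SELECT slot_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
          FROM pg_replication_slots
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        return [
            (name, int(lag))
            for name, lag in cur.fetchall()
            if lag is not None and lag > LAG_THRESHOLD_BYTES
        ]
```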
Customer-facing status for access replication
Since replication problems are possible, it would be good to let customers know when replication is severely degraded, so they have that context when configuring access policies (e.g. an entry on status.redhat.com for this, named in a way that is clear to customers).
Alternatives
Automatically disable writes at a certain replication lag or free database disk space
This would require RBAC to monitor the database's lag and free storage, and to automatically reject write requests once a certain threshold is crossed. This is effectively a hard gate on cost growth, at the cost of degraded service.
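For illustration only, a framework-agnostic sketch of such a write guard; the threshold is hypothetical, and how the current lag value is obtained and cached, and how the guard is wired into RBAC's request handling, is not specified here:

```python
# Reject writes when replication lag crosses a hard threshold (placeholder value).
from http import HTTPStatus

MAX_WRITE_LAG_BYTES = 10 * 1024**3  # hypothetical hard cap before writes are refused

def guard_write(current_lag_bytes: int) -> tuple[int, str] | None:
    """Return an error response tuple if lag is too high, otherwise None (allow the write)."""
    if current_lag_bytes > MAX_WRITE_LAG_BYTES:
        return (
            HTTPStatus.SERVICE_UNAVAILABLE,
            "Access replication is severely degraded; write requests are temporarily rejected.",
        )
    return None
```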
Default Done Criteria
- All existing/affected SOPs have been updated.
- New SOPs have been written.
- The feature has both unit and end-to-end tests passing in all test pipelines and through upgrades.
- If the feature requires QE involvement, QE has signed off.
- Any updates are replicated to the FedRAMP environment where applicable.
References
Doc: https://docs.google.com/document/d/1pS0usgMrNR5gFNrJeiFUOkbRVdb-7uFeAdKnOsS0FqU/edit?tab=t.0
- is related to: RHCLOUD-39616 Internal Data Consistency Hardening (Closed)