Type: Outcome
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Component/s: Central, Collector, Scanner & Vulnerability Feeds, Sensor
Labels:
- 4.11.0

Activity Type:
Future Sustainability
Blocked:
False
Blocked Reason:

None
Ready:
False
Color Status:
Not Selected
Hierarchy Progress Bar:

71% To Do, 0% In Progress, 29% Done
Intelligence Requested:
Market:
Product Documentation Required:
Yes

Target Version:

4.11.0

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Summary:

Ensure Red Hat Advanced Cluster Security (RHACS) maintains a continuous security posture and recovers swiftly from failures or disasters by addressing critical data loss and operational continuity challenges during Central outages. This enables customers, particularly those with stringent requirements such as financial institutions, to deploy RHACS with confidence in critical production environments and to recover quickly from catastrophic events.

Problem Statement/Context:
The current implementation of RHACS lacks robust HA/DR capabilities, and the documentation covers little beyond backup and restore procedures. This presents significant challenges for customers:

Customer DR Gap: A customer's initial plan to simply spin up a new cluster and reconnect secured clusters to a newly built Central is unacceptable because the historical data in the old database would be lost.
Cluster Name & Identity Issues: Backing up a Central instance and restoring it in a new cluster would conceptually work if the secondary cluster kept the same name. However, cluster DNS often differs, which is assumed to create issues with both certificates and data in the database, because the database contains the original cluster name.
Limited Central Architecture: Unlike other security scanning tools (e.g., Nessus/ACAS), RHACS does not support connecting to more than one Central or a tiered Central architecture. This prevents scenarios where a disconnected site could have a local Central capturing all events.
Singleton Deployment Limitations: RHACS Central follows a singleton deployment model without built-in support for automated failover or leader election. Central is therefore a single point of failure. Customers cannot deploy multiple Centrals with coordinated failover, which makes true high availability difficult to achieve in production environments. In outage scenarios, service continuity depends on manual intervention and recovery, which increases MTTR and jeopardizes security visibility.
Build-Time Policy Continuity: If Central is down and roxctl is configured across CI/CD pipelines, all builds break for as long as Central is offline. This severely impacts development velocity and creates significant operational overhead during outages.
Violations Continuity: Customers need to continue receiving violations in their SIEM or other destinations even when Central is down. This is crucial for ongoing incident response and audit trails.
Collector Stability Issues: Some customers have observed collectors entering a crash-looping state, which generates false positives and requires alerts to be silenced manually. This undermines the reliability of security alerts and can cause real problems to be missed, suggesting a need for increased collector resilience and configurability around startup retries.

Expected end user outcomes:

Security and platform administrators will be able to deploy and configure RHACS for high availability, ensuring core security services (e.g., Central, Scanner, Collector communication) maintain operation even if individual components fail. They will also be able to implement effective disaster recovery strategies, including solutions for restoring RHACS instances and their data across clusters with differing names. This is critical for customers who require a robust DR plan to sign off on production deployments.

Success Criteria or KPIs measured:

The metrics below are based on the assumption that customers will follow the best practices and guidance we provide.

Reduction in Mean Time To Recovery (MTTR) for RHACS outages after implementing HA/DR.
Percentage of RHACS deployments utilizing supported HA configurations in production.
No loss of security events or history reported by collectors during simulated Central disconnection and reconnection events.
Zero reported CI/CD pipeline breaks directly attributable to Central downtime after fallback mechanisms are enabled.
Customer satisfaction scores related to RHACS uptime, resilience, and alert accuracy.
Compliance with defined RTO/RPO targets in test and production environments.
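
As a rough illustration of how the MTTR and RTO/RPO criteria above could be evaluated from outage records, here is a minimal Python sketch. The `Outage` record, its field names, and both helper functions are hypothetical and exist only for this example; nothing here is part of RHACS, and the actual targets would be defined per customer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    # Hypothetical record of one Central outage (all fields are assumptions).
    started: datetime         # when Central became unavailable
    recovered: datetime       # when service was restored
    last_good_data: datetime  # newest data still present after recovery


def mttr(outages):
    """Mean time to recovery: average downtime across the recorded outages."""
    total = sum((o.recovered - o.started for o in outages), timedelta())
    return total / len(outages)


def meets_targets(outage, rto, rpo):
    """True when downtime is within the RTO and the data-loss window is within the RPO."""
    downtime = outage.recovered - outage.started
    data_loss = outage.started - outage.last_good_data
    return downtime <= rto and data_loss <= rpo
```

Comparing `mttr(...)` over outages recorded before and after an HA/DR rollout would give the reduction named in the first criterion, while `meets_targets` checks a single simulated or real outage against agreed RTO and RPO values.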

Use Cases (Optional):

Include use case diagrams, main success scenarios, alternative flow scenarios together with user type/persona. Initial completion during Refinement status.

Out of Scope (Optional):

High-level list of items that are out of scope. Initial completion during Refinement status.

clones

ROX-29459 [Discovery] Document RHACS Scalability and Reliability Best Practices for Customers (Closed)

is depended on by

RFE-5897 Need HA/DR capabilities for ACS (Waiting)
RFE-7982 ScannerDB with a BYODB (Bring Your Own Database) option -ACS (Waiting)

is related to

RFE-7828 Enable Automatic HTTPS Redirection for RHACS Central Route (Closed)
