  1. Red Hat Advanced Cluster Security
  2. ROX-30042

RHACS Enhanced Reliability and Scalability with HA/DR considerations

    • Future Sustainability
    • 75% To Do, 0% In Progress, 25% Done

      Summary:

      Ensure Red Hat Advanced Cluster Security (RHACS) maintains a continuous security posture and recovers swiftly from failures or disasters by addressing critical data loss and operational continuity challenges during Central outages. This lets customers, particularly those with stringent requirements such as financial institutions, deploy RHACS with confidence in critical production environments.

      Problem Statement/Context:
      The current implementation of RHACS lacks robust HA/DR capabilities, with documentation primarily detailing only backup and restore procedures. This presents significant challenges for customers:

      • Customer DR Gap: A customer's initial plan to simply spin up a new cluster and re-introduce secured clusters to a newly built Central is unacceptable due to the resulting loss of historical data from the old database.
      • Cluster Name & Identity Issues: Backing up a Central instance and restoring it in a new cluster would conceptually work if the secondary cluster kept the same name. However, cluster DNS often differs, which is assumed to cause issues with both certificates and database contents, because the database contains the original cluster name.
      • Limited Central Architecture: Unlike other security scanning tools (e.g., Nessus/ACAS), RHACS does not support connecting to more than one Central or a tiered Central architecture. This prevents scenarios where a disconnected site could have a local Central capturing all events.
      • Singleton Deployment Limitations: RHACS Central follows a singleton deployment model without built-in support for automated failover or leader election. This results in operational risk, as Central becomes a single point of failure. Customers cannot deploy multiple Centrals with coordinated failover, making it difficult to achieve real high availability in production environments. In outage scenarios, service continuity depends on manual intervention and recovery, which increases MTTR and jeopardizes security visibility.
      • Build-Time Policy Continuity: If Central is down and roxctl is configured across CI/CD pipelines, all builds will break for the duration of Central being offline. This severely impacts development velocity and requires significant operational overhead during outages.
      • Violations Continuity: Customers need to continue receiving violations to their SIEM or other destinations even when Central is down. This is crucial for ongoing incident response and audit trails.
      • Collector Stability Issues: Some customers have observed collectors entering a crash-looping state, which generates false positives and requires manual silencing of alerts. This undermines the reliability of security alerts and can cause real problems to be missed, suggesting a need for greater collector resilience and configurability around startup retries.
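
      The Build-Time Policy Continuity problem above can be illustrated with a small pipeline wrapper. This is only a sketch of one possible fallback, not an existing RHACS or roxctl feature: the names `central_reachable` and `gate_build`, the `fail_open` switch, and the TCP probe are all assumptions made for illustration.

```python
import socket
import subprocess
import sys
from typing import Callable, Sequence


def central_reachable(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Hypothetical probe: True if a TCP connection to Central can be opened.

    This checks network reachability only, not whether Central is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def gate_build(
    check_cmd: Sequence[str],
    reachable: Callable[[], bool],
    fail_open: bool,
) -> int:
    """Run the policy check only when Central is reachable.

    When Central is unreachable, return 0 (let the build continue) if
    fail_open is set, otherwise 1 (block the build). The choice between
    the two is a risk decision for the pipeline owner.
    """
    if not reachable():
        mode = "continuing without a check" if fail_open else "blocking the build"
        print(f"Central unreachable, {mode}", file=sys.stderr)
        return 0 if fail_open else 1
    # Central is reachable: the check command's exit code decides the build.
    return subprocess.run(list(check_cmd)).returncode
```

      In a pipeline, `check_cmd` would be the existing check step (for example a `roxctl image check` invocation) and `reachable` would wrap `central_reachable` with the Central hostname. Failing open keeps builds moving during an outage but skips policy enforcement for that window, so skipped checks would need to be recorded and re-run later.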

      Expected end user outcomes:

      Security and platform administrators will be able to deploy and configure RHACS for high availability, ensuring core security services (e.g., Central, Scanner, Collector communication) maintain operation even if individual components fail. They will also be able to implement effective disaster recovery strategies, including solutions for restoring RHACS instances and their data across clusters with differing names. This is critical for customers who require a robust DR plan to sign off on production deployments.  

      Success Criteria or KPIs measured:

      The metrics below are based on the assumption that customers will follow the best practices and guidance we provide.

      • Reduction in Mean Time To Recovery (MTTR) for RHACS outages after implementing HA/DR.
      • Percentage of RHACS deployments utilizing supported HA configurations in production.
      • No loss of security events or history reported by collectors during simulated Central disconnection and reconnection events.
      • Zero reported CI/CD pipeline breaks directly attributable to Central downtime after fallback mechanisms are enabled.
      • Customer satisfaction scores related to RHACS uptime, resilience, and alert accuracy.
      • Compliance with defined RTO/RPO targets in test and production environments.
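
      As a rough illustration of how the MTTR and RTO/RPO criteria above could be evaluated from outage records, the sketch below uses a hypothetical `Outage` record and helper names. It assumes the data-loss window is the time between the last usable backup and the start of the outage, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Outage:
    """Hypothetical record of one Central outage."""
    last_backup: datetime  # last usable backup taken before the outage
    started: datetime      # when Central became unavailable
    recovered: datetime    # when Central was serving again


def mttr(outages: Sequence[Outage]) -> timedelta:
    """Mean Time To Recovery: average of (recovered - started)."""
    if not outages:
        raise ValueError("at least one outage is required")
    total = sum((o.recovered - o.started for o in outages), timedelta())
    return total / len(outages)


def meets_targets(outage: Outage, rto: timedelta, rpo: timedelta) -> bool:
    """True if recovery time is within RTO and the data-loss window within RPO."""
    recovery_time = outage.recovered - outage.started
    data_loss_window = outage.started - outage.last_backup
    return recovery_time <= rto and data_loss_window <= rpo
```

      Comparing `mttr` over outages recorded before and after an HA/DR rollout gives the reduction named in the first criterion, while `meets_targets` can be applied per simulated outage against the agreed RTO and RPO.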

      Use Cases (Optional):

      Include use case diagrams, main success scenarios, alternative flow scenarios together with user type/persona. Initial completion during Refinement status.


      Out of Scope (Optional):

      High-level list of items that are out of scope. Initial completion during Refinement status.


              atelang@redhat.com Anjali Telang
              Doron Caspin