  1. Hybrid Cloud Console
  2. RHCLOUD-35738

To ensure replication is reliable, monitor end-to-end replication lag

    • CRCPLAN-232 - Kessel | PRBAC v2 Service Provider Migration Enablement (Internal)
    • Access & Management Sprint 98, Access & Management Sprint 99, Access & Management Sprint 100, ReBAC Tech Debt Sprint Q4 2025

      Goal:

      • We need to be alerted when there is a substantial delay between a replication event being written to the outbox and its tuples being written to SpiceDB. "Substantial" is TBD, but somewhere in the range of many seconds to a minute.
      • We must be able to tell if there is a delay at ANY part of the process between RBAC and SpiceDB (i.e., it is not acceptable to measure only from Kafka to SpiceDB, etc.)

      One possible idea:

      • Add a counter to RBAC that increments for each event added to the outbox (the OutboxReplicator would be a sensible catch-all for this)
      • Add a counter on the sink connector (or reuse one if it already exists) that increments for each event replicated to relations
      • Add an SLO for the delta between the rates of those two counters. If the difference is too high, alert.
        • What window to use?
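
To make the counter idea concrete, here is a minimal sketch in plain Python of comparing the two rates over a trailing window. Everything in it is an assumption for illustration: the `Sample` type, the `rate` and `replication_lagging` helpers, and the threshold are hypothetical, and in practice the comparison would more likely be expressed as an alerting rule in the monitoring system than in application code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One scrape of a cumulative counter (hypothetical type for this sketch)."""
    timestamp: float  # seconds
    value: float      # cumulative counter value at that time


def rate(samples: Sequence[Sample], window_s: float, now: float) -> float:
    """Per-second increase of a counter over the trailing window.

    Assumes samples are sorted by timestamp and the counter never resets.
    Returns 0.0 when fewer than two samples fall inside the window.
    """
    in_window = [s for s in samples if now - window_s <= s.timestamp <= now]
    if len(in_window) < 2:
        return 0.0
    first, last = in_window[0], in_window[-1]
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        return 0.0
    return (last.value - first.value) / elapsed


def replication_lagging(
    outbox_samples: Sequence[Sample],
    sink_samples: Sequence[Sample],
    window_s: float,
    now: float,
    max_rate_delta: float,
) -> bool:
    """True when events enter the outbox faster than the sink replicates them
    by more than max_rate_delta events per second over the window."""
    delta = rate(outbox_samples, window_s, now) - rate(sink_samples, window_s, now)
    return delta > max_rate_delta


# Example: the outbox grows by 10 events/s while the sink handles 5 events/s.
outbox = [Sample(0, 0), Sample(30, 300), Sample(60, 600)]
sink = [Sample(0, 0), Sample(30, 150), Sample(60, 300)]
lagging = replication_lagging(outbox, sink, window_s=60, now=60, max_rate_delta=2.0)
```

One consequence worth weighing when choosing the window: a rate delta is only non-zero while the backlog is growing. If the sink falls behind and then matches the outbox rate again without catching up, the rates are equal even though events are still delayed, so comparing cumulative totals may be a useful complement.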

      Another possible idea:

      • Keep events in the outbox (don't remove them immediately after adding them)
      • Add an acknowledge endpoint to RBAC which accepts one outbox event identifier and removes that event from the outbox table
      • Call this API from the sink connector after successfully processing a message
      • Add a gauge for the number of outbox events in the table
      • Alert if the gauge value is too high
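
The acknowledge-and-gauge flow above can be sketched with an in-memory stand-in for the outbox table. This is only an illustration under stated assumptions: the `Outbox` class, its method names, and `should_alert` are hypothetical, and a real implementation would back these with the outbox table, an RBAC HTTP endpoint, and a metrics gauge.

```python
from typing import Any, Dict


class Outbox:
    """In-memory stand-in for the outbox table (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._pending: Dict[str, Any] = {}

    def add(self, event_id: str, payload: Any) -> None:
        """Record an event; it stays pending until it is acknowledged."""
        self._pending[event_id] = payload

    def acknowledge(self, event_id: str) -> bool:
        """What the acknowledge endpoint would do: remove one event by id.

        Returns False for unknown or already acknowledged ids, so a repeated
        call from the sink connector is harmless.
        """
        if event_id in self._pending:
            del self._pending[event_id]
            return True
        return False

    def pending_count(self) -> int:
        """Value the gauge would report: events not yet acknowledged."""
        return len(self._pending)


def should_alert(pending: int, threshold: int) -> bool:
    """Alert when more events are awaiting acknowledgement than the threshold."""
    return pending > threshold


# Example: two events written, one acknowledged by the sink connector.
box = Outbox()
box.add("e1", {"relation": "member"})
box.add("e2", {"relation": "owner"})
first_ack = box.acknowledge("e1")
repeat_ack = box.acknowledge("e1")
```

Because an event leaves the table only after the sink connector confirms it, the gauge covers every hop between RBAC and SpiceDB, which is what the goal asks for. The threshold would need tuning against normal write bursts.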

      https://docs.google.com/document/d/13kKwkmmyA2eWpdPngslZ2ZVLKztuJhLUpghD-jVQFR0/edit?tab=t.0#heading=h.7yzqekwoqya5

              mmclaugh@redhat.com Mark McLaughlin
              rhit-ahenning Alec Henninger