Uploaded image for project: 'Hybrid Cloud Console'
  1. Hybrid Cloud Console
  2. RHCLOUD-45639

Potential Schema Improvements to Reduce Serialization Errors

XMLWordPrintable

    • Product / Portfolio Work
    • False
    • Hide

      None

      Show
      None
    • False
    • None
    • Unset
    • None

      Follow up from investigating serialization errors in RHCLOUD-45441

      The migration to using WAL direclty vs the outbox table in RHCLOUD-45638  is estimated to reduce serialization errors by 75-80% but we still saw serialization errors across other tables that the WAL change wont fix

      While investigating, Claude provided some potential changes we could make to help reduce our errors even more. It seemed worth looking into these changes and understanding if they would improve our service without impacting performance or consistency.

      ---------------------------

      Schema Analysis for SSI Optimization

      reporter_resources (19 errors - well-indexed)

      Current:

      Indexes:
          "reporter_resources_pkey" PRIMARY KEY, btree (id)
          "reporter_resource_key_idx" UNIQUE, btree (local_resource_id, reporter_type,
              resource_type, reporter_instance_id, representation_version, generation)
          "reporter_resource_resource_id_idx" btree (resource_id)
          "reporter_resource_search_idx" btree (local_resource_id, reporter_type,
              resource_type, reporter_instance_id)
      

      Assessment: This table is well-indexed. Your `FindResourceByKeys` query uses `reporter_resource_search_idx` for the initial lookup and `reporter_resource_resource_id_idx` for the self-join. These narrow the predicate locks to specific index entries rather than full-table scans. Only 19 errors came from this table - no changes needed.

      reporter_representations (80 errors)

      Current:

      Indexes:
          "reporter_representations_pkey" PRIMARY KEY, btree (reporter_resource_id, version, generation)
          "ux_reporter_reps_txid_nn" UNIQUE, btree (transaction_id)
              WHERE transaction_id IS NOT NULL AND transaction_id::text <> ''::text
      Foreign-key constraints:
          FK reporter_resource_id → reporter_resources(id) ON DELETE CASCADE
      

      Issues:

      • The partial unique index on `transaction_id` is used by `HasTransactionIdBeenProcessed`. When checking `EXISTS (SELECT 1 FROM reporter_representations WHERE transaction_id = ?)`, SSI places a SIRead lock on the index. If the transaction_id doesn't exist yet, PostgreSQL creates a gap lock on the index, which can conflict with concurrent INSERTs that create new transaction_id entries.
      • The CASCADE DELETE from reporter_resources means deleting a reporter_resource triggers reads on this table's FK index, widening the SSI dependency graph.

      Assessment: 72 of the 80 errors are "identification as a pivot, during conflict in checking" - consistent with the gap lock pattern on the transaction_id index. 

      common_representations (140 errors)

      Current:

      Indexes:
          "common_representations_pkey" PRIMARY KEY, btree (resource_id, version)
          "ux_common_reps_txid_nn" UNIQUE, btree (transaction_id)
              WHERE transaction_id IS NOT NULL AND transaction_id::text <> ''::text
      

      Issues:

      • Same gap lock problem on the transaction_id partial unique index as reporter_representations.
      • All 140 errors are "identification as a pivot, during conflict in checking" - the `HasTransactionIdBeenProcessed` check reads this index, and concurrent INSERTs write to it, creating the dependency cycle.
      • No foreign key to resource table - this means no additional FK constraint checking during INSERTs, which is good for SSI.

      Assessment: The 140 errors here are likely caused by the combination of:
      1. `HasTransactionIdBeenProcessed` reading the transaction_id index (SIRead lock on gap)
      2. A concurrent transaction inserting a new row with a different transaction_id (write into the same index range)

      resource (3 errors - fine)

      Current:

      Indexes:
          "resource_pkey" PRIMARY KEY, btree (id)
      Check constraints:
          "chk_resource_common_version" CHECK (common_version >= 0)
      

      Assessment: Only 3 errors. The PK index provides sufficient granularity. No changes needed.

              anatale.openshift Antony Natale
              anatale.openshift Antony Natale
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated: