  1. Red Hat Advanced Cluster Management
  2. ACM-27968

Fix partition table missing on month rollover


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Global Hub
    • Quality / Stability / Reliability
    • GH Train-35

      Description of problem:

      When the manager tries to insert event data at the beginning of a new month, the insert fails with a "partition not found" error. This happens when the operator restarts shortly before the month boundary (e.g., Dec 28-31), after the data retention cron job has already created the next month's partition.

      The affected tables include:
      - event.local_policies
      - event.local_root_policies
      - event.managed_clusters
      - history.local_compliance

      Version-Release number of selected component (if applicable):

      Global Hub 1.7.0

      How reproducible:

      Reproducible whenever the operator restarts between the cron job run on the 28th of the month and the end of that month, and data is then inserted in the new month.

      Steps to Reproduce:

      1. Wait for the data retention cron job to run on Dec 28, 00:00 (it creates the 2026_01 partition)
      2. Restart the operator between Dec 29-31 (the operator recreates only the 2025_12 and 2025_11 partitions; 2026_01 is lost)
      3. Wait for the month rollover to January
      4. The cron job runs on Jan 1, 00:00 (it creates 2026_02 but does not check for 2026_01)
      5. Attempt to insert event data with a created_at timestamp in January

      Actual results:

      Manager fails to insert event data with error:

      2026/01/05 03:43:45 /workspace/manager/pkg/status/handlers/policy/local_replicated_policy_event_handler.go:122 
      pq: no partition of relation "local_policies" found for row
      
      [2.912ms] [rows:0] INSERT INTO "event"."local_policies" 
      ("event_name","event_namespace","policy_id","message","leaf_hub_name","reason","count","source","compliance","cluster_id","cluster_name","created_at") 
      VALUES ('policies.policy-subscriptions.1887b6f55cb89eaf','local-cluster','f830746d-9308-47c9-a1e1-efeb8790fb42',...,'2026-01-05 03:09:03')
      
      2026-01-05T03:43:45.755Z WARN workerpool/worker.go:131 
      failed to handle event (event.localreplicatedpolicy): failed handling leaf hub LocalPolicyStatusEvent event - 
      pq: no partition of relation "local_policies" found for row
      

      The current month partition (2026_01) is missing from the database.
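
      As a minimal sketch of why this row needs the 2026_01 partition, the snippet below maps the created_at value from the failing INSERT to a YYYY_MM suffix. The helper names partitionSuffix and mustParse are hypothetical and used only for illustration; they are not taken from the Global Hub code base.

```go
package main

import (
	"fmt"
	"time"
)

// partitionSuffix returns the YYYY_MM suffix for the month containing t,
// in the same "2026_01" style this report uses for partitions.
// Hypothetical helper, not the project's actual function.
func partitionSuffix(t time.Time) string {
	return fmt.Sprintf("%04d_%02d", t.Year(), int(t.Month()))
}

// mustParse parses a "YYYY-MM-DD hh:mm:ss" timestamp as UTC and panics on error.
func mustParse(s string) time.Time {
	t, err := time.Parse("2006-01-02 15:04:05", s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// created_at value taken from the failing INSERT in the log above.
	createdAt := mustParse("2026-01-05 03:09:03")
	fmt.Println(partitionSuffix(createdAt)) // prints 2026_01
}
```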

      Expected results:

      The current month partition should exist and data insertion should succeed. The system should handle operator restarts near month boundaries gracefully without losing partition tables.

      Additional info:

      Root Cause:
      The system has a design flaw in partition management with two components creating partitions:
      1. Operator Initialization: Creates current month + previous month partitions (only on operator start/restart)
      2. Data Retention Cron Job: Creates next month partition only (executes 1st, 15th, 28th at 00:00)

      When operator restarts between Dec 28-31, it recreates only 2025_12 and 2025_11, potentially losing the 2026_01 partition created by the cron job.
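
      The interaction can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: it models a restart as leaving just the current and previous month partitions (as described above, without claiming how the partition is actually lost), it assumes an initial operator start on Dec 1, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// partitions is the set of existing month partitions, keyed by YYYY_MM suffix.
type partitions map[string]bool

func suffix(t time.Time) string {
	return fmt.Sprintf("%04d_%02d", t.Year(), int(t.Month()))
}

// monthStart returns the first day of t's month, so AddDate never overflows.
func monthStart(t time.Time) time.Time {
	return time.Date(t.Year(), t.Month(), 1, 0, 0, 0, 0, time.UTC)
}

// operatorInit models the pre-fix operator start.
// ASSUMPTION: afterwards only the current and previous month partitions exist.
func operatorInit(now time.Time) partitions {
	cur := monthStart(now)
	return partitions{suffix(cur): true, suffix(cur.AddDate(0, -1, 0)): true}
}

// cronRun models the pre-fix retention job: it adds next month's partition only.
func cronRun(p partitions, now time.Time) {
	p[suffix(monthStart(now).AddDate(0, 1, 0))] = true
}

// simulate replays the reproduction steps and returns the final partition set.
func simulate() partitions {
	day := func(y int, m time.Month, d int) time.Time {
		return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
	}
	p := operatorInit(day(2025, time.December, 1))  // assumed initial start
	cronRun(p, day(2025, time.December, 28))        // creates 2026_01
	p = operatorInit(day(2025, time.December, 30))  // restart: 2026_01 is gone
	cronRun(p, day(2026, time.January, 1))          // creates 2026_02 only
	return p
}

func main() {
	p := simulate()
	names := make([]string, 0, len(p))
	for name := range p {
		names = append(names, name)
	}
	sort.Strings(names)
	fmt.Println(names, "has 2026_01:", p["2026_01"])
	// prints: [2025_11 2025_12 2026_02] has 2026_01: false
}
```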

      Solution Implemented:
      1. Added an ensurePartitionExists() function that checks for the current month's partition and creates it if it is missing
      2. Modified the data retention job to ensure the current month's partition exists before creating the next month's partition
      3. Updated manager startup to always run the data retention job, ensuring partition integrity on every restart

      PR: https://github.com/stolostron/multicluster-global-hub/pull/2221

      🤖 Generated with Claude Code

              rh-ee-myan Meng Yan
              rh-ee-myan Meng Yan
              Yaheng Liu Yaheng Liu