Bug
Resolution: Done
Major
Quality / Stability / Reliability
GH Train-35
Description of problem:
When the manager tries to insert event data at the beginning of a new month, the insert fails with a "no partition of relation ... found for row" error. This occurs when the operator restarts shortly before the month boundary (e.g., Dec 28-31), after the data retention cron job has already created the next month's partition.
The affected tables include:
- event.local_policies
- event.local_root_policies
- event.managed_clusters
- history.local_compliance
Version-Release number of selected component (if applicable):
Global Hub 1.7.0
How reproducible:
Reproducible whenever the operator restarts between the cron job run on the 28th and the end of the month, and data is then inserted in the new month.
Steps to Reproduce:
- Wait for the data retention cron job to run on Dec 28 at 00:00 (it creates the 2026_01 partition)
- Restart the operator between Dec 29 and Dec 31 (the operator recreates only the 2025_12 and 2025_11 partitions; 2026_01 is lost)
- Wait for the month to roll over to January
- The cron job runs on Jan 1 at 00:00 (it creates 2026_02 but does not check for 2026_01)
- Attempt to insert event data with a created_at timestamp in January
Actual results:
The manager fails to insert event data with the following error:
2026/01/05 03:43:45 /workspace/manager/pkg/status/handlers/policy/local_replicated_policy_event_handler.go:122 pq: no partition of relation "local_policies" found for row
[2.912ms] [rows:0] INSERT INTO "event"."local_policies" ("event_name","event_namespace","policy_id","message","leaf_hub_name","reason","count","source","compliance","cluster_id","cluster_name","created_at") VALUES ('policies.policy-subscriptions.1887b6f55cb89eaf','local-cluster','f830746d-9308-47c9-a1e1-efeb8790fb42',...,'2026-01-05 03:09:03')
2026-01-05T03:43:45.755Z WARN workerpool/worker.go:131 failed to handle event (event.localreplicatedpolicy): failed handling leaf hub LocalPolicyStatusEvent event - pq: no partition of relation "local_policies" found for row
The current month partition (2026_01) is missing from the database.
Expected results:
The current month partition should exist and data insertion should succeed. The system should handle operator restarts near month boundaries gracefully without losing partition tables.
Additional info:
Root Cause:
Partition management has a design flaw: two components create partitions, and neither guarantees that the current month's partition exists after a restart near a month boundary:
1. Operator Initialization: Creates the current month and previous month partitions (only on operator start/restart)
2. Data Retention Cron Job: Creates the next month partition only (runs on the 1st, 15th, and 28th at 00:00)
When the operator restarts between Dec 28 and Dec 31, it recreates only 2025_12 and 2025_11, potentially losing the 2026_01 partition created by the cron job. The next cron run on Jan 1 then creates 2026_02, so nothing recreates 2026_01.
Solution Implemented:
1. Added an ensurePartitionExists() function that checks for the current month's partition and creates it if it is missing
2. Modified the data retention job to ensure the current month's partition exists before creating the next month's partition
3. Updated manager startup to always run the data retention job, so partition integrity is verified on every restart
PR: https://github.com/stolostron/multicluster-global-hub/pull/2221
🤖 Generated with Claude Code
- is related to: ACM-27978 Use PostgreSQL trigger to replace data retention job logic (New)