Feature Request
Resolution: Unresolved
2.6
What is the nature and description of the request?
The aap-gateway-manage migrate_service_data command currently runs on every gateway operator reconciliation cycle. This task attempts to connect to all registered backend services (controller, hub, eda) to sync organization, team, user, and role data into the gateway database.
Current behavior:
- Task runs on every reconciliation (pod restart, deployment update, operator restart, etc.)
- Task requires ALL backend services to be reachable via localhost:8000 (through Envoy proxy)
- If any backend is temporarily unavailable, the task fails with "Connection refused"
- The failed task is retried several times and blocks the gateway deployment, preventing the gateway from becoming ready
- Even after days of retrying, the task does not self-heal if the root condition persists
Requested behavior:
- Task should only run when actually needed:
  - First deployment (fresh install)
  - New service registered in ServiceCluster table
  - Service configuration changes (routes, nodes, etc.)
  - Manual trigger via annotation or CR field
- Task should be idempotent - if data is already migrated, skip gracefully
- Task should handle partial failures - if one service is unavailable, migrate others and retry the failed one later
- Task failure should not block gateway readiness for previously-working deployments
Why does the customer need this? (List the business requirements here)
1. Gateway resilience during routine operations
- Customer performs rolling restarts of gateway deployments for maintenance
- Current behavior: Any gateway restart triggers migrate_service_data, which fails if backend services have any latency/availability issues
- Impact: Gateway becomes unavailable until all backend services are perfectly reachable
2. External database deployments are affected
- Customer uses external PostgreSQL databases accessed through firewalls
- Database latency causes LDS (Listener Discovery Service) generation to exceed Envoy's timeout
- When LDS fails, Envoy can't route traffic to localhost:8000, causing migrate_service_data to fail with "Connection refused"
- This creates a cascading failure: slow DB → LDS timeout → no routing → migrate_service_data fails → gateway blocked
3. Reproducible on all instances
- Customer has multiple AAP 2.6 instances (aap-primary-26, aap-andrew-rpm-import-26, etc.)
- Issue is 100% reproducible on every instance, including brand new deployments
- Blocking internal testing and production readiness
4. No sustainable workaround
- Customer's only recovery path is to manually delete ServiceCluster entries for problematic services
- This removes functionality (e.g., hub access) rather than fixing the root cause
- Workaround must be repeated after every gateway restart
How would you like to achieve this? (List the functional requirements here)
1. Add idempotency checks to migrate_service_data
- Before running, check if migration is actually needed
- Compare ServiceCluster timestamps with last successful migration
- Skip if no changes detected
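The timestamp comparison above could look roughly like the following. This is a minimal sketch, not the existing command's logic: `ClusterRecord`, its `modified` field, and `migration_needed` are hypothetical names, and it assumes that each ServiceCluster row carries a modification timestamp and that the time of the last successful migration is persisted somewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClusterRecord:
    """Hypothetical stand-in for a ServiceCluster row."""
    name: str
    modified: datetime  # assumed "last modified" timestamp


def migration_needed(clusters: Iterable[ClusterRecord],
                     last_success: Optional[datetime]) -> bool:
    """Return True if any cluster changed after the last successful migration."""
    if last_success is None:
        # No recorded success yet: treat this as a fresh install.
        return True
    return any(c.modified > last_success for c in clusters)
```

Treating a missing `last_success` as "migration needed" covers the fresh-install case without a separate flag.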
2. Add conditional execution logic to operator reconciliation
- Track migration state in CR status or annotation
- Only trigger migration on:
  - status.migrationRequired: true
  - New ServiceCluster entries added
  - CR annotation aap.ansible.com/force-migration: "true"
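The three trigger conditions above can be expressed as a single decision function. The sketch below only illustrates the logic; the operator itself may not be written in Python. `should_trigger_migration` is a hypothetical name, and it assumes the CR is available as a plain dictionary and that the operator remembers which ServiceCluster names it has already migrated.

```python
from typing import Any, Mapping, Set

# Annotation name taken from the request text.
FORCE_ANNOTATION = "aap.ansible.com/force-migration"


def should_trigger_migration(cr: Mapping[str, Any],
                             known_clusters: Set[str],
                             current_clusters: Set[str]) -> bool:
    """Decide whether this reconciliation should run the migration."""
    status = cr.get("status") or {}
    annotations = (cr.get("metadata") or {}).get("annotations") or {}

    # 1. The CR status explicitly asks for a migration.
    if status.get("migrationRequired") is True:
        return True
    # 2. At least one ServiceCluster entry has not been migrated before.
    if current_clusters - known_clusters:
        return True
    # 3. An administrator forced a run through the annotation.
    return annotations.get(FORCE_ANNOTATION) == "true"
```

For example, `should_trigger_migration({}, {"controller"}, {"controller", "hub"})` returns `True` because `hub` is new, while the same call with identical sets returns `False`.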
3. Implement graceful degradation
- If a backend service is unavailable, migrate available services
- Mark unavailable services for retry on next reconciliation
- Don't block gateway readiness for partial migration failures
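One possible shape for the graceful degradation described above is a per-service loop that records failures instead of aborting. In this sketch `migrate_available` and the `migrate_one` callable are hypothetical, and it assumes an unreachable backend surfaces as Python's built-in `ConnectionError` or `TimeoutError`; a real HTTP client may raise its own exception types that would need to be caught as well.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def migrate_available(services: Iterable[str],
                      migrate_one: Callable[[str], None]
                      ) -> Tuple[List[str], Dict[str, str]]:
    """Migrate each service independently.

    Returns the services that were migrated and a mapping of
    service name -> error message for those to retry later.
    """
    migrated: List[str] = []
    pending: Dict[str, str] = {}
    for name in services:
        try:
            migrate_one(name)
        except (ConnectionError, TimeoutError) as exc:
            # Record the failure and continue with the next service.
            pending[name] = str(exc)
        else:
            migrated.append(name)
    return migrated, pending
```

The caller could store `pending` in the CR status so that the next reconciliation retries only those services, and report readiness based on the gateway itself rather than on `pending` being empty.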
4. Add timeout/circuit breaker
- If migration fails N times consecutively, stop retrying automatically
- Log warning and set CR status condition
- Require manual intervention (annotation) to retry
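The circuit breaker above amounts to a counter of consecutive failures with a manual reset. A minimal sketch follows, with the hypothetical class name `MigrationBreaker`; the threshold `max_failures` is the "N" from the request and its value is left to the implementer.

```python
from dataclasses import dataclass


@dataclass
class MigrationBreaker:
    """Stops automatic retries after too many consecutive failures."""
    max_failures: int
    consecutive_failures: int = 0

    @property
    def is_open(self) -> bool:
        # Open breaker: automatic retries are suspended.
        return self.consecutive_failures >= self.max_failures

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1

    def manual_reset(self) -> None:
        # Called when an administrator requests a retry, e.g. via annotation.
        self.consecutive_failures = 0
```

While `is_open` is true, the operator would skip the migration, log a warning, and set the CR status condition; `manual_reset()` corresponds to the manual intervention step.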
5. Consider separating migration from reconciliation
- Run migration as a Kubernetes Job instead of inline task
- Job can have its own retry/backoff logic
- Gateway deployment proceeds independently
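If the migration were moved into a Kubernetes Job, the manifest could be generated along these lines. This is a sketch under assumptions: `build_migration_job`, the container name, and the default `backoff_limit` are hypothetical, and the image must be supplied by the caller. It also does not address how the Job's pod would reach the backend services, which the current task accesses through the Envoy proxy on localhost:8000.

```python
from typing import Any, Dict


def build_migration_job(name: str, namespace: str, image: str,
                        backoff_limit: int = 4) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest that runs migrate_service_data once."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Kubernetes retries failed pods with backoff up to this limit.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "migrate-service-data",  # hypothetical name
                        "image": image,
                        "command": ["aap-gateway-manage",
                                    "migrate_service_data"],
                    }],
                }
            },
        },
    }
```

Because the Job has its own `backoffLimit`, a failing migration exhausts its retries in the Job rather than in the reconciliation loop, so the gateway deployment can proceed independently.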
List any affected known dependencies: Doc, UI, etc.
- aap-gateway-operator: Primary component affected
- aap_gateway_api (migrate_service_data.py): Management command implementation
- ansible_base.resource_registry.rest_client: Service communication layer
- Documentation: Operator troubleshooting guide may need updates
- No UI changes expected