  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-26524

[gitops] Refactor the HealthCheck and Wait Job to match the official documentation during deployment


    • Story
    • Resolution: Done
    • GitOps
    • RHOS Upgrades 2025 Sprint 22
    • 1

      Goal

      As a GitOps engineer deploying RHOSO18, I want the ArgoCD HealthChecks and Wait Jobs to align with the official Red Hat OpenStack Services on OpenShift documentation, so that our deployment process follows supported patterns and is easier to troubleshoot when issues arise. A key objective is to replace the custom transient error handling with the official wait conditions. Any transient error that remains is then an upstream openstack-operator bug to report, not something to work around in our GitOps implementation.

      Context

      The current GitOps implementation has several blocking mechanisms and health checks that deviate from the official RHOSO18 deployment documentation. These inconsistencies create maintenance overhead, potential confusion when comparing our automated deployment with manual procedures, and unnecessary complexity in handling transient states that should be managed by the upstream operators themselves.

      Acceptance Criteria

      1. OpenStack Operators Installation Wait Condition Alignment

      • Refactor approve-installplan job in manifests/operators/openstack/components/approve-installplan/job.yaml
        • Replace the wait_for_crd() function's current wait condition:
          oc get crd openstacks.operator.openstack.org -o jsonpath='{.status.conditions[?(@.type=="Established")].status}' | grep -q True
          
        • Implement the official documentation wait condition:
          oc wait csv -n openstack-operators -l operators.coreos.com/openstack-operator.openstack-operators="" --for jsonpath='{.status.phase}'=Succeeded
          
        • Update the function name from wait_for_crd() to wait_for_csv() to reflect the new behavior
        • Verify the job completes successfully with the new wait condition in both initial installation and update scenarios
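
      The renamed function could look roughly like the sketch below. It wraps the official `oc wait csv` condition in a retry loop, because `oc wait` exits with an error when no resource matches the selector yet, which is the case right after the InstallPlan is approved. The retry count, sleep interval, per-attempt timeout, and the variable names are assumptions for illustration, not values taken from the existing job.

```shell
# Sketch of wait_for_csv(); OPERATOR_NAMESPACE, CSV_RETRIES and
# CSV_SLEEP_SECONDS are hypothetical knobs, not part of the current job.
OPERATOR_NAMESPACE="${OPERATOR_NAMESPACE:-openstack-operators}"
CSV_SELECTOR='operators.coreos.com/openstack-operator.openstack-operators='
CSV_RETRIES="${CSV_RETRIES:-60}"
CSV_SLEEP_SECONDS="${CSV_SLEEP_SECONDS:-10}"

wait_for_csv() {
    csv_attempt=1
    while [ "$csv_attempt" -le "$CSV_RETRIES" ]; do
        # Official wait condition: the CSV must reach phase Succeeded.
        # oc wait fails right away if the CSV does not exist yet, so retry.
        if oc wait csv -n "$OPERATOR_NAMESPACE" -l "$CSV_SELECTOR" \
            --for jsonpath='{.status.phase}'=Succeeded --timeout=30s; then
            return 0
        fi
        echo "CSV not ready (attempt ${csv_attempt}/${CSV_RETRIES}), retrying" >&2
        csv_attempt=$((csv_attempt + 1))
        sleep "$CSV_SLEEP_SECONDS"
    done
    echo "Timed out waiting for the openstack-operator CSV" >&2
    return 1
}
```

      In an update scenario the CSV already exists, so the first attempt can succeed immediately if the phase is already Succeeded.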

      2. OpenStack Control Plane Health Check Message Alignment

      • Update ArgoCD HealthCheck for OpenStackControlPlane in manifests/operations/gitops-bootstrap/enable/argocd.yaml
        • Current behavior: Waits for Ready condition status to be True with custom transient error handling
        • Required behavior: Follow the official documentation and check for the "Setup complete" message
        • Remove the custom transient error handling logic (the complex succeeded_count < 2 workaround)
        • Implement a simple health check aligned with the official documentation
        • Investigate how to access the MESSAGE column equivalent (likely obj.status.message or similar field)
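
      To make the intended decision logic concrete, here is a shell sketch of what the simplified check would evaluate. It assumes the MESSAGE column is backed by the message of the Ready condition, which is exactly what the investigation item above still has to confirm. The namespace default and function names are hypothetical. The real ArgoCD health check would express the same rule in its own health check script; this sketch only illustrates the rule and a way to test it by hand.

```shell
# Assumption: the MESSAGE column maps to the Ready condition's message.
CONTROLPLANE_NAMESPACE="${CONTROLPLANE_NAMESPACE:-openstack}"

# Print the Ready condition message of the named OpenStackControlPlane.
ready_message() {
    oc get openstackcontrolplane "$1" -n "$CONTROLPLANE_NAMESPACE" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}

# Map a message to a health status: only "Setup complete" is Healthy.
# Everything else is Progressing, with no special-casing of transient errors.
controlplane_health() {
    if [ "$1" = "Setup complete" ]; then
        echo "Healthy"
    else
        echo "Progressing"
    fi
}
```

      A manual check could then be `controlplane_health "$(ready_message <control-plane-name>)"`, where the control plane name is whatever the deployment uses.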

      3. Data Plane Health Check Investigation and Optimization

      • Research official documentation section 5.7 "Data plane conditions and states"
      • Simplify ArgoCD health checks for:
        • OpenStackDataPlaneNodeSet: Remove custom transient error handling, use official wait patterns
        • OpenStackDataPlaneDeployment: Remove custom transient error handling, use official wait patterns
      • Evaluate alternative health check approaches:
        • For OpenStackDataPlaneNodeSet: Test SetupReady condition vs current Ready condition
        • For OpenStackDataPlaneDeployment: Test Deployed status field vs current Ready condition
        • Document pros/cons of each approach regarding reliability and official documentation alignment
      • Implement simplified approach that removes custom workarounds and follows official patterns
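
      The two alternatives under evaluation can be tried by hand with something like the sketch below. It assumes the Deployed status field is exposed as a boolean at `.status.deployed`, which should be verified against the CRD as part of the investigation. The namespace default, the 10m timeout default, and the function names are hypothetical.

```shell
DATAPLANE_NAMESPACE="${DATAPLANE_NAMESPACE:-openstack}"

# Candidate for OpenStackDataPlaneNodeSet: block on the SetupReady condition.
# $1 = node set name, $2 = optional timeout (default 10m, an assumption).
wait_nodeset_setup_ready() {
    oc wait openstackdataplanenodeset "$1" -n "$DATAPLANE_NAMESPACE" \
        --for condition=SetupReady --timeout="${2:-10m}"
}

# Candidate for OpenStackDataPlaneDeployment: read the Deployed status field.
# Assumes the field lives at .status.deployed and prints "true" when set.
deployment_is_deployed() {
    deployed_value=$(oc get openstackdataplanedeployment "$1" \
        -n "$DATAPLANE_NAMESPACE" -o jsonpath='{.status.deployed}')
    [ "$deployed_value" = "true" ]
}
```

      Running both candidates next to the current Ready-condition check during a deployment gives the raw material for the pros/cons comparison.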

      4. Documentation Updates

      • Remove all TODO items from docs/argocd-application-deployment.md
      • Update code examples in documentation to match implemented changes
      • Add rationale sections explaining why specific approaches were chosen
      • Update debugging procedures to reference official OpenStack operator troubleshooting instead of custom workarounds
      • Document the removal of transient error handling and explain that such issues should be reported as upstream bugs

      5. Integration and Compatibility

      • Maintain existing functionality: Ensure update scenarios (detected by existing OpenStack CR) continue to work
      • Remove transient error handling complexity: Eliminate custom workarounds for temporary "Degraded" states, relying on official operator behavior instead

      6. Testing and Validation

      • Create deployment state capture mechanism: After each blocking state, capture the associated ArgoCD application state and the status of all its components
      • Implement GitLab artifact collection: Send captured state artifacts to GitLab for analysis and future automated testing capabilities
      • Integration testing: Verify complete deployment sequence works with all refactored components using the new state capture mechanism
      • Regression testing: Confirm both initial installation and update scenarios function correctly
      • Documentation validation: Ensure all code examples in docs match actual implementation
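
      A minimal sketch of the state capture step, under stated assumptions: ArgoCD Application resources live in the `openshift-gitops` namespace, and captured files are written to a directory that the CI job later publishes as artifacts. The directory layout, variable names, and function name are hypothetical.

```shell
ARTIFACT_DIR="${ARTIFACT_DIR:-artifacts/state}"
ARGOCD_NAMESPACE="${ARGOCD_NAMESPACE:-openshift-gitops}"

# capture_state <stage> <application>
# Saves the full Application manifest plus a per-resource sync/health summary.
capture_state() {
    capture_out="$ARTIFACT_DIR/$1"
    mkdir -p "$capture_out" || return 1
    # Full Application object, including sync and health status.
    oc get applications.argoproj.io "$2" -n "$ARGOCD_NAMESPACE" -o yaml \
        > "$capture_out/application.yaml" || return 1
    # One line per managed resource: kind, name, sync status, health status.
    oc get applications.argoproj.io "$2" -n "$ARGOCD_NAMESPACE" \
        -o jsonpath='{range .status.resources[*]}{.kind}{"\t"}{.name}{"\t"}{.status}{"\t"}{.health.status}{"\n"}{end}' \
        > "$capture_out/resources.tsv" || return 1
}
```

      Calling `capture_state <stage> <application>` after each blocking state leaves one subdirectory per stage, which keeps the artifacts easy to diff between an initial installation and an update run.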

      Definition of Done

      • All health checks and wait jobs align with official RHOSO18 documentation patterns
      • Custom transient error handling complexity is removed from ArgoCD health checks
      • All TODO items are resolved and removed from documentation
      • Deployment sequence completes successfully for both initial installation and updates
      • State capture and artifact collection mechanism is implemented and functional
      • Code review completed and approved
      • Integration tests pass with new implementations and generate proper artifacts
      • Documentation is updated and consistent with implementation

              sathlang@redhat.com Sofer Athlan Guyot
              rhos-dfg-upgrades