FlightPath / FLPATH-2834

Resource Optimization plugin displays recommendations for non-existent resources causing workflow failures

      Resource Optimization plugin displays recommendations for containers/workloads that no longer exist on the cluster, causing workflow execution failures when users attempt to apply them.

      Description

      The Resource Optimization plugin UI displays recommendations from the Cost Management API without validating whether the recommended resources (deployments, containers) still exist on the target cluster. When users click "Apply recommendations" for a stale recommendation, the `patch-k8s-resource` workflow attempts to patch a non-existent Kubernetes resource, resulting in HTTP 404 (Not Found) or HTTP 403 (Forbidden) errors.

      The Cost Management API returns recommendations based on historical data that may include resources that have been deleted from the cluster. The plugin does not perform client-side or server-side validation to check resource existence before:

      • Displaying recommendations in the UI
      • Allowing users to click the "Apply" button
      • Executing the workflow

      Environment

      • RHDH Version: 1.8 STABLE-RC
      • Resource Optimization Plugin Version: 1.2.1
      • Orchestrator Plugin Version: 1.8.0-rc.3
      • Workflow: patch-k8s-resource (quay.io/orchestrator/serverless-workflow-patch-k8s-resource:latest)
      • Cluster: OpenShift (tested on ocp-edge73-0)
      • Cluster ID: ocp-edge73-0-prq7c
      • Namespace: rhdh-operator

      Steps to Reproduce

      1. Deploy RHDH 1.8 STABLE-RC with the Resource Optimization plugin
      2. Navigate to the Optimizations tab in the Backstage UI
      3. Observe recommendations displayed in the table
      4. Identify a recommendation for a resource that has been deleted from the cluster
        • Example: Recommendation shows namespace=ros-payloads, workload=http-client, container=client
        • Verification: `oc get deployment http-client -n ros-payloads` returns "NotFound"
      5. Click the "Apply" button on the stale recommendation
        • Result: Workflow execution fails with HTTP 404 or HTTP 403 error

      Expected Behavior

      1. The plugin should validate resource existence before displaying recommendations
      2. Recommendations for non-existent resources should either:
        • Not be displayed in the UI, OR
        • Be displayed with a warning/disclaimer, OR
        • Have the "Apply" button disabled with a clear message
      3. If a user attempts to apply a recommendation for a non-existent resource:
        • The workflow should validate resource existence before attempting to patch
        • A clear error message should be returned: "Resource {namespace}/{workload} not found on cluster. This recommendation may be stale."
        • The error should be user-friendly and actionable

      Actual Behavior

      1. Stale recommendations are displayed in the UI without any indication they are non-actionable
      2. The "Apply" button is enabled for all recommendations, including stale ones
      3. Clicking "Apply" on a stale recommendation triggers workflow execution
      4. The workflow attempts to PATCH the non-existent resource via Kubernetes API
      5. Workflow fails with HTTP 404 (resource not found) or HTTP 403 (forbidden)
      6. Error messages are technical and not user-friendly

      Error Details

      Workflow Pod Logs

      2025-10-30 20:30:40,348 ERROR [org.jbp.wor.ins.imp.WorkflowProcessInstanceImpl] 
      Unexpected error while executing node patch in process instance 3010435b-3531-4bef-978f-afbc878534f5: 
      org.jbpm.workflow.instance.WorkflowRuntimeException: [patch-k8s-resource:3010435b-3531-4bef-978f-afbc878534f5 - patch:[uuid=10]] -- HTTP 403 Forbidden
      
      Caused by: WorkItemExecutionError [errorCode=404]
      at org.kie.kogito.serverless.workflow.openapi.OpenApiWorkItemHandler.internalExecute(OpenApiWorkItemHandler.java:76)
      

      Workflow Parameters (from error log)

      parameters{
        Parameter={
          "clusterName":"ocp-edge73-0-prq7c",
          "resourceType":"deployment",
          "resourceNamespace":"ros-payloads",
          "resourceName":"http-client",
          "containerName":"client",
          "resourceApiVersion":"apis/apps/v1"
        },
        apiVersion=apis/apps/v1,
        kind=deployments,
        name=http-client,
        namespace=ros-payloads
      }
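
For reference, the logged parameters appear to correspond to a namespaced Kubernetes REST path of the form `/{apiVersion}/namespaces/{namespace}/{kind}/{name}`. The sketch below (Python, illustrative only) shows the path the patch step would most likely target. The helper name `resource_path` is hypothetical and is not part of the workflow; it assumes the `apiVersion` value already carries the `apis/<group>/<version>` prefix, as it does in the log.

```python
def resource_path(api_version: str, kind: str, namespace: str, name: str) -> str:
    """Build the namespaced Kubernetes REST path for a resource.

    Hypothetical helper: assumes `api_version` already includes the
    `apis/<group>/<version>` prefix, as in the logged parameters.
    """
    return f"/{api_version.strip('/')}/namespaces/{namespace}/{kind}/{name}"


# Values taken from the logged workflow parameters.
path = resource_path("apis/apps/v1", "deployments", "ros-payloads", "http-client")
print(path)  # /apis/apps/v1/namespaces/ros-payloads/deployments/http-client
```

A GET on this path for the deleted deployment is what returns NotFound in the cluster verification.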
      

      Cluster Verification

      # Check if namespace exists
      $ oc get namespace ros-payloads
      NAME           STATUS   AGE
      ros-payloads   Active   11m
      
      # Check if deployment exists in target namespace  
      $ oc get deployment http-client -n ros-payloads
      Error from server (NotFound): deployments.apps "http-client" not found
      
      # Check if deployment exists anywhere on cluster
      $ oc get deployments --all-namespaces | grep http-client
      # Result: NOT FOUND
      

      Root Cause Analysis

      The issue stems from a data freshness problem between the Cost Management API and the actual cluster state:

      1. Stale Data Source: Cost Management API returns recommendations based on historical metrics/data that may include resources that no longer exist
      2. No Validation: The Resource Optimization plugin does not validate resource existence before:
          • Displaying recommendations
          • Allowing users to interact with recommendations
          • Executing workflows
      3. Workflow Assumes Existence: The `patch-k8s-resource` workflow attempts to patch resources without first verifying they exist
      4. Poor Error Handling: When resources don't exist, the workflow returns technical HTTP errors (404/403) instead of user-friendly messages

      Impact

      • Severity: Medium-High
      • User Experience: Poor - Users see actionable recommendations that cannot be applied
      • Workflow Reliability: Workflow executions fail unexpectedly for valid-seeming recommendations
      • Data Quality: The plugin presents non-actionable data as actionable
      • Frequency: Depends on how frequently resources are deleted and how often Cost Management API data is refreshed

      Recommended Fixes

      1. Frontend Validation (Quick Win)

      Add client-side validation in the Resource Optimization plugin UI:

      • Before displaying the "Apply" button, check via the Kubernetes API that the resource exists
      • If it does not, disable the "Apply" button and show a warning: "This resource no longer exists on the cluster"
      • Add a visual indicator (e.g., icon/warning badge) for potentially stale recommendations
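
A minimal sketch of the row-annotation logic, written in Python for brevity (the plugin itself would implement this in its own UI code). All names here (`Recommendation`, `RecommendationRow`, `build_rows`, `workload_exists`) are hypothetical, and the existence lookup is injected so the sketch makes no assumption about how the plugin reaches the Kubernetes API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

STALE_WARNING = "This resource no longer exists on the cluster"


@dataclass(frozen=True)
class Recommendation:
    namespace: str
    workload: str
    container: str


@dataclass(frozen=True)
class RecommendationRow:
    recommendation: Recommendation
    apply_enabled: bool
    warning: str  # empty string when the resource was found


def build_rows(
    recommendations: Iterable[Recommendation],
    workload_exists: Callable[[str, str], bool],
) -> List[RecommendationRow]:
    """Annotate each recommendation with the state of its "Apply" button.

    `workload_exists(namespace, workload)` is a hypothetical lookup that the
    plugin would back with a Kubernetes API call.
    """
    rows = []
    for rec in recommendations:
        exists = workload_exists(rec.namespace, rec.workload)
        rows.append(
            RecommendationRow(
                recommendation=rec,
                apply_enabled=exists,
                warning="" if exists else STALE_WARNING,
            )
        )
    return rows


# The stale recommendation from this ticket: nothing exists on the cluster.
stale = Recommendation("ros-payloads", "http-client", "client")
rows = build_rows([stale], lambda namespace, workload: False)
print(rows[0].apply_enabled, rows[0].warning)
```

Keeping the lookup as a parameter also allows results to be cached or batched per namespace, which matters if the table shows many recommendations.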

      2. Backend Validation (Preferred)

      Add server-side validation in the workflow execution:

      • Before attempting PATCH, verify resource exists via Kubernetes API GET request
      • Return clear error message if resource doesn't exist: "Resource {namespace}/{workload} not found. This recommendation may be stale."
      • Handle 404/403 errors gracefully with user-friendly messages
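
One way to sketch the error translation, in Python for illustration only. The function name `friendly_patch_error` is hypothetical. The 404 text follows the message proposed above; the 403 and fallback wording are assumptions, not agreed text.

```python
def friendly_patch_error(status: int, namespace: str, workload: str) -> str:
    """Translate an HTTP status from the patch call into a user-facing message."""
    target = f"{namespace}/{workload}"
    if status == 404:
        return f"Resource {target} not found. This recommendation may be stale."
    if status == 403:
        # Assumed wording: a 403 can mean missing RBAC permissions, and it has
        # also been observed in this ticket for a deleted resource.
        return (
            f"Not permitted to modify {target}. The resource may have been "
            "deleted, or the workflow may lack the required permissions."
        )
    return f"Failed to patch {target} (HTTP {status})."


print(friendly_patch_error(404, "ros-payloads", "http-client"))
```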

      3. API-Level Filtering (Long-term)

      Work with Cost Management API team to:

      • Filter out recommendations for resources that no longer exist
      • Add "resource_exists" or "resource_status" field to recommendation response
      • Implement time-based expiration for recommendations (e.g., drop recommendations for resources deleted more than 7 days ago)
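
The time-based expiration idea could look roughly like the Python sketch below. It assumes the API can report a deletion timestamp per workload, which the current response is not known to include; `drop_expired` and the `(name, deleted_at)` entry shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

# (workload name, deletion time, or None if the resource is still present)
Entry = Tuple[str, Optional[datetime]]


def drop_expired(
    entries: Iterable[Entry],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Keep workloads that still exist or were deleted within `max_age`."""
    kept = []
    for name, deleted_at in entries:
        if deleted_at is None or now - deleted_at <= max_age:
            kept.append(name)
    return kept


now = datetime(2025, 10, 30, tzinfo=timezone.utc)
entries = [
    ("still-running", None),
    ("deleted-recently", now - timedelta(days=3)),
    ("deleted-long-ago", now - timedelta(days=8)),
]
print(drop_expired(entries, now))  # ['still-running', 'deleted-recently']
```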

      4. Workflow Enhancement

      Update `patch-k8s-resource` workflow to:

      • Add pre-flight validation step to check resource existence
      • Handle missing resources gracefully with informative error messages
      • Optionally auto-skip/disable recommendations for non-existent resources
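
The pre-flight step can be sketched as a GET followed by a conditional PATCH. This is a Python illustration of the control flow only, not the workflow's actual definition. `patch_with_preflight`, `http_get`, and `http_patch` are hypothetical names, and the two callables simply return the HTTP status code of the corresponding Kubernetes API request.

```python
from typing import Callable, Dict


def patch_with_preflight(
    path: str,
    patch_body: Dict,
    http_get: Callable[[str], int],
    http_patch: Callable[[str, Dict], int],
) -> Dict:
    """Check that the resource exists, then patch it.

    `http_get` and `http_patch` are hypothetical callables that return the
    HTTP status code of the corresponding request.
    """
    get_status = http_get(path)
    if get_status != 200:
        # Stop early: no PATCH is sent for a missing or inaccessible resource.
        return {
            "patched": False,
            "status": get_status,
            "message": (
                f"Pre-flight check failed for {path} (HTTP {get_status}). "
                "This recommendation may be stale."
            ),
        }
    patch_status = http_patch(path, patch_body)
    ok = patch_status == 200
    return {
        "patched": ok,
        "status": patch_status,
        "message": "Patch applied." if ok else f"Patch failed (HTTP {patch_status}).",
    }


# Simulate the failure in this ticket: the GET reports 404, so PATCH is skipped.
result = patch_with_preflight(
    "/apis/apps/v1/namespaces/ros-payloads/deployments/http-client",
    {},
    http_get=lambda path: 404,
    http_patch=lambda path, body: 200,
)
print(result["patched"], result["status"])
```

A resource can still be deleted between the GET and the PATCH, so the PATCH error path needs friendly handling as well.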

      Workaround

      1. Before applying recommendations, manually verify that the resources exist:
        • For deployments: `oc get deployment {workload} -n {namespace}`
        • For other resource types: `oc get {resourceType} {workload} -n {namespace}`
      2. If the resource doesn't exist, skip that recommendation
      3. Wait for the next Cost Management API data refresh, after which stale recommendations should be filtered out
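
The manual check can be scripted when many recommendations need verifying. The Python sketch below wraps the same `oc get` command; the helper names are hypothetical, and it assumes `oc` is on the PATH and already logged in to the target cluster.

```python
import subprocess
from typing import List


def oc_get_command(resource_type: str, name: str, namespace: str) -> List[str]:
    """Build the `oc get` command used in the manual workaround."""
    return ["oc", "get", resource_type, name, "-n", namespace]


def resource_exists(resource_type: str, name: str, namespace: str) -> bool:
    """Return True when `oc get` exits with status 0.

    A non-zero exit covers NotFound but also other failures (for example
    an expired login), so treat False as "could not confirm existence".
    """
    result = subprocess.run(
        oc_get_command(resource_type, name, namespace),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


print(" ".join(oc_get_command("deployment", "http-client", "ros-payloads")))
```

Calling `resource_exists("deployment", "http-client", "ros-payloads")` against the cluster described in this ticket would be expected to return False, matching the NotFound output shown under Cluster Verification.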

      Additional Notes

      • This issue highlights a broader data quality/freshness problem between Cost Management API and cluster state
      • Similar issues may affect other resource types (StatefulSets, DaemonSets, etc.)
      • The workflow failure mode (404/403) may vary depending on:
          • Whether the namespace exists
          • Whether the resource type exists
          • RBAC permissions when checking non-existent resources
      • The API URL configuration issue found during investigation is separate from this one (the workflow needed the Kubernetes API server URL, not the Backstage URL)
      • The Kie Flyway database migration issue found during investigation is also separate (the `kie.flyway.enabled=true` property was missing)

      Related Issues

      • FLPATH-2832: Cost Management proxy timeout issues
      • FLPATH-2833: Missing correlation_instances table (Kie Flyway database migration issue)

              ydayagi yaron dayagi
              gharden1 Gary Harden