Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: 1.8
Affects Version/s: None
Component/s: optimization-plugin, optimization-plugin-qe
Labels:
- qe
- triaged

Blocked: False
Blocked Reason: None
Ready: False

Resource Optimization plugin displays recommendations for containers/workloads that no longer exist on the cluster, causing workflow execution failures when users attempt to apply them.

Description

The Resource Optimization plugin UI displays recommendations from the Cost Management API without validating whether the recommended resources (deployments, containers) still exist on the target cluster. When users click "Apply recommendations" for a stale recommendation, the `patch-k8s-resource` workflow attempts to patch a non-existent Kubernetes resource, resulting in HTTP 404 (Not Found) or HTTP 403 (Forbidden) errors.

The Cost Management API returns recommendations based on historical data that may include resources that have been deleted from the cluster. The plugin does not perform client-side or server-side validation to check resource existence before:

Displaying recommendations in the UI
Allowing users to click the "Apply" button
Executing the workflow

Environment

RHDH Version: 1.8 STABLE-RC
Resource Optimization Plugin Version: 1.2.1
Orchestrator Plugin Version: 1.8.0-rc.3
Workflow: patch-k8s-resource (quay.io/orchestrator/serverless-workflow-patch-k8s-resource:latest)
Cluster: OpenShift (tested on ocp-edge73-0)
Cluster ID: ocp-edge73-0-prq7c
Namespace: rhdh-operator

Steps to Reproduce

1. Deploy RHDH 1.8 STABLE-RC with the Resource Optimization plugin
2. Navigate to the Optimizations tab in the Backstage UI
3. Observe the recommendations displayed in the table
4. Identify a recommendation for a resource that has been deleted from the cluster
   - Example: Recommendation shows namespace=ros-payloads, workload=http-client, container=client
   - Verification: `oc get deployment http-client -n ros-payloads` returns "NotFound"
5. Click the "Apply" button on the stale recommendation
   - Result: Workflow execution fails with HTTP 404 or HTTP 403 error

Expected Behavior

The plugin should validate resource existence before displaying recommendations
Recommendations for non-existent resources should either:
- Not be displayed in the UI, OR
- Be displayed with a warning/disclaimer, OR
- Have the "Apply" button disabled with a clear message
If a user attempts to apply a recommendation for a non-existent resource:
- The workflow should validate resource existence before attempting to patch
- A clear error message should be returned: "Resource {namespace}/{workload} not found on cluster. This recommendation may be stale."
- The error should be user-friendly and actionable

Actual Behavior

Stale recommendations are displayed in the UI without any indication they are non-actionable
The "Apply" button is enabled for all recommendations, including stale ones
Clicking "Apply" on a stale recommendation triggers workflow execution
The workflow attempts to PATCH the non-existent resource via Kubernetes API
Workflow fails with HTTP 404 (resource not found) or HTTP 403 (forbidden)
Error messages are technical and not user-friendly

Error Details

Workflow Pod Logs

2025-10-30 20:30:40,348 ERROR [org.jbp.wor.ins.imp.WorkflowProcessInstanceImpl] 
Unexpected error while executing node patch in process instance 3010435b-3531-4bef-978f-afbc878534f5: 
org.jbpm.workflow.instance.WorkflowRuntimeException: [patch-k8s-resource:3010435b-3531-4bef-978f-afbc878534f5 - patch:[uuid=10]] -- HTTP 403 Forbidden

Caused by: WorkItemExecutionError [errorCode=404]
at org.kie.kogito.serverless.workflow.openapi.OpenApiWorkItemHandler.internalExecute(OpenApiWorkItemHandler.java:76)

Workflow Parameters (from error log)

parameters{
  Parameter={
    "clusterName":"ocp-edge73-0-prq7c",
    "resourceType":"deployment",
    "resourceNamespace":"ros-payloads",
    "resourceName":"http-client",
    "containerName":"client",
    "resourceApiVersion":"apis/apps/v1"
  },
  apiVersion=apis/apps/v1,
  kind=deployments,
  name=http-client,
  namespace=ros-payloads
}
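
For reference, the parameters above map onto the standard REST path for a namespaced Kubernetes resource, `/{apiVersion}/namespaces/{namespace}/{kind}/{name}`. The following is a minimal Python sketch of that mapping. The function name `resource_path` is hypothetical, and the log does not show how the workflow itself assembles the URL, so this is an illustration only and not the workflow's actual code.

```python
def resource_path(api_version: str, kind: str, namespace: str, name: str) -> str:
    """Build the REST path of a namespaced Kubernetes resource.

    api_version is expected in the form seen in the log, e.g. "apis/apps/v1";
    kind is the lowercase plural resource name, e.g. "deployments".
    """
    parts = [api_version.strip("/"), "namespaces", namespace, kind, name]
    return "/" + "/".join(parts)


if __name__ == "__main__":
    # Values taken from the workflow parameters in the error log.
    print(resource_path("apis/apps/v1", "deployments", "ros-payloads", "http-client"))
```

A GET or PATCH against this path for a deleted deployment is the request that fails in the log above.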

Cluster Verification

# Check if namespace exists
$ oc get namespace ros-payloads
NAME           STATUS   AGE
ros-payloads   Active   11m

# Check if deployment exists in target namespace  
$ oc get deployment http-client -n ros-payloads
Error from server (NotFound): deployments.apps "http-client" not found

# Check if deployment exists anywhere on cluster
$ oc get deployments --all-namespaces | grep http-client
# Result: NOT FOUND

Root Cause Analysis

The issue stems from a data freshness problem between the Cost Management API and the actual cluster state:

Stale Data Source: Cost Management API returns recommendations based on historical metrics/data that may include resources that no longer exist
No Validation: The Resource Optimization plugin does not validate resource existence before:
- Displaying recommendations
- Allowing users to interact with recommendations
- Executing workflows
Workflow Assumes Existence: The `patch-k8s-resource` workflow attempts to patch resources without first verifying they exist
Poor Error Handling: When resources don't exist, the workflow returns technical HTTP errors (404/403) instead of user-friendly messages

Impact

Severity: Medium-High
User Experience: Poor - Users see actionable recommendations that cannot be applied
Workflow Reliability: Workflow executions fail unexpectedly for valid-seeming recommendations
Data Quality: The plugin presents non-actionable data as actionable
Frequency: Depends on how frequently resources are deleted and how often Cost Management API data is refreshed

Recommended Fixes

1. Frontend Validation (Quick Win)

Add client-side validation in the Resource Optimization plugin UI:

Before displaying "Apply" button, validate resource exists via Kubernetes API
Disable "Apply" button and show warning: "This resource no longer exists on the cluster"
Add visual indicator (e.g., icon/warning badge) for potentially stale recommendations
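
The three points above could be sketched roughly as follows. This is a language-neutral illustration written in Python. The plugin itself is a Backstage frontend, so a real change would live in its UI code. The `Recommendation` and `RecommendationRow` types, the `build_rows` function, and the `exists` callback are all hypothetical names. How existence is actually looked up (for example through a Kubernetes API call) is deliberately left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Warning text proposed in this ticket.
STALE_WARNING = "This resource no longer exists on the cluster"


@dataclass
class Recommendation:
    namespace: str
    workload: str
    container: str


@dataclass
class RecommendationRow:
    recommendation: Recommendation
    apply_enabled: bool
    warning: str = ""


def build_rows(
    recs: Iterable[Recommendation],
    exists: Callable[[str, str], bool],
) -> List[RecommendationRow]:
    """Attach UI state to each recommendation.

    `exists(namespace, workload)` is a caller-supplied check (an assumption of
    this sketch); rows whose workload is missing get a disabled "Apply" button
    and a warning instead of being silently shown as actionable.
    """
    rows = []
    for rec in recs:
        if exists(rec.namespace, rec.workload):
            rows.append(RecommendationRow(rec, apply_enabled=True))
        else:
            rows.append(
                RecommendationRow(rec, apply_enabled=False, warning=STALE_WARNING)
            )
    return rows
```

Keeping the existence check as an injected callback means the same row-building logic can be exercised without a cluster.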

2. Backend Validation (Preferred)

Add server-side validation in the workflow execution:

Before attempting PATCH, verify resource exists via Kubernetes API GET request
Return clear error message if resource doesn't exist: "Resource {namespace}/{workload} not found. This recommendation may be stale."
Handle 404/403 errors gracefully with user-friendly messages
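
A minimal sketch of such a pre-flight check is shown below, assuming a hypothetical `get_status` callable that performs the GET and returns the HTTP status code. The function name `preflight_check` and the 403 wording are illustrative. Only the 404 message is taken from this ticket.

```python
from typing import Callable, Tuple


def preflight_check(
    get_status: Callable[[str], int],
    path: str,
    namespace: str,
    workload: str,
) -> Tuple[bool, str]:
    """Check a resource with GET before PATCH is attempted.

    Returns (ok, message). `get_status(path)` is a hypothetical helper that
    issues the GET request and returns the HTTP status code.
    """
    status = get_status(path)
    if 200 <= status < 300:
        return True, ""
    if status == 404:
        # Message text proposed in this ticket.
        return False, (
            f"Resource {namespace}/{workload} not found. "
            "This recommendation may be stale."
        )
    if status == 403:
        # Illustrative wording, not from the ticket.
        return False, (
            f"Access to {namespace}/{workload} was denied (HTTP 403). "
            "Check the permissions used by the workflow."
        )
    return False, f"Unexpected HTTP {status} while checking {namespace}/{workload}."
```

The PATCH would then only be issued when the first element of the result is true, and the message would be surfaced to the user otherwise.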

3. API-Level Filtering (Long-term)

Work with Cost Management API team to:

Filter out recommendations for resources that no longer exist
Add "resource_exists" or "resource_status" field to recommendation response
Implement time-based expiration for recommendations (e.g., expire recommendations for resources deleted more than 7 days ago)
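
The time-based expiration idea could look roughly like the sketch below. It assumes each recommendation carries a `last_seen` timestamp. That field is hypothetical and is not a known part of the Cost Management API response. The 7-day window mirrors the example above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def drop_expired(
    recs: List[Dict],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[Dict]:
    """Keep only recommendations whose resource was seen within `max_age`.

    Each recommendation is a dict with a timezone-aware `last_seen` datetime
    (a hypothetical field introduced for this sketch).
    """
    return [rec for rec in recs if now - rec["last_seen"] <= max_age]


if __name__ == "__main__":
    now = datetime(2025, 10, 30, tzinfo=timezone.utc)
    recs = [
        {"workload": "fresh", "last_seen": now - timedelta(days=1)},
        {"workload": "stale", "last_seen": now - timedelta(days=8)},
    ]
    print([rec["workload"] for rec in drop_expired(recs, now)])
```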

4. Workflow Enhancement

Update `patch-k8s-resource` workflow to:

Add pre-flight validation step to check resource existence
Handle missing resources gracefully with informative error messages
Optionally auto-skip/disable recommendations for non-existent resources

Workaround

Before applying recommendations, manually verify resources exist:
- `oc get deployment {workload} -n {namespace}`
- `oc get {resourceType} {workload} -n {namespace}`
If the resource doesn't exist, skip that recommendation
Wait for Cost Management API data refresh cycle for stale recommendations to be filtered out
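
When several recommendations need checking, the manual verification can be scripted. The sketch below wraps the same `oc get` command in Python. The helper names `oc_get_command` and `resource_exists` are hypothetical, and the script assumes `oc` is installed and already logged in to the target cluster.

```python
import subprocess
from typing import List


def oc_get_command(resource_type: str, name: str, namespace: str) -> List[str]:
    """Build the same `oc get` invocation used in the manual workaround."""
    return ["oc", "get", resource_type, name, "-n", namespace]


def resource_exists(resource_type: str, name: str, namespace: str) -> bool:
    """Return True if `oc get` succeeds for the resource.

    A non-zero exit code is treated as "does not exist", but it can also mean
    an authentication or connectivity problem, so check stderr when in doubt.
    Raises FileNotFoundError if the `oc` binary is not on PATH.
    """
    result = subprocess.run(
        oc_get_command(resource_type, name, namespace),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Values from the stale recommendation described in this ticket.
    if not resource_exists("deployment", "http-client", "ros-payloads"):
        print("Skip: ros-payloads/http-client not found on the cluster")
```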

Additional Notes

This issue highlights a broader data quality/freshness problem between Cost Management API and cluster state
Similar issues may affect other resource types (StatefulSets, DaemonSets, etc.)
The workflow failure mode (404/403) may vary depending on:
- Whether the namespace exists
- Whether the resource type exists
- RBAC permissions when checking non-existent resources
Investigation revealed a separate API URL configuration issue (the configuration needed to point at the Kubernetes API server, not the Backstage URL)
Investigation also revealed a separate Kie Flyway database migration issue (missing `kie.flyway.enabled=true` property)

Related Issues

FLPATH-2832: Cost Management proxy timeout issues
~~FLPATH-2833~~: Missing correlation_instances table (Kie Flyway database migration issue)
