FLPATH-2917

ros-ocp-backend processor rejects all CSV files from Cost Management Operator 4.2.0 due to GPU metrics schema mismatch


      Summary

      The ros-ocp-backend processor rejects every CSV file from Cost Management Operator 4.2.0 because of a schema mismatch: the operator generates 49-column CSVs that include GPU metrics, while the processor expects an exact match on its 37 columns.

      Environment

      • OpenShift Version: 4.18.26 (Kubernetes v1.31.13)
      • Helm Chart: ros-ocp-0.1.8 (revision 2)
      • Cost Management Operator Version: 4.2.0
      • Cost Management Operator Image: registry.redhat.io/costmanagement/costmanagement-metrics-rhel9-operator@sha256:9fd693b52c8927d8f2afbe6d8bb0580d425ed96ab7ae56a7b4251b9ab3e1113a
      • ros-ocp-backend Version: insights-onprem/ros-ocp-backend (main branch, pre-fix)
      • IOP Chart Version: 0.2.0
      • Deployment: On-premise OpenShift deployment

      Description

      The ros-ocp-backend processor service fails to process ALL CSV files uploaded by the Cost Management Operator, resulting in no experiments being created in Kruize and no optimization recommendations being generated.

      Root Cause

      The Cost Management Operator (version 4.2.0) has been updated to include GPU/accelerator metrics support, generating CSV files with 49 columns including:

      • 37 original columns (CPU, memory, container metadata)
      • 12 new columns (11 GPU/accelerator columns plus one additional CPU metric):
        • accelerator_model_name
        • accelerator_profile_name
        • accelerator_core_usage_percentage_{min,max,avg}
        • accelerator_memory_copy_percentage_{min,max,avg}
        • accelerator_frame_buffer_usage_{min,max,avg}
        • cpu_throttle_container_min (additional CPU metric)
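
      For reference, the brace shorthand in the list above can be expanded into concrete column names with a short Go sketch. It assumes each {min,max,avg} suffix is appended directly to the prefix; the exact spelling in the operator's CSV header should be confirmed against a real upload.

```go
package main

import "fmt"

// newColumns expands the shorthand from the issue into the 12 column
// names added on top of the original 37. The expansion rule (prefix +
// suffix with no separator) is an assumption, not taken from the operator.
func newColumns() []string {
	cols := []string{"accelerator_model_name", "accelerator_profile_name"}
	prefixes := []string{
		"accelerator_core_usage_percentage_",
		"accelerator_memory_copy_percentage_",
		"accelerator_frame_buffer_usage_",
	}
	for _, p := range prefixes {
		for _, s := range []string{"min", "max", "avg"} {
			cols = append(cols, p+s)
		}
	}
	return append(cols, "cpu_throttle_container_min")
}

func main() {
	const original = 37
	added := newColumns()
	fmt.Println(len(added), original+len(added)) // prints "12 49"
}
```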

      The ros-ocp-backend processor validation code uses strict exact column matching instead of checking whether all required columns are present. As a result, every 49-column CSV file fails validation with the error:

      CSV file does not have all the required columns
      

      Problematic Code

      File: internal/utils/aggregator.go
      Function: check_if_all_required_columns_in_CSV()

      func check_if_all_required_columns_in_CSV(df dataframe.DataFrame) error {
          all_required_columns := make([]string, 0, len(types.CSVColumnMapping))
          for k := range types.CSVColumnMapping {
              all_required_columns = append(all_required_columns, k)
          }
      
          columns_in_csv := df.Names()
          if !elementsMatch(all_required_columns, columns_in_csv) {  // EXACT MATCH - FAILS WITH EXTRA COLUMNS
              return fmt.Errorf("CSV file does not have all the required columns")
          }
          return nil
      }
      

      elementsMatch() requires the CSV header to contain exactly the 37 expected columns, no more and no fewer. Any extra column, such as the new GPU metrics, causes immediate rejection.
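
      The body of elementsMatch() is not shown in this issue, so the following is only a sketch of the behavior described above. exactMatch is a hypothetical stand-in, and col_a and col_b are placeholder column names rather than real entries from types.CSVColumnMapping.

```go
package main

import (
	"fmt"
	"slices"
)

// exactMatch is a hypothetical stand-in for elementsMatch(): it reports
// whether both slices hold the same names, ignoring order.
func exactMatch(required, actual []string) bool {
	a := slices.Clone(required)
	b := slices.Clone(actual)
	slices.Sort(a)
	slices.Sort(b)
	return slices.Equal(a, b)
}

func main() {
	required := []string{"col_a", "col_b"} // placeholder names

	// Same columns in a different order: accepted.
	fmt.Println(exactMatch(required, []string{"col_b", "col_a"})) // true

	// One extra column: rejected, even though nothing required is missing.
	fmt.Println(exactMatch(required, []string{"col_a", "col_b", "accelerator_model_name"})) // false
}
```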

      Impact

      • Severity: Blocker for production deployments
      • Symptoms:
        • Processor consumes Kafka messages successfully
        • ALL CSV files fail validation (both namespace and container CSVs)
        • NO experiments created in Kruize beyond test data
        • NO optimization recommendations generated
        • Complete failure of Resource Optimization Service functionality
      • User Impact: Resource Optimization Service is completely non-functional - users cannot receive workload optimization recommendations

      Steps to Reproduce

      1. Deploy insights-on-prem (IOP-POC-0.2) using helm chart ros-ocp-0.1.8 with Cost Management Operator 4.2.0 on OpenShift 4.18.26
      2. Enable User Workload Monitoring in OpenShift
      3. Wait for hourly metrics collection from Cost Management Operator
      4. Operator uploads CSV files with 49 columns to ingress service
      5. Monitor ros-ocp-backend processor logs:
        oc logs -n cost-onprem -l app.kubernetes.io/component=processor -c rosocp-processor --tail=50
        
      6. Expected: CSV files processed, experiments created
      7. Actual: All CSV files fail with "CSV file does not have all the required columns" error

      Workaround / Fix Applied

      Upstream Fix Available

      The upstream repository RedHatInsights/ros-ocp-backend has already fixed this issue:

      • Commit: b8f1bfcea1bdd9febb29b795da41fab988f132f7
      • Author: saltgen (Sagnik Dutta)
      • Date: October 15, 2025
      • Title: RHINENG-21378 Ignore additional columns in CSV

      Fix Details

      The fix changes validation from "exact column match" to "contains all required columns":

      func check_if_all_required_columns_in_CSV(df dataframe.DataFrame) error {
          all_required_columns := make([]string, 0, len(types.CSVColumnMapping))
          for k := range types.CSVColumnMapping {
              all_required_columns = append(all_required_columns, k)
          }
      
          columns_in_csv := df.Names()
          if hasMissingColumns(all_required_columns, columns_in_csv) {  // CONTAINS CHECK - ALLOWS EXTRA COLUMNS
              return fmt.Errorf("CSV file does not have all the required columns")
          }
          return nil
      }
      
      func hasMissingColumns(requiredColumns []string, csvColumns []string) bool {
          slices.Sort(requiredColumns)
          for _, reqCol := range requiredColumns {
              if !slices.Contains(csvColumns, reqCol) {
                  log.Warnf("missing columns in CSV: %v", reqCol)
                  return true
              }
          }
          return false
      }
      

      Changes:

      1. Replace elementsMatch() with hasMissingColumns()
      2. Add slices import for Go 1.21+ slice utilities
      3. Check if CSV contains all required columns instead of exact match
      4. Extra columns (GPU metrics) are now ignored successfully
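
      One possible refinement, not part of the upstream commit, is to collect every missing column rather than stopping at the first, which can make the log message more useful when several columns are absent. missingColumns and the placeholder names col_a and col_b below are hypothetical.

```go
package main

import (
	"fmt"
	"slices"
)

// missingColumns is a hypothetical variant of hasMissingColumns() that
// returns every required column absent from the CSV header. Extra
// columns in csvColumns are ignored.
func missingColumns(required, csvColumns []string) []string {
	var missing []string
	for _, col := range required {
		if !slices.Contains(csvColumns, col) {
			missing = append(missing, col)
		}
	}
	return missing
}

func main() {
	required := []string{"col_a", "col_b"} // placeholder names

	// Superset header: nothing is missing, so validation would pass.
	fmt.Println(missingColumns(required, []string{"col_b", "col_a", "accelerator_model_name"})) // []

	// Header lacking col_b: validation would fail and name the column.
	fmt.Println(missingColumns(required, []string{"col_a"})) // [col_b]
}
```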

      Applied to insights-onprem Fork

      Applied fix to chadcrum/ros-ocp-backend:

      • Commit: 28c9d2a
      • Date: November 26, 2025
      • Branch: main

      Additional Fix: Changed error handling from return to continue in internal/services/report_processor.go to ensure both CSV files (namespace and container) are processed even if one fails.
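
      The effect of switching from return to continue can be illustrated with a small sketch. It does not reproduce report_processor.go: readCSV, processAll, and the file names are hypothetical and only model a loop over two CSV files in which one read fails.

```go
package main

import (
	"errors"
	"fmt"
)

// readCSV is a hypothetical stand-in for the processor's CSV reader;
// it fails for any file name listed in broken.
func readCSV(name string, broken map[string]bool) error {
	if broken[name] {
		return errors.New("read failed: " + name)
	}
	return nil
}

// processAll tries every file and returns the names read successfully.
func processAll(files []string, broken map[string]bool) []string {
	var processed []string
	for _, f := range files {
		if err := readCSV(f, broken); err != nil {
			fmt.Println("skipping:", err)
			continue // a return here would drop the remaining files
		}
		processed = append(processed, f)
	}
	return processed
}

func main() {
	files := []string{"namespace.csv", "container.csv"}
	fmt.Println(processAll(files, map[string]bool{"namespace.csv": true}))
}
```

      With continue, the failure on the first file is logged and the second file is still processed.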

      Verification

      After applying fix and deploying updated container image quay.io/chadcrum0/ros-ocp-backend:latest:

      • ✓ CSV files process successfully without column validation errors
      • ✓ 12 new experiments created for real cluster workloads:
        • 9 Deployments (processor, ingress, api, kruize, authorino, recommendation-poller, sources-api, housekeeper, valkey)
        • 3 StatefulSets (ros-db, kruize-db, sources-db)
      • ✓ Kruize receiving workload metrics and generating recommendations
      • ✓ Resource Optimization Service fully functional

      Recommendation for On-Prem Repository

      The insights-onprem/ros-ocp-backend repository needs to pull this fix from upstream RedHatInsights/ros-ocp-backend to ensure compatibility with Cost Management Operator 4.2.0 and future versions.

      Files Modified

      1. internal/utils/aggregator.go
        1. Add slices import
        2. Replace elementsMatch() with hasMissingColumns()
        3. Add hasMissingColumns() helper function
      2. internal/services/report_processor.go
        1. Change return to continue on CSV read errors (lines 72, 81)

      Why This Matters

      • Forward Compatibility: Processor now works with future CSV schema additions
      • Backward Compatibility: Still works with old 37-column CSVs
      • GPU Support Ready: When GPU recommendations are needed, just update types.CSVColumnMapping
      • Production Ready: Eliminates complete service failure when operator schema evolves

      Testing Notes

      The test script cost-onprem-chart/scripts/test-ocp-dataflow-jwt.sh generates the old 37-column CSVs, which is why initial testing appeared successful. The real operator generates 49-column CSVs, so the issue surfaces only in production-like deployments.

      Recommendation: Update test script to generate 49-column CSVs matching current operator schema.

      Related Documentation

      • Upstream Jira: RHINENG-21378

              rh-ee-masayag Moti Asayag
              chadcrum Chad Crum