FlightPath / FLPATH-2955

CMMO Upload Failure Loop: insights-ros-ingress Rejects Uploads Without ROS Files During Initial Deployment


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • v0.2.1
    • insights-on-prem

      Issue Summary

      The Cost Management Metrics Operator (CMMO) enters an infinite failure loop when uploading to the insights-ros-ingress service during initial deployment or when ROS metrics are temporarily unavailable. This occurs because the ingress service has a hard validation requirement for resource_optimization_files in the upload manifest, but CMMO may create tarballs before sufficient ROS metrics have been collected by Prometheus.

      Severity

      High - Blocks all data uploads (both Cost Management and ROS) until manually resolved.

      Affected Components

      • Cost Management Metrics Operator (CMMO): Version 4.2.0+
      • insights-ros-ingress: All versions (validation introduced in initial commit fd7dfb550 on 2025-09-03)
      • Prometheus/ServiceMonitors: User workload monitoring must be enabled

      Root Cause

      The ingress service has a hard requirement for ROS files in every upload:

      File: insights-ros-ingress/internal/upload/payload.go (lines 239-242)

      if len(manifest.ResourceOptimizationFiles) == 0 {
          pe.logger.Debug("No ROS files specified in manifest")
          return nil, fmt.Errorf("no ROS files specified in manifest")
      }
      

      This validation rejects any upload whose manifest omits the resource_optimization_files field or leaves it empty.
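
      The length check can be illustrated with a minimal sketch. The Manifest type below is a hypothetical, trimmed-down stand-in: only the resource_optimization_files key and the ResourceOptimizationFiles field name come from this issue, and the []string type is an assumption. Under that assumption, a missing key, a null value, and an empty list all decode to a length of 0, so all three would hit the rejection branch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Manifest is a hypothetical stand-in for the ingress manifest type.
// The []string element type is an assumption, not taken from the repository.
type Manifest struct {
	ResourceOptimizationFiles []string `json:"resource_optimization_files"`
}

// rosFileCount decodes a manifest and returns the length that the
// validation compares against zero.
func rosFileCount(raw string) (int, error) {
	var m Manifest
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return 0, err
	}
	return len(m.ResourceOptimizationFiles), nil
}

func main() {
	manifests := []string{
		`{}`, // key missing
		`{"resource_optimization_files": null}`,        // key present, null
		`{"resource_optimization_files": []}`,          // key present, empty list
		`{"resource_optimization_files": ["ros.csv"]}`, // one ROS file (made-up name)
	}
	for _, raw := range manifests {
		n, err := rosFileCount(raw)
		fmt.Printf("%-48s count=%d err=%v\n", raw, n, err)
	}
}
```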

      Additional validation in: insights-ros-ingress/internal/upload/handler.go (lines 191-194)

      // Validate that we have ROS files to process
      if len(extractedPayload.ROSFiles) == 0 {
          return fmt.Errorf("no ROS files found in payload")
      }
      

      The Failure Loop Mechanism

      1. Initial State: Fresh deployment, Prometheus has not yet collected sufficient ROS metrics
      2. CMMO Creates Tarball: Packages available Cost Management data, but no ROS files exist yet
      3. Upload Attempt: CMMO uploads to ingress with manifest lacking resource_optimization_files
      4. Rejection: Ingress returns error: "no ROS files specified in manifest"
      5. Retry Logic: CMMO queues the tarball for retry
      6. Loop: CMMO continuously retries the same failed tarball
      7. Queue Blocks: Newer tarballs (which may contain ROS files) wait in queue, never uploaded
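
      The loop above can be modeled with a short simulation. This is a sketch only: it assumes CMMO behaves like a strict first-in, first-out queue that always retries the oldest tarball, which is inferred from the observed behavior rather than from the operator's source. The tarball type, upload, and drainFIFO are hypothetical names.

```go
package main

import "fmt"

// tarball is a hypothetical model of a queued upload. hasROS mirrors whether
// the manifest lists any resource_optimization_files.
type tarball struct {
	name   string
	hasROS bool
}

// upload mimics the ingress validation: tarballs without ROS files are rejected.
func upload(t tarball) error {
	if !t.hasROS {
		return fmt.Errorf("no ROS files specified in manifest")
	}
	return nil
}

// drainFIFO always retries the head of the queue and never skips it.
// It returns the names uploaded within maxAttempts upload attempts.
func drainFIFO(queue []tarball, maxAttempts int) []string {
	var uploaded []string
	for attempt := 0; attempt < maxAttempts && len(queue) > 0; attempt++ {
		if err := upload(queue[0]); err != nil {
			continue // head stays in place, everything behind it waits
		}
		uploaded = append(uploaded, queue[0].name)
		queue = queue[1:]
	}
	return uploaded
}

func main() {
	queue := []tarball{
		{name: "first-no-ros.tar.gz", hasROS: false},
		{name: "later-with-ros.tar.gz", hasROS: true},
	}
	uploaded := drainFIFO(queue, 100)
	fmt.Printf("uploaded %d of %d tarballs\n", len(uploaded), len(queue))
}
```

      In this model the valid tarball is never attempted, no matter how large maxAttempts is, because the rejected one never leaves the head of the queue.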

      Timeline in Typical Deployment

      Time 0:00  - Deployment complete, CMMO starts
      Time 0:05  - CMMO creates first tarball (NO ROS files - Prometheus hasn't collected enough data)
      Time 0:06  - First upload fails: "no ROS files specified in manifest"
      Time 15:00 - Prometheus has sufficient ROS metrics
      Time 15:05 - CMMO creates tarball WITH ROS files
      Time 15:06 - CMMO STILL retrying first tarball (new tarball stuck in queue)
      

      Result: System stuck in failure loop despite having valid data to upload.

      Steps to Reproduce

      1. Deploy complete Cost Management On-Prem stack on fresh OpenShift cluster (RHBK, Strimzi, Helm chart, CMMO)
      2. Wait 1 hour
      3. Check ingress logs: oc logs -n cost-onprem deployment/cost-onprem-ingress --tail=100 | grep "no ROS files"
      4. Observe repeated upload failures for the same tarball
      5. Check CMMO upload queue: oc exec -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -- ls -lht /tmp/costmanagement-metrics-operator-reports/upload/
      6. Observe multiple tarballs queued with oldest continuously failing

      Expected Behavior

      • Early uploads without ROS files should either succeed (processing only Cost Management data) OR
      • CMMO should skip failed tarballs after a retry limit and process newer tarballs
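
      The second option, a retry limit, could look roughly like the sketch below. It is not CMMO code: report, send, and drainWithLimit are hypothetical names, and the limit of 3 is an arbitrary example value. Each report gets a bounded number of attempts and is then set aside so later reports are still tried.

```go
package main

import (
	"errors"
	"fmt"
)

// report is a hypothetical model of a queued tarball.
type report struct {
	name   string
	hasROS bool
}

// send mimics the current ingress behavior: no ROS files means rejection.
func send(r report) error {
	if !r.hasROS {
		return errors.New("no ROS files specified in manifest")
	}
	return nil
}

// drainWithLimit tries each report up to retryLimit times, then sets it
// aside and moves on so newer reports are not blocked behind it.
func drainWithLimit(queue []report, retryLimit int) (uploaded, skipped []string) {
	for _, r := range queue {
		ok := false
		for attempt := 0; attempt < retryLimit; attempt++ {
			if send(r) == nil {
				ok = true
				break
			}
		}
		if ok {
			uploaded = append(uploaded, r.name)
		} else {
			skipped = append(skipped, r.name)
		}
	}
	return uploaded, skipped
}

func main() {
	queue := []report{
		{name: "first-no-ros.tar.gz", hasROS: false},
		{name: "later-with-ros.tar.gz", hasROS: true},
	}
	uploaded, skipped := drainWithLimit(queue, 3)
	fmt.Println("uploaded:", uploaded)
	fmt.Println("skipped: ", skipped)
}
```

      Whether skipped reports should be deleted, parked for a later retry, or surfaced in the operator status is a design decision this sketch leaves open.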

      Actual Behavior

      • All uploads without ROS files are rejected
      • CMMO retries the same failed tarball infinitely
      • Newer tarballs with ROS files remain stuck in the queue
      • No data reaches the backend until manual intervention

      Error Messages

      Ingress logs:

      ERROR: "failed to extract payload: failed to identify ROS files: no ROS files specified in manifest"
      Upload file: 20251209T225321_702017-cost-mgmt.tar.gz
      

      Workaround

      Option 1: Delete Old Failed Tarballs (Recommended)

      # Identify old tarballs without ROS files
      oc exec -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -- \
        ls -lt /tmp/costmanagement-metrics-operator-reports/upload/
      
      # Delete tarballs created before ROS metrics became available (adjust the filename patterns to your own timestamps)
      oc exec -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -- \
        bash -c 'cd /tmp/costmanagement-metrics-operator-reports/upload && rm -f 20251209* 20251210T00*'
      
      # Verify CMMO picks up newer tarball
      oc logs -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -f | grep -i upload
      

      Option 2: Force CMMO Restart

      oc delete pod -n costmanagement-metrics-operator -l app=costmanagement-metrics-operator
      

      Proposed Fix

      File: insights-ros-ingress/internal/upload/payload.go:239-242

      Change from (reject uploads without ROS files):

      if len(manifest.ResourceOptimizationFiles) == 0 {
          pe.logger.Debug("No ROS files specified in manifest")
          return nil, fmt.Errorf("no ROS files specified in manifest")
      }
      

      Change to (allow uploads with or without ROS files):

      if len(manifest.ResourceOptimizationFiles) == 0 {
          // No ROS files: return an empty map so the caller can continue with Cost Management data only
          pe.logger.Info("No ROS files in upload - processing Cost Management data only")
          return make(map[string]string), nil
      }
      pe.logger.WithField("ros_files_count", len(manifest.ResourceOptimizationFiles)).
          Info("Found ROS files in payload")
      

      Also update: insights-ros-ingress/internal/upload/handler.go:191-194 to skip ROS processing when no ROS files are present, instead of returning an error.
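
      A minimal sketch of what the handler-side change might look like follows. It is not the actual handler code: extractedPayload, processROS, and the handle callback are hypothetical, and the map[string]string type for ROSFiles is an assumption based on the return value in the proposed payload.go change. The point is only that an empty payload becomes a no-op rather than an error.

```go
package main

import "fmt"

// extractedPayload is a hypothetical stand-in. Only the ROSFiles field name
// appears in this issue; its map[string]string type is assumed.
type extractedPayload struct {
	ROSFiles map[string]string
}

// processROS returns how many ROS files were handled. With no ROS files it
// returns (0, nil) instead of a "no ROS files found in payload" error.
func processROS(p *extractedPayload, handle func(name, path string) error) (int, error) {
	if len(p.ROSFiles) == 0 {
		return 0, nil // nothing to do, let the rest of the upload proceed
	}
	processed := 0
	for name, path := range p.ROSFiles {
		if err := handle(name, path); err != nil {
			return processed, fmt.Errorf("processing %s: %w", name, err)
		}
		processed++
	}
	return processed, nil
}

func main() {
	noop := func(name, path string) error { return nil }

	n, err := processROS(&extractedPayload{}, noop)
	fmt.Println("empty payload:", n, err)

	n, err = processROS(&extractedPayload{
		ROSFiles: map[string]string{"ros.csv": "/tmp/ros.csv"}, // made-up file
	}, noop)
	fmt.Println("one ROS file: ", n, err)
}
```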

      Impact of Fix

      • ✓ Eliminates the failure loop caused by uploads without ROS files
      • ✓ Allows Cost Management data uploads even without ROS data
      • ✓ ROS data is still processed whenever it is present
      • ✓ Minimal code changes
      • ⚠ Changes ingress API contract (uploads without ROS files now succeed)

      Version Information

      • Helm Chart: cost-onprem-0.2.1
      • insights-ros-ingress: All versions (issue introduced in commit fd7dfb550c9dc14f2221aa2176c62f6dc63d08ab on 2025-09-03 by Moti Asayag)
      • CMMO: 4.2.0+
      • Repository: https://github.com/RedHatInsights/insights-ros-ingress
      • Affected Files:
        • internal/upload/payload.go (lines 239-242)
        • internal/upload/handler.go (lines 191-194)

      Related Documentation

      • Issue first identified during RHBK + JWT authentication deployment on OpenShift

      Assignee: Unassigned
      Reporter: Chad Crum (chadcrum)