Bug
Resolution: Unresolved
Undefined
None
v0.2.1
Issue Summary
The Cost Management Metrics Operator (CMMO) enters an infinite failure loop when uploading to the insights-ros-ingress service during initial deployment or when ROS metrics are temporarily unavailable. This occurs because the ingress service has a hard validation requirement for resource_optimization_files in the upload manifest, but CMMO may create tarballs before sufficient ROS metrics have been collected by Prometheus.
Severity
High - Blocks all data uploads (both Cost Management and ROS) until manually resolved.
Affected Components
- Cost Management Metrics Operator (CMMO): Version 4.2.0+
- insights-ros-ingress: All versions (validation introduced in initial commit fd7dfb550 on 2025-09-03)
- Prometheus/ServiceMonitors: User workload monitoring must be enabled
Root Cause
The ingress service has a hard requirement for ROS files in every upload:
File: insights-ros-ingress/internal/upload/payload.go (lines 239-242)
if len(manifest.ResourceOptimizationFiles) == 0 {
    pe.logger.Debug("No ROS files specified in manifest")
    return nil, fmt.Errorf("no ROS files specified in manifest")
}
This validation rejects any upload whose manifest has a missing or empty resource_optimization_files list.
Additional validation in: insights-ros-ingress/internal/upload/handler.go (lines 191-194)
// Validate that we have ROS files to process
if len(extractedPayload.ROSFiles) == 0 {
    return fmt.Errorf("no ROS files found in payload")
}
The Failure Loop Mechanism
- Initial State: Fresh deployment, Prometheus has not yet collected sufficient ROS metrics
- CMMO Creates Tarball: Packages available Cost Management data, but no ROS files exist yet
- Upload Attempt: CMMO uploads to ingress with manifest lacking resource_optimization_files
- Rejection: Ingress returns error: "no ROS files specified in manifest"
- Retry Logic: CMMO queues the tarball for retry
- Loop: CMMO continuously retries the same failed tarball
- Queue Blocks: Newer tarballs (which may contain ROS files) wait in queue, never uploaded
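The loop described above is a case of head-of-line blocking. The sketch below is a simplified model of that behavior, not CMMO's real upload code: the tarball type, the upload stub, and drainOldestFirst are hypothetical names, and the only rule modeled is "always retry the oldest tarball and never skip it".

```go
package main

import "fmt"

// tarball is a hypothetical model of a queued upload.
type tarball struct {
	name   string
	hasROS bool
}

// upload stands in for the ingress call: payloads without ROS files are rejected.
func upload(t tarball) error {
	if !t.hasROS {
		return fmt.Errorf("no ROS files specified in manifest")
	}
	return nil
}

// drainOldestFirst always retries the head of the queue and never skips it.
// It returns the names of the tarballs uploaded within the given attempts.
func drainOldestFirst(queue []tarball, attempts int) []string {
	var uploaded []string
	for i := 0; i < attempts && len(queue) > 0; i++ {
		if err := upload(queue[0]); err != nil {
			continue // head stays in place, so newer tarballs never get a turn
		}
		uploaded = append(uploaded, queue[0].name)
		queue = queue[1:]
	}
	return uploaded
}

func main() {
	queue := []tarball{
		{name: "early-no-ros.tar.gz", hasROS: false},
		{name: "later-with-ros.tar.gz", hasROS: true},
	}
	fmt.Println("uploaded after 100 attempts:", drainOldestFirst(queue, 100))
}
```

In this model the second tarball is valid but is never attempted, regardless of how many retries are allowed.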
Timeline in Typical Deployment
Time 0:00 - Deployment complete, CMMO starts
Time 0:05 - CMMO creates first tarball (NO ROS files - Prometheus hasn't collected enough data)
Time 0:06 - First upload fails: "no ROS files specified in manifest"
Time 15:00 - Prometheus has sufficient ROS metrics
Time 15:05 - CMMO creates tarball WITH ROS files
Time 15:06 - CMMO STILL retrying first tarball (new tarball stuck in queue)
Result: System stuck in failure loop despite having valid data to upload.
Steps to Reproduce
- Deploy complete Cost Management On-Prem stack on fresh OpenShift cluster (RHBK, Strimzi, Helm chart, CMMO)
- Wait 1 hour
- Check ingress logs: oc logs -n cost-onprem deployment/cost-onprem-ingress --tail=100 | grep "no ROS files"
- Observe repeated upload failures for the same tarball
- Check CMMO upload queue: oc exec -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -- ls -lht /tmp/costmanagement-metrics-operator-reports/upload/
- Observe multiple tarballs queued with oldest continuously failing
Expected Behavior
- Early uploads without ROS files should either succeed (processing only Cost Management data) OR
- CMMO should skip failed tarballs after a retry limit and process newer tarballs
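The second option could look roughly like the sketch below. It is one possible shape for a retry limit, not a proposal for CMMO's actual implementation: the tarball type, the upload stub, processQueue, and the maxAttempts value of 3 are all assumptions chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// tarball is a hypothetical queued upload that tracks its own attempt count.
type tarball struct {
	name     string
	hasROS   bool
	attempts int
}

// maxAttempts is an assumed retry limit; the report does not specify a value.
const maxAttempts = 3

// upload stands in for the ingress call and rejects payloads without ROS files.
func upload(t *tarball) error {
	if !t.hasROS {
		return errors.New("no ROS files specified in manifest")
	}
	return nil
}

// processQueue works oldest-first. A tarball that fails maxAttempts times is
// set aside so that newer tarballs are no longer blocked behind it.
func processQueue(queue []*tarball) (uploaded, skipped []string) {
	for len(queue) > 0 {
		head := queue[0]
		head.attempts++
		if err := upload(head); err == nil {
			uploaded = append(uploaded, head.name)
			queue = queue[1:]
			continue
		}
		if head.attempts >= maxAttempts {
			skipped = append(skipped, head.name)
			queue = queue[1:]
		}
	}
	return uploaded, skipped
}

func main() {
	queue := []*tarball{
		{name: "early-no-ros.tar.gz", hasROS: false},
		{name: "later-with-ros.tar.gz", hasROS: true},
	}
	uploaded, skipped := processQueue(queue)
	fmt.Println("uploaded:", uploaded)
	fmt.Println("skipped:", skipped)
}
```

A real implementation would also need to decide what happens to skipped tarballs (for example, moving them to a separate directory), which this sketch leaves open.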
Actual Behavior
- All uploads without ROS files are rejected
- CMMO retries the same failed tarball infinitely
- Newer tarballs with ROS files remain stuck in the queue
- No data reaches the backend until manual intervention
Error Messages
Ingress logs:
ERROR: "failed to extract payload: failed to identify ROS files: no ROS files specified in manifest"
Upload file: 20251209T225321_702017-cost-mgmt.tar.gz
Workaround
Option 1: Delete Old Failed Tarballs (Recommended)
# Identify old tarballs without ROS files
oc exec -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -- \
  ls -lt /tmp/costmanagement-metrics-operator-reports/upload/

# Delete tarballs older than when ROS metrics became available
oc exec -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -- \
  bash -c 'cd /tmp/costmanagement-metrics-operator-reports/upload && rm -f 20251209* 20251210T00*'

# Verify CMMO picks up newer tarball
oc logs -n costmanagement-metrics-operator deployment/costmanagement-metrics-operator -f | grep -i upload
Option 2: Force CMMO Restart
oc delete pod -n costmanagement-metrics-operator -l app=costmanagement-metrics-operator
Proposed Fix
File: insights-ros-ingress/internal/upload/payload.go:239-241
Change from (reject uploads without ROS files):
if len(manifest.ResourceOptimizationFiles) == 0 {
    pe.logger.Debug("No ROS files specified in manifest")
    return nil, fmt.Errorf("no ROS files specified in manifest")
}
Change to (allow uploads with or without ROS files):
if len(manifest.ResourceOptimizationFiles) == 0 {
    pe.logger.Info("No ROS files in upload - processing Cost Management data only")
    return make(map[string]string), nil
}
pe.logger.WithField("ros_files_count", len(manifest.ResourceOptimizationFiles)).
    Info("Found ROS files in payload")
Also update: insights-ros-ingress/internal/upload/handler.go:191-194 to skip ROS processing when no files present instead of returning error.
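A rough sketch of what the handler-side change might look like follows. The extractedPayload type, its ROSFiles map type, processROSFiles, and handleExtracted are assumptions made for illustration; the real handler's types and control flow may differ.

```go
package main

import "fmt"

// extractedPayload is a hypothetical stand-in for the handler's extracted
// payload. The map type (file name -> extracted path) is assumed.
type extractedPayload struct {
	ROSFiles map[string]string
}

// processROSFiles represents whatever downstream ROS processing the handler
// performs; here it only prints the file names.
func processROSFiles(files map[string]string) error {
	for name := range files {
		fmt.Println("processing ROS file:", name)
	}
	return nil
}

// handleExtracted returns the number of ROS files processed. An empty payload
// is treated as a successful no-op instead of an error.
func handleExtracted(p extractedPayload) (int, error) {
	if len(p.ROSFiles) == 0 {
		// Previously: return fmt.Errorf("no ROS files found in payload")
		return 0, nil
	}
	if err := processROSFiles(p.ROSFiles); err != nil {
		return 0, fmt.Errorf("failed to process ROS files: %w", err)
	}
	return len(p.ROSFiles), nil
}

func main() {
	n, err := handleExtracted(extractedPayload{})
	fmt.Println("without ROS files:", n, err)

	n, err = handleExtracted(extractedPayload{
		ROSFiles: map[string]string{"ros.csv": "/tmp/extracted/ros.csv"},
	})
	fmt.Println("with ROS files:", n, err)
}
```

Returning a count alongside the error is a design choice for this sketch only; it lets callers and tests distinguish "nothing to do" from "processed N files" without relying on logs.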
Impact of Fix
- ✓ Eliminates failure loop completely
- ✓ Allows Cost Management data uploads even without ROS data
- ✓ ROS data uploads when available
- ✓ Minimal code changes
- ⚠ Changes ingress API contract (uploads without ROS files now succeed)
Version Information
- Helm Chart: cost-onprem-0.2.1
- insights-ros-ingress: All versions (issue introduced in commit fd7dfb550c9dc14f2221aa2176c62f6dc63d08ab on 2025-09-03 by Moti Asayag)
- CMMO: 4.2.0+
- Repository: https://github.com/RedHatInsights/insights-ros-ingress
- Affected Files:
  - internal/upload/payload.go (lines 239-241)
  - internal/upload/handler.go (lines 191-194)
Related Documentation
- Issue first identified during RHBK + JWT authentication deployment on OpenShift