FlightPath / FLPATH-2908

Missing documentation and automation for enabling OpenShift User Workload Monitoring


      Description

      User Workload Monitoring is a critical prerequisite for the Cost Management On-Premise stack on OpenShift, but it is neither documented in the deployment guides nor enabled by the Helm chart. The result is a *silent failure*: the deployment appears successful, but the data pipeline produces no metrics or recommendations.

      Impact

      • *Severity*: High - Blocks the entire ROS data pipeline from working
      • *User Experience*:
        • Deployment completes successfully with all pods running
        • ServiceMonitors are created, but nothing scrapes the targets they define
        • ROS CSV files are empty or missing data
        • No recommendations are generated
        • Silent failure - everything looks healthy but produces no data
      • *Affected Documentation*:
        • docs/force-operator-upload.md - Contains broken reference to installation.md
        • docs/installation.md - Does not document this prerequisite
        • Helm chart NOTES.txt - Does not warn about this requirement
      • *Installation/Testing Outcome*: Deployment succeeds but data pipeline is non-functional

      Root Cause

      OpenShift User Workload Monitoring must be explicitly enabled before Prometheus will scrape the targets that ServiceMonitors define in user namespaces. The Helm chart deploys the ServiceMonitors successfully, but without user workload monitoring enabled, no Prometheus instance exists to act on them.

      Current State

      • What Exists:
        • ✅ ServiceMonitors are deployed by Helm chart (kruize, rosocp-api, processor, recommendation-poller)
        • ✅ openshift-user-workload-monitoring namespace exists (created automatically by OpenShift)
        • ❌ No prometheus-user-workload pods running in that namespace
        • ❌ No documentation in deployment guides
        • ❌ No automation in Helm chart to enable it
      • Broken Documentation Reference:
        • docs/force-operator-upload.md line 58 says: "User-workload monitoring is enabled (see installation.md)"
        • docs/installation.md does NOT document this prerequisite

      Evidence

      ServiceMonitors Created Successfully

      $ oc get servicemonitors -n cost-onprem
      NAME                                                                 AGE
      cost-onprem-ros-ocp-kruize                                           22m
      cost-onprem-ros-ocp-rosocp-api                                       22m
      cost-onprem-ros-ocp-rosocp-processor                                 22m
      cost-onprem-ros-ocp-rosocp-recommendation-poller                     22m
      

      But No Prometheus Pods to Read Them

      $ oc get pods -n openshift-user-workload-monitoring
      No resources found in openshift-user-workload-monitoring namespace.
      
      $ oc get configmap cluster-monitoring-config -n openshift-monitoring
      Error from server (NotFound): configmaps "cluster-monitoring-config" not found
      

      This proves:

      • ServiceMonitors were deployed by the chart
      • User workload monitoring was never enabled
      • No Prometheus instance exists to scrape the targets the ServiceMonitors define
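The two checks above can be folded into one helper for support scripts. This is a minimal sketch, not part of the chart: the function name `uwm_status` and its three output words are hypothetical, and the match assumes the flag sits on its own line in `config.yaml`.

```shell
#!/bin/sh
# Report whether user workload monitoring is switched on in the cluster.
# Prints "enabled", "disabled" or "not-configured"; returns 0 only for "enabled".
# uwm_status is a hypothetical helper name, not something the chart ships.
uwm_status() {
  # Read the config.yaml key; the dot in the key name must be escaped in jsonpath.
  if ! cfg=$(oc get configmap cluster-monitoring-config \
        -n openshift-monitoring \
        -o 'jsonpath={.data.config\.yaml}' 2>/dev/null); then
    echo "not-configured"   # ConfigMap missing: the state shown in this issue
    return 1
  fi
  # Plain textual match; assumes the flag is written on a line of its own.
  if printf '%s\n' "$cfg" |
     grep -Eq '^[[:space:]]*enableUserWorkload:[[:space:]]*true[[:space:]]*$'; then
    echo "enabled"
    return 0
  fi
  echo "disabled"
  return 1
}
```

Calling `uwm_status` on the cluster described here would be expected to print `not-configured`, since the ConfigMap lookup fails with NotFound.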

      Expected Behavior

      Users should be guided to enable user workload monitoring as part of the deployment process, either through:

      Option 1: Documentation (Minimum Fix)

      • Update docs/installation.md to document this prerequisite
      • Add clear instructions on how to enable it
      • Include verification steps

      Option 2: Helm Chart Automation (Recommended)

      • Helm chart could automatically create the cluster-monitoring-config ConfigMap
      • Template in cost-onprem/templates/monitoring/ directory
      • Conditional on .Values.platform being OpenShift
      • Include in Helm NOTES.txt output to inform users it was enabled

      Example Helm template:

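      {{- if eq (include "cost-onprem.platform.isOpenShift" .) "true" -}}
      # Rendered only when the chart's platform helper reports OpenShift.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        # config.yaml is read by the Cluster Monitoring Operator;
        # this flag starts the user workload Prometheus.
        config.yaml: |
          enableUserWorkload: true
      {{- end }}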
      

      Current Workaround

      Users must manually enable user workload monitoring:

      cat > /tmp/enable-user-workload-monitoring.yaml <<'EOF'
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          enableUserWorkload: true
      EOF
      
      oc apply -f /tmp/enable-user-workload-monitoring.yaml
      
      # Verify
      oc get pods -n openshift-user-workload-monitoring
      # Should show: prometheus-user-workload-0, prometheus-user-workload-1, etc.
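On a cluster where `cluster-monitoring-config` already exists, applying the manifest above as-is would replace whatever settings an administrator stored in it. A more cautious variant of the workaround might look like the sketch below. The function name `enable_uwm` is hypothetical, and refusing to touch an existing ConfigMap is a design assumption, not behaviour the chart or its scripts provide.

```shell
#!/bin/sh
# Enable user workload monitoring without clobbering existing cluster config.
# enable_uwm is a hypothetical helper name used for illustration only.
enable_uwm() {
  ns=openshift-monitoring
  cm=cluster-monitoring-config
  if oc get configmap "$cm" -n "$ns" >/dev/null 2>&1; then
    # An existing ConfigMap may carry other monitoring settings, so leave it
    # alone and ask the administrator to merge the flag by hand.
    echo "ConfigMap $cm already exists; add 'enableUserWorkload: true' with: oc edit configmap $cm -n $ns" >&2
    return 1
  fi
  # Same manifest as the manual workaround, piped straight to oc.
  oc apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
}
```

Running `enable_uwm` on the cluster from this issue would take the create path, because the ConfigMap lookup returns NotFound there.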
      

      Steps to Reproduce

      Deploy Without User Workload Monitoring

      • Deploy RHBK
      • Deploy Strimzi
      • Deploy the cost-onprem chart with JWT_AUTH_ENABLED=true exported
      • Observe: All pods running, ServiceMonitors created
      • Observe: openshift-user-workload-monitoring namespace has no pods
      • Observe: ROS data pipeline produces no metrics

      Proposed Fix

      Documentation Updates

      • Update docs/installation.md:
        Add a new section "Prerequisites for OpenShift" that includes:
        • Enabling user workload monitoring
        • Verification steps
        • Expected resource creation
      • Update docs/force-operator-upload.md:
        Fix the broken reference or provide inline instructions instead of referencing installation.md
      • Update Helm NOTES.txt:
        Add a section for OpenShift deployments warning about this requirement
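The verification steps proposed for docs/installation.md could be backed by a small polling helper, since the Prometheus pods take some time to appear after the ConfigMap is created. The sketch below is one possible shape. The function name `wait_for_uwm_prometheus` and the 300 s / 10 s defaults are assumptions chosen for illustration; only the namespace and the `prometheus-user-workload-` pod name prefix come from this issue.

```shell
#!/bin/sh
# Poll until at least one prometheus-user-workload pod is Running, or time out.
# wait_for_uwm_prometheus and its default timings are illustrative assumptions.
wait_for_uwm_prometheus() {
  timeout_s=${1:-300}   # total seconds to wait
  interval_s=${2:-10}   # seconds between polls
  waited=0
  while :; do
    # --no-headers leaves one line per pod: NAME READY STATUS RESTARTS AGE
    running=$(oc get pods -n openshift-user-workload-monitoring --no-headers 2>/dev/null |
      awk '$1 ~ /^prometheus-user-workload-/ && $3 == "Running" { n++ } END { print n + 0 }')
    if [ "$running" -gt 0 ]; then
      echo "user workload Prometheus is running ($running pod(s))"
      return 0
    fi
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s waiting for prometheus-user-workload pods" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

A call such as `wait_for_uwm_prometheus 600 15` would poll every 15 seconds for up to ten minutes and return non-zero on timeout, which lets an install script stop before deploying the chart.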

      Helm Chart Enhancement (Alternative/Additional)

      Create cost-onprem/templates/monitoring/cluster-monitoring-config.yaml that:

      • Conditionally deploys on OpenShift only
      • Creates the cluster-monitoring-config ConfigMap in openshift-monitoring namespace
      • Sets enableUserWorkload: true in config.yaml
      • Includes appropriate annotations and labels

      Environment Details

      • *Repository*: https://github.com/insights-onprem/cost-onprem-chart
      • *Chart Version*: v0.2.0
      • *Git Commit*: 2ee0206
      • *OpenShift Version*: 4.18.26
      • *Kubernetes Version*: v1.31.13
      • *Deployment Method*: ./scripts/install-helm-chart.sh with JWT_AUTH_ENABLED=true
      • *ServiceMonitors Created*: Yes (4 ServiceMonitors deployed successfully)
      • *User Workload Monitoring Enabled*: No (missing ConfigMap and pods)
      • *Result*: Silent failure - deployment healthy but no data pipeline functionality

              rh-ee-masayag Moti Asayag
              chadcrum Chad Crum