OpenShift Bugs / OCPBUGS-36389

"a required label is missing from the metric" error when sending metrics to Telemetry


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.15.z
    • Component/s: HyperShift
    • Quality / Stability / Reliability
    • Severity: Moderate

      ENV

      OCP 4.15.11
      HyperShift operator

      ISSUE

      The HyperShift operator generates a configuration for sending metrics to Red Hat Telemetry. The code that produces this configuration is visible in [1], and the resulting ConfigMap looks like this:

      $ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -o yaml 
      apiVersion: v1
      data:
        config.yaml: |
          prometheus:
            remoteWrite:
            - authorization:
                credentials:
                  key: token
                  name: telemetry-remote-write
                type: Bearer
              queueConfig:
                batchSendDeadline: 1m
                capacity: 30000
                maxBackoff: 256s
                maxSamplesPerSend: 10000
                minBackoff: 1s
              url: https://infogw.api.openshift.com/metrics/v1/receive
              writeRelabelConfigs:
              - action: keep
                regex: (count:up0|count:up1|cluster_version|cluster_version_available_updates|cluster_version_capability|cluster_operator_up|cluster_operator_conditions|cluster_version_payload|cluster_installer|cluster_infrastructure_provider|cluster_feature_set|instance:etcd_object_counts:sum|ALERTS|code:apiserver_request_total:rate:sum|cluster:capacity_cpu_cores:sum|cluster:capacity_memory_bytes:sum|cluster:cpu_usage_cores:sum|cluster:memory_usage_bytes:sum|openshift:cpu_usage_cores:sum|openshift:memory_usage_bytes:sum|workload:cpu_usage_cores:sum|workload:memory_usage_bytes:sum|cluster:virt_platform_nodes:sum|cluster:node_instance_type_count:sum|cnv:vmi_status_running:count|cluster:vmi_request_cpu_cores:sum|node_role_os_version_machine:cpu_capacity_cores:sum|node_role_os_version_machine:cpu_capacity_sockets:sum|subscription_sync_total|olm_resolution_duration_seconds|csv_succeeded|csv_abnormal|cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum|cluster:kubelet_volume_stats_used_bytes:provisioner:sum|ceph_cluster_total_bytes|ceph_cluster_total_used_raw_bytes|ceph_health_status|odf_system_raw_capacity_total_bytes|odf_system_raw_capacity_used_bytes|odf_system_health_status|job:ceph_osd_metadata:count|job:kube_pv:count|job:odf_system_pvs:count|job:ceph_pools_iops:total|job:ceph_pools_iops_bytes:total|job:ceph_versions_running:count|job:noobaa_total_unhealthy_buckets:sum|job:noobaa_bucket_count:sum|job:noobaa_total_object_count:sum|odf_system_bucket_count|odf_system_objects_total|noobaa_accounts_num|noobaa_total_usage|console_url|cluster:console_auth_login_requests_total:sum|cluster:console_auth_login_successes_total:sum|cluster:console_auth_login_failures_total:sum|cluster:console_auth_logout_requests_total:sum|cluster:console_usage_users:max|cluster:console_plugins_info:max|cluster:console_customization_perspectives_info:max|cluster:ovnkube_master_egress_routing_via_host:max|cluster:network_attachment_definition_instances:max|cluster:network_attachment_definition_enabled_instance_up:max|cluster:ingress_controller_aws_nlb_active:sum|cluster:route_metrics_controller_routes_per_shard:min|cluster:route_metrics_controller_routes_per_shard:max|cluster:route_metrics_controller_routes_per_shard:avg|cluster:route_metrics_controller_routes_per_shard:median|cluster:openshift_route_info:tls_termination:sum|insightsclient_request_send_total|cam_app_workload_migrations|cluster:apiserver_current_inflight_requests:sum:max_over_time:2m|cluster:alertmanager_integrations:max|cluster:telemetry_selected_series:count|openshift:prometheus_tsdb_head_series:sum|openshift:prometheus_tsdb_head_samples_appended_total:sum|monitoring:container_memory_working_set_bytes:sum|namespace_job:scrape_series_added:topk3_sum1h|namespace_job:scrape_samples_post_metric_relabeling:topk3|monitoring:haproxy_server_http_responses_total:sum|profile:cluster_monitoring_operator_collection_profile:max|rhmi_status|status:upgrading:version:rhoam_state:max|state:rhoam_critical_alerts:max|state:rhoam_warning_alerts:max|rhoam_7d_slo_percentile:max|rhoam_7d_slo_remaining_error_budget:max|cluster_legacy_scheduler_policy|cluster_master_schedulable|che_workspace_status|che_workspace_started_total|che_workspace_failure_total|che_workspace_start_time_seconds_sum|che_workspace_start_time_seconds_count|cco_credentials_mode|cluster:kube_persistentvolume_plugin_type_counts:sum|acm_managed_cluster_info|acm_console_page_count:sum|cluster:vsphere_vcenter_info:sum|cluster:vsphere_esxi_version_total:sum|cluster:vsphere_node_hw_version_total:sum|openshift:build_by_strategy:sum|rhods_aggregate_availability|rhods_total_users|instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_bytes:sum|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile|jaeger_operator_instances_storage_types|jaeger_operator_instances_strategies|jaeger_operator_instances_agent_strategies|appsvcs:cores_by_product:sum|nto_custom_profiles:count|openshift_csi_share_configmap|openshift_csi_share_secret|openshift_csi_share_mount_failures_total|openshift_csi_share_mount_requests_total|eo_es_storage_info|eo_es_redundancy_policy_info|eo_es_defined_delete_namespaces_total|eo_es_misconfigured_memory_resources_info|cluster:eo_es_data_nodes_total:max|cluster:eo_es_documents_created_total:sum|cluster:eo_es_documents_deleted_total:sum|pod:eo_es_shards_total:max|eo_es_cluster_management_state_info|imageregistry:imagestreamtags_count:sum|imageregistry:operations_count:sum|log_logging_info|log_collector_error_count_total|log_forwarder_pipeline_info|log_forwarder_input_info|log_forwarder_output_info|cluster:log_collected_bytes_total:sum|cluster:log_logged_bytes_total:sum|cluster:kata_monitor_running_shim_count:sum|platform:hypershift_hostedclusters:max|platform:hypershift_nodepools:max|namespace:noobaa_unhealthy_bucket_claims:max|namespace:noobaa_buckets_claims:max|namespace:noobaa_unhealthy_namespace_resources:max|namespace:noobaa_namespace_resources:max|namespace:noobaa_unhealthy_namespace_buckets:max|namespace:noobaa_namespace_buckets:max|namespace:noobaa_accounts:max|namespace:noobaa_usage:max|namespace:noobaa_system_health_status:max|ocs_advanced_feature_usage|os_image_url_override:sum|cluster:vsphere_topology_tags:max|cluster:vsphere_infrastructure_failure_domains:max|cluster:vsphere_csi_migration:max|apiserver_list_watch_request_success_total:rate:sum|rhacs:telemetry:rox_central_info|rhacs:telemetry:rox_central_secured_clusters|rhacs:telemetry:rox_central_secured_nodes|rhacs:telemetry:rox_central_secured_vcpus|rhacs:telemetry:rox_sensor_info|cluster:volume_manager_selinux_pod_context_mismatch_total|cluster:volume_manager_selinux_volume_context_mismatch_warnings_total|cluster:volume_manager_selinux_volume_context_mismatch_errors_total|cluster:volume_manager_selinux_volumes_admitted_total|ols:provider_model_configuration|ols:rest_api_query_calls_total:2xx|ols:rest_api_query_calls_total:4xx|ols:rest_api_query_calls_total:5xx)
                sourceLabels:
                - __name__
      kind: ConfigMap
      metadata:
        creationTimestamp: "2024-01-10T15:59:54Z"
        name: user-workload-monitoring-config
        namespace: openshift-user-workload-monitoring
        resourceVersion: "1335981238"
        uid: c815d318-e05e-4701-8fb9-12918635d2a4
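
      The remote write configuration above authenticates with a bearer token read from the `telemetry-remote-write` secret (key `token`) in the same namespace. As a quick sanity check that the referenced secret exists and is populated (a minimal sketch; the secret name and key are taken from the config above):

      $ oc -n openshift-user-workload-monitoring get secret telemetry-remote-write -o jsonpath='{.data.token}' | wc -c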
      

      This configuration is introduced by the HyperShift operator, as its logs show:

      $ pod=$(oc -n hypershift get pod -o jsonpath='{.items[0].metadata.name}')
      $ oc -n hypershift logs $pod  |grep "reconciling UWM telemetry"|head -1
      2024-06-19T09:25:18.812388415Z {"level":"info","ts":"2024-06-19T09:25:18Z","msg":"reconciling UWM telemetry","controller":"deployment","controllerGroup":"apps","controllerKind":"Deployment","Deployment":{"name":"operator","namespace":"hypershift"},"namespace":"hypershift","name":"operator","reconcileID":"da400de1-xxxx-xxxx-xxxx-77ea58df5194"}
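
      This reconciliation runs in the HyperShift operator itself. As a hedged check (assuming the operator deployment exposes the `--enable-uwm-telemetry-remote-write` flag, which controls this behavior in recent HyperShift versions), the operator's arguments can be inspected:

      $ oc -n hypershift get deployment operator -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' | grep -i telemetry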
      

      The Prometheus pods then log errors showing that they cannot send the metrics to the Red Hat Telemetry server, with one entry per minute:

      $ oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c prometheus|grep "a required label is missing from the metric"|head -1
      2024-06-20T15:56:33.282092583Z ts=2024-06-20T15:56:33.281Z caller=dedupe.go:112 component=remote level=error remote_name=5a6833 url=https://infogw.api.openshift.com/metrics/v1/receive msg="non-recoverable error" count=1 exemplarCount=0 err="server returned HTTP status 400 Bad Request: a required label is missing from the metric"
      $ oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c prometheus|grep -c "a required label is missing from the metric"
      5477
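
      The impact is also visible in Prometheus' own remote write instrumentation. A minimal sketch, assuming the Prometheus container listens on 9090 inside the pod so a local port-forward can reach it; prometheus_remote_storage_samples_failed_total is the standard Prometheus counter for samples rejected with a non-recoverable error:

      $ oc -n openshift-user-workload-monitoring port-forward prometheus-user-workload-0 9090 &
      $ curl -s 'http://localhost:9090/api/v1/query?query=prometheus_remote_storage_samples_failed_total'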
      
      NOTES

      This error only impacts the metrics sent to Red Hat Telemetry, not business functionality, so it is reasonable to silence the related alert until the issue is fixed. See [2] for how to silence alerts.
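
      Besides the console workflow in [2], a silence can also be created directly against the Alertmanager v2 API. A minimal sketch, assuming a port-forward to the platform Alertmanager pod (which listens on 9093 locally) and that the firing alert is PrometheusRemoteStorageFailures; confirm the actual alert name in the cluster before silencing:

      $ oc -n openshift-monitoring port-forward alertmanager-main-0 9093 &
      $ curl -s -X POST http://localhost:9093/api/v2/silences \
          -H 'Content-Type: application/json' \
          -d '{"matchers":[{"name":"alertname","value":"PrometheusRemoteStorageFailures","isRegex":false}],
               "startsAt":"2024-06-20T16:00:00Z","endsAt":"2024-06-27T16:00:00Z",
               "createdBy":"admin","comment":"OCPBUGS-36389: UWM telemetry remote write rejected"}'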

      Also, if the configmap `user-workload-monitoring-config` is edited to remove the changes made by the HyperShift operator, the operator will revert the edit, as the sketch below illustrates.
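
      For example (a hypothetical sequence; the exact delay depends on the operator's resync interval), blanking the UWM config and re-reading it a couple of minutes later shows the remoteWrite section restored:

      $ oc -n openshift-user-workload-monitoring patch cm user-workload-monitoring-config \
          --type merge -p '{"data":{"config.yaml":""}}'
      $ sleep 120
      $ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config \
          -o jsonpath='{.data.config\.yaml}' | grep -c remoteWrite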

      [1] https://github.com/openshift/hypershift/blob/022368e168946dbf4bd7416589ec644328c7acb7/hypershift-operator/controllers/uwmtelemetry/uwm_telemetry_test.go
      [2] https://docs.openshift.com/container-platform/4.15/observability/monitoring/managing-alerts.html#silencing-alerts_managing-alerts

              Assignee: Cesar Wong (cewong@redhat.com)
              Reporter: Oscar Casal Sanchez (rhn-support-ocasalsa)
              Jan Fajerski
              Votes: 1
              Watchers: 13
