ENV
OCP 4.15.11
hypershift operator
ISSUE
The HyperShift operator generates a configuration for sending metrics to Red Hat Telemetry. The code responsible is visible in [1], and the resulting ConfigMap looks like this:
$ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -o yaml
apiVersion: v1
data:
  config.yaml: |
    prometheus:
      remoteWrite:
      - authorization:
          credentials:
            key: token
            name: telemetry-remote-write
          type: Bearer
        queueConfig:
          batchSendDeadline: 1m
          capacity: 30000
          maxBackoff: 256s
          maxSamplesPerSend: 10000
          minBackoff: 1s
        url: https://infogw.api.openshift.com/metrics/v1/receive
        writeRelabelConfigs:
        - action: keep
          regex: (count:up0|count:up1|cluster_version|cluster_version_available_updates|cluster_version_capability|cluster_operator_up|cluster_operator_conditions|cluster_version_payload|cluster_installer|cluster_infrastructure_provider|cluster_feature_set|instance:etcd_object_counts:sum|ALERTS|code:apiserver_request_total:rate:sum|cluster:capacity_cpu_cores:sum|cluster:capacity_memory_bytes:sum|cluster:cpu_usage_cores:sum|cluster:memory_usage_bytes:sum|openshift:cpu_usage_cores:sum|openshift:memory_usage_bytes:sum|workload:cpu_usage_cores:sum|workload:memory_usage_bytes:sum|cluster:virt_platform_nodes:sum|cluster:node_instance_type_count:sum|cnv:vmi_status_running:count|cluster:vmi_request_cpu_cores:sum|node_role_os_version_machine:cpu_capacity_cores:sum|node_role_os_version_machine:cpu_capacity_sockets:sum|subscription_sync_total|olm_resolution_duration_seconds|csv_succeeded|csv_abnormal|cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum|cluster:kubelet_volume_stats_used_bytes:provisioner:sum|ceph_cluster_total_bytes|ceph_cluster_total_used_raw_bytes|ceph_health_status|odf_system_raw_capacity_total_bytes|odf_system_raw_capacity_used_bytes|odf_system_health_status|job:ceph_osd_metadata:count|job:kube_pv:count|job:odf_system_pvs:count|job:ceph_pools_iops:total|job:ceph_pools_iops_bytes:total|job:ceph_versions_running:count|job:noobaa_total_unhealthy_buckets:sum|job:noobaa_bucket_count:sum|job:noobaa_total_object_count:sum|odf_system_bucket_count|odf_system_objects_total|noobaa_accounts_num|noobaa_total_usage|console_url|cluster:console_auth_login_requests_total:sum|cluster:console_auth_login_successes_total:sum|cluster:console_auth_login_failures_total:sum|cluster:console_auth_logout_requests_total:sum|cluster:console_usage_users:max|cluster:console_plugins_info:max|cluster:console_customization_perspectives_info:max|cluster:ovnkube_master_egress_routing_via_host:max|cluster:network_attachment_definition_instances:max|cluster:network_attachment_definition_enabled_instance_up:max|cluster:ingress_controller_aws_nlb_active:sum|cluster:route_metrics_controller_routes_per_shard:min|cluster:route_metrics_controller_routes_per_shard:max|cluster:route_metrics_controller_routes_per_shard:avg|cluster:route_metrics_controller_routes_per_shard:median|cluster:openshift_route_info:tls_termination:sum|insightsclient_request_send_total|cam_app_workload_migrations|cluster:apiserver_current_inflight_requests:sum:max_over_time:2m|cluster:alertmanager_integrations:max|cluster:telemetry_selected_series:count|openshift:prometheus_tsdb_head_series:sum|openshift:prometheus_tsdb_head_samples_appended_total:sum|monitoring:container_memory_working_set_bytes:sum|namespace_job:scrape_series_added:topk3_sum1h|namespace_job:scrape_samples_post_metric_relabeling:topk3|monitoring:haproxy_server_http_responses_total:sum|profile:cluster_monitoring_operator_collection_profile:max|rhmi_status|status:upgrading:version:rhoam_state:max|state:rhoam_critical_alerts:max|state:rhoam_warning_alerts:max|rhoam_7d_slo_percentile:max|rhoam_7d_slo_remaining_error_budget:max|cluster_legacy_scheduler_policy|cluster_master_schedulable|che_workspace_status|che_workspace_started_total|che_workspace_failure_total|che_workspace_start_time_seconds_sum|che_workspace_start_time_seconds_count|cco_credentials_mode|cluster:kube_persistentvolume_plugin_type_counts:sum|acm_managed_cluster_info|acm_console_page_count:sum|cluster:vsphere_vcenter_info:sum|cluster:vsphere_esxi_version_total:sum|cluster:vsphere_node_hw_version_total:sum|openshift:build_by_strategy:sum|rhods_aggregate_availability|rhods_total_users|instance:etcd_disk_wal_fsync_duration_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_bytes:sum|instance:etcd_network_peer_round_trip_time_seconds:histogram_quantile|instance:etcd_mvcc_db_total_size_in_use_in_bytes:sum|instance:etcd_disk_backend_commit_duration_seconds:histogram_quantile|jaeger_operator_instances_storage_types|jaeger_operator_instances_strategies|jaeger_operator_instances_agent_strategies|appsvcs:cores_by_product:sum|nto_custom_profiles:count|openshift_csi_share_configmap|openshift_csi_share_secret|openshift_csi_share_mount_failures_total|openshift_csi_share_mount_requests_total|eo_es_storage_info|eo_es_redundancy_policy_info|eo_es_defined_delete_namespaces_total|eo_es_misconfigured_memory_resources_info|cluster:eo_es_data_nodes_total:max|cluster:eo_es_documents_created_total:sum|cluster:eo_es_documents_deleted_total:sum|pod:eo_es_shards_total:max|eo_es_cluster_management_state_info|imageregistry:imagestreamtags_count:sum|imageregistry:operations_count:sum|log_logging_info|log_collector_error_count_total|log_forwarder_pipeline_info|log_forwarder_input_info|log_forwarder_output_info|cluster:log_collected_bytes_total:sum|cluster:log_logged_bytes_total:sum|cluster:kata_monitor_running_shim_count:sum|platform:hypershift_hostedclusters:max|platform:hypershift_nodepools:max|namespace:noobaa_unhealthy_bucket_claims:max|namespace:noobaa_buckets_claims:max|namespace:noobaa_unhealthy_namespace_resources:max|namespace:noobaa_namespace_resources:max|namespace:noobaa_unhealthy_namespace_buckets:max|namespace:noobaa_namespace_buckets:max|namespace:noobaa_accounts:max|namespace:noobaa_usage:max|namespace:noobaa_system_health_status:max|ocs_advanced_feature_usage|os_image_url_override:sum|cluster:vsphere_topology_tags:max|cluster:vsphere_infrastructure_failure_domains:max|cluster:vsphere_csi_migration:max|apiserver_list_watch_request_success_total:rate:sum|rhacs:telemetry:rox_central_info|rhacs:telemetry:rox_central_secured_clusters|rhacs:telemetry:rox_central_secured_nodes|rhacs:telemetry:rox_central_secured_vcpus|rhacs:telemetry:rox_sensor_info|cluster:volume_manager_selinux_pod_context_mismatch_total|cluster:volume_manager_selinux_volume_context_mismatch_warnings_total|cluster:volume_manager_selinux_volume_context_mismatch_errors_total|cluster:volume_manager_selinux_volumes_admitted_total|ols:provider_model_configuration|ols:rest_api_query_calls_total:2xx|ols:rest_api_query_calls_total:4xx|ols:rest_api_query_calls_total:5xx)
          sourceLabels:
          - __name__
kind: ConfigMap
metadata:
  creationTimestamp: "2024-01-10T15:59:54Z"
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "1335981238"
  uid: c815d318-e05e-4701-8fb9-12918635d2a4
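The `writeRelabelConfigs` entry above uses `action: keep`, meaning only series whose `__name__` fully matches the regex are forwarded to Telemeter; everything else is dropped before remote write. A minimal Python sketch of that behavior (with an abbreviated stand-in regex; the real rule lists the full set of telemetry metrics shown above):

```python
import re

# Abbreviated stand-in for the telemetry allow-list regex above.
keep_regex = re.compile(
    r"cluster_version|ALERTS|platform:hypershift_hostedclusters:max"
)

def is_forwarded(metric_name: str) -> bool:
    # Prometheus anchors relabel regexes, so this is a full match,
    # not a substring search.
    return keep_regex.fullmatch(metric_name) is not None

print(is_forwarded("cluster_version"))        # True: in the allow-list
print(is_forwarded("my_app_requests_total"))  # False: dropped before remote write
```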
The HyperShift operator logs show when this configuration is reconciled:
$ pod=$(oc -n hypershift get pod -o jsonpath='{.items[0].metadata.name}')
$ oc -n hypershift logs $pod | grep "reconciling UWM telemetry" | head -1
2024-06-19T09:25:18.812388415Z {"level":"info","ts":"2024-06-19T09:25:18Z","msg":"reconciling UWM telemetry","controller":"deployment","controllerGroup":"apps","controllerKind":"Deployment","Deployment":{"name":"operator","namespace":"hypershift"},"namespace":"hypershift","name":"operator","reconcileID":"da400de1-xxxx-xxxx-xxxx-77ea58df5194"}
Errors are visible in the Prometheus pods, which are unable to send the metrics to the Red Hat Telemetry server. The error below is logged once per minute:
$ oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c prometheus | grep "a required label is missing from the metric" | head -1
2024-06-20T15:56:33.282092583Z ts=2024-06-20T15:56:33.281Z caller=dedupe.go:112 component=remote level=error remote_name=5a6833 url=https://infogw.api.openshift.com/metrics/v1/receive msg="non-recoverable error" count=1 exemplarCount=0 err="server returned HTTP status 400 Bad Request: a required label is missing from the metric"
$ oc -n openshift-user-workload-monitoring logs prometheus-user-workload-0 -c prometheus | grep -c "a required label is missing from the metric"
5477
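Note that Prometheus marks HTTP 4xx remote-write failures as "non-recoverable error", so the affected samples are dropped rather than retried (5xx responses would be retried with backoff). A small sketch extracting the HTTP status from the `err=` field of a log line like the one above, to confirm which case applies:

```shell
# Sample remote-write error line taken from the Prometheus log output above.
line='ts=2024-06-20T15:56:33.281Z caller=dedupe.go:112 component=remote level=error remote_name=5a6833 url=https://infogw.api.openshift.com/metrics/v1/receive msg="non-recoverable error" count=1 exemplarCount=0 err="server returned HTTP status 400 Bad Request: a required label is missing from the metric"'

# Extract the HTTP status code from the err= field.
status=$(printf '%s\n' "$line" | grep -o 'HTTP status [0-9]*' | awk '{print $3}')
echo "HTTP status: $status"
```

In practice the same `grep -o 'HTTP status [0-9]*'` can be piped from `oc logs` directly to classify all failures at once.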
NOTES
This error only impacts the metrics sent to Red Hat Telemetry, not business functionality, so it is safe to silence the alert until the issue is fixed. For instructions on silencing alerts, see [2].
Also, if the configmap `user-workload-monitoring-config` is edited to remove the configuration added by the HyperShift operator, the change will be reverted by the HyperShift operator on its next reconciliation.
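As a concrete workaround, the alert raised by the failing remote write can be silenced through the Alertmanager API as well as through the web console described in [2]. A sketch that builds a 24-hour silence payload for `POST /api/v2/silences`; the alert name `PrometheusRemoteWriteBehind` is an assumption (confirm the actually firing alert in the console's Alerting UI), and `createdBy`/`comment` should be adapted:

```shell
# Build a 24h silence payload for Alertmanager's POST /api/v2/silences API.
# The alertname matcher is an assumption; replace it with the alert that is
# actually firing in your cluster.
start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
end=$(date -u -d '+24 hours' +%Y-%m-%dT%H:%M:%SZ)
cat > silence.json <<EOF
{
  "matchers": [
    {"name": "alertname", "value": "PrometheusRemoteWriteBehind", "isRegex": false}
  ],
  "startsAt": "${start}",
  "endsAt": "${end}",
  "createdBy": "support",
  "comment": "UWM remote write to Telemeter fails with HTTP 400 (OCPBUGS-56425); silencing until fixed"
}
EOF
echo "silence payload written: ${start} -> ${end}"
```

The payload can then be POSTed to the Alertmanager route with a bearer token (for example via `curl`); silencing from the web console, as documented in [2], achieves the same result.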
[1] https://github.com/openshift/hypershift/blob/022368e168946dbf4bd7416589ec644328c7acb7/hypershift-operator/controllers/uwmtelemetry/uwm_telemetry_test.go
[2] https://docs.openshift.com/container-platform/4.15/observability/monitoring/managing-alerts.html#silencing-alerts_managing-alerts
Relates to: OCPBUGS-56425 "Empty remoteWrite to telemeter injected by hypershift on user-workload-monitoring-config cm" (Verified)