-
Bug
-
Resolution: Done
-
Quality / Stability / Reliability
We are scaling our large cluster of 180 bare-metal (BM) worker nodes to 50K+ VMs. At around 40K VMs the metric-collector hits a couple of issues. Environment: ACM 2.15 / OCP 4.20.
In the log I see:
level=error caller=logger.go:60 ts=2025-11-25T15:29:35.232775732Z shard=0 component=forwarder component=metricsclient msg="error reading body" err="the incoming sample data is too long"
I mitigated the above issue by adding more resources to the pod, which worked. I suspect it was caused by the /federate response being larger than the metric-collector could buffer.
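To illustrate the suspected failure mode, here is a minimal Python sketch of a size-capped body read. It is not the metric-collector's code: `read_limited`, `BodyTooLongError`, and the limit value are hypothetical names chosen for illustration, and the actual buffering logic in the collector may differ.

```python
import io


class BodyTooLongError(Exception):
    """Raised when a response body exceeds the configured byte limit."""


def read_limited(stream, limit_bytes):
    """Read a body from `stream`, refusing anything larger than `limit_bytes`.

    Hypothetical helper: the name and the limit are assumptions, not taken
    from the metric-collector source.
    """
    # Read one byte past the limit so an over-sized body can be detected
    # without buffering the whole response.
    data = stream.read(limit_bytes + 1)
    if len(data) > limit_bytes:
        raise BodyTooLongError(f"body exceeds {limit_bytes} bytes")
    return data


if __name__ == "__main__":
    small = io.BytesIO(b"up 1\n")
    print(read_limited(small, 1024))
```

With a cap like this, a /federate response that grows with the number of VMs would eventually cross the limit and fail on every scrape, which would be consistent with the error appearing only at scale.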
I then hit a second issue (log excerpt below). I assume the /federate request itself is too expensive: Prometheus has to evaluate a very large number of match[] filters against millions of time series, so the request exceeds the allowed time window and fails with the timeout "context deadline exceeded".
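One way to test this assumption would be to split the match[] selectors into smaller batches and issue several /federate requests instead of one. The Python sketch below only builds the URLs. `chunk`, `federate_urls`, the base URL, and the batch size are hypothetical, and whether the metric-collector can be configured to work this way is not something this ticket confirms.

```python
from urllib.parse import urlencode


def chunk(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def federate_urls(base_url, selectors, batch_size):
    """Build one /federate URL per batch of match[] selectors.

    Hypothetical helper for experimentation; not part of the collector.
    """
    urls = []
    for batch in chunk(selectors, batch_size):
        # urlencode percent-encodes "match[]" as "match%5B%5D", the same
        # form that appears in the failing request.
        query = urlencode([("match[]", selector) for selector in batch])
        urls.append(f"{base_url}/federate?{query}")
    return urls


if __name__ == "__main__":
    selectors = ['{__name__="up"}', '{__name__="ALERTS"}', '{__name__="kube_pod_info"}']
    for url in federate_urls("https://prometheus.example.invalid:9091", selectors, 2):
        print(url)
```

Timing each batch separately (for example with curl against the same endpoint) should show whether a few high-cardinality selectors dominate the request time or whether the cost is spread evenly.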
component=forwarder/worker msg="unable to forward results" err="Get "https://prometheus-k8s.openshift-monitoring.svc:9091/federate?match%5B%5D=%7B_name%3D%22%3Anode_memory_MemAvailable_bytes%3Asum%22%7D&match%5B%5D=%7Bname%3D%22ALERTS%22%7D&match%5B%5D=%7Bname%3D%22acm_managed_cluster_labels%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Acluster%3Acpu_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Acluster%3Acpu_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Acluster%3Acpu_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Acluster%3Amemory_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Acluster%3Amemory_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Acluster%3Amemory_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Acpu_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Acpu_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Acpu_request_hard%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Acpu_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Amemory_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Amemory_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Amemory_request_hard%22%7D&match%5B%5D=%7Bname%3D%22acm_rs%3Anamespace%3Amemory_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Acluster%3Acpu_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Acluster%3Acpu_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Acluster%3Acpu_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Acluster%3Amemory_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Acluster%3Amemory_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Acluster%3Amemory_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Anamespace%3Acpu_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Anamespace%3Acpu_request%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Anamespace%3Acpu_usage%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Anamespace%3Amemory_recommendation%22%7D&match%5B%5D=%7Bname%3D%22acm_rs_vm%3Anamespace%3Amemory_request%22%7D&matc
h%5B%5D=%7Bname%3D%22acm_rs_vm%3Anamespace%3Amemory_usage%22%7D&match%5B%5D=%7Bname%3D%22authenticated_user_requests%22%7D&match%5B%5D=%7Bname%3D%22authentication_attempts%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Acapacity_cpu_cores%3Asum%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Acapacity_memory_bytes%3Asum%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Acontainer_cpu_usage%3Aratio%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Acontainer_spec_cpu_shares%3Aratio%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Acpu_usage_cores%3Asum%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Amemory_usage%3Aratio%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Amemory_usage_bytes%3Asum%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Anode_cpu%3Aratio%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Apolicy_governance_info%3Apropagated_count%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Apolicy_governance_info%3Apropagated_noncompliant_count%22%7D&match%5B%5D=%7Bname%3D%22cluster%3Ausage%3Aresources%3Asum%22%7D&match%5B%5D=%7Bname%3D%22cluster_health_components_map%22%7D&match%5B%5D=%7Bname%3D%22cluster_infrastructure_provider%22%7D&match%5B%5D=%7Bname%3D%22cluster_operator_conditions%22%7D&match%5B%5D=%7Bname%3D%22cluster_operator_up%22%7D&match%5B%5D=%7Bname%3D%22cluster_policy_governance_info%22%7D&match%5B%5D=%7Bname%3D%22cluster_version%22%7D&match%5B%5D=%7Bname%3D%22cluster_version_payload%22%7D&match%5B%5D=%7Bname%3D%22cnv%3Avmi_status_running%3Acount%22%7D&match%5B%5D=%7Bname%3D%22console_url%22%7D&match%5B%5D=%7Bname%3D%22container_cpu_cfs_periods_total%22%7D&match%5B%5D=%7Bname%3D%22container_cpu_cfs_throttled_periods_total%22%7D&match%5B%5D=%7Bname%3D%22container_memory_cache%22%2Ccontainer%21%3D%22%22%7D&match%5B%5D=%7Bname%3D%22container_memory_rss%22%2Ccontainer%21%3D%22%22%7D&match%5B%5D=%7Bname%3D%22container_memory_swap%22%2Ccontainer%21%3D%22%22%7D&match%5B%5D=%7Bname%3D%22container_memory_working_set_bytes%22%2Ccontainer%21%3D%22%22%7D&match%5B%5D=%7Bname%3D%22container_spec_cpu_quota%22%7D&match%5B%5D=%7Bname%3D%22coredns
_dns_request_duration_seconds_sum%22%7D&match%5B%5D=%7Bname%3D%22coredns_dns_requests_total%22%7D&match%5B%5D=%7Bname%3D%22coredns_forward_responses_total%22%7D&match%5B%5D=%7Bname%3D%22csv_abnormal%22%7D&match%5B%5D=%7Bname%3D%22csv_succeeded%22%7D&match%5B%5D=%7Bname%3D%22descheduler%3Aaverageworkersutilization%3Acpu%3Aavg1m%22%7D&match%5B%5D=%7Bname%3D%22descheduler%3Anodepressure%3Acpu%3Aavg1m%22%7D&match%5B%5D=%7Bname%3D%22descheduler%3Anodeutilization%3Acpu%3Aavg1m%22%7D&match%5B%5D=%7Bname%3D%22etcd_debugging_mvcc_db_total_size_in_bytes%22%7D&match%5B%5D=%7Bname%3D%22etcd_debugging_snap_save_total_duration_seconds_sum%22%7D&match%5B%5D=%7Bname%3D%22etcd_disk_backend_commit_duration_seconds_bucket%22%7D&match%5B%5D=%7Bname%3D%22etcd_disk_backend_commit_duration_seconds_sum%22%7D&match%5B%5D=%7Bname%3D%22etcd_disk_wal_fsync_duration_seconds_bucket%22%7D&match%5B%5D=%7Bname%3D%22etcd_disk_wal_fsync_duration_seconds_sum%22%7D&match%5B%5D=%7Bname%3D%22etcd_mvcc_db_total_size_in_bytes%22%7D&match%5B%5D=%7Bname%3D%22etcd_network_client_grpc_received_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_network_client_grpc_sent_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_network_peer_received_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_network_peer_sent_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_object_counts%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_client_requests_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_has_leader%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_health_failures%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_leader_changes_seen_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_proposals_applied_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_proposals_committed_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_proposals_failed_total%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_proposals_pending%22%7D&match%5B%5D=%7Bname%3D%22etcd_server_quota_backend_bytes%22%7D&match%5B%5D=%7Bname%3D%22go_goroutines%22%2Cjob%3D%22apiserver%22%7D&match%5B%5D=%7B
name%3D%22grpc_server_started_total%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_connection_errors_total%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_connections_total%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_current_queue%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_http_average_response_latency_milliseconds%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_max_sessions%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_response_errors_total%22%7D&match%5B%5D=%7Bname%3D%22haproxy_backend_up%22%7D&match%5B%5D=%7Bname%3D%22http_requests_total%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_cpu_utilisation%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_filesystem_usage%3Asum%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_load1_per_cpu%3Aratio%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_memory_utilisation%3Aratio%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_network_receive_bytes_excluding_lo%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_network_receive_drop_excluding_lo%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_network_transmit_bytes_excluding_lo%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_network_transmit_drop_excluding_lo%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_num_cpu%3Asum%22%7D&match%5B%5D=%7Bname%3D%22instance%3Anode_vmstat_pgmajfault%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance_device%3Anode_disk_io_time_seconds%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22instance_device%3Anode_disk_io_time_weighted_seconds%3Arate1m%22%7D&match%5B%5D=%7Bname%3D%22kube_daemonset_status_desired_number_scheduled%22%7D&match%5B%5D=%7Bname%3D%22kube_daemonset_status_number_unavailable%22%7D&match%5B%5D=%7Bname%3D%22kube_node_labels%22%7D&match%5B%5D=%7Bname%3D%22kube_node_role%22%7D&match%5B%5D=%7Bname%3D%22kube_node_spec_unschedulable%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_allocatable%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_allocatable_cpu_cores%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_allo
catable_memory_bytes%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_capacity%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_capacity_cpu_cores%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_capacity_pods%22%7D&match%5B%5D=%7Bname%3D%22kube_node_status_condition%22%7D&match%5B%5D=%7Bname%3D%22kube_persistentvolume_status_phase%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_container_resource_limits%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_container_resource_limits_cpu_cores%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_container_resource_limits_memory_bytes%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_container_resource_requests%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_container_resource_requests_cpu_cores%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_container_resource_requests_memory_bytes%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_info%22%7D&match%5B%5D=%7Bname%3D%22kube_pod_owner%22%7D&match%5B%5D=%7Bname%3D%22kube_resourcequota%22%7D&match%5B%5D=%7Bname%3D%22kubelet_running_container_count%22%7D&match%5B%5D=%7Bname%3D%22kubelet_runtime_operations%22%7D&match%5B%5D=%7Bname%3D%22kubelet_runtime_operations_duration_seconds_sum%22%7D&match%5B%5D=%7Bname%3D%22kubelet_volume_stats_available_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubelet_volume_stats_capacity_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_hco_system_health_status%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_hyperconverged_operator_health_status%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_cpu_usage_seconds_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_create_date_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_disk_allocated_size_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_error_status_last_transition_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_info%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_migrating_status_last_transition_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_non_running_status_last_transition_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_resource_requests%22%7
D&match%5B%5D=%7Bname%3D%22kubevirt_vm_running_status_last_transition_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vm_starting_status_last_transition_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_cpu_usage_seconds_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_filesystem_capacity_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_filesystem_used_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_info%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_memory_available_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_memory_cached_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_memory_swap_in_traffic_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_memory_swap_out_traffic_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_memory_unused_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_memory_used_bytes%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_migration_end_time_seconds%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_migration_succeeded%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_network_receive_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_network_receive_packets_dropped_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_network_receive_packets_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_network_transmit_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_network_transmit_packets_dropped_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_network_transmit_packets_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_phase_count%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_status_addresses%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_storage_iops_read_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_storage_iops_write_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_storage_read_traffic_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_storage_write_traffic_bytes_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_vcpu_delay_seconds_total%22%7D&match%5B%5D=%7Bname%3D%22kubevirt_vmi_vcpu_wait_seconds_total%22%7D&match%5B%5D=%7Bn
ame%3D%22kubevirt_vmsnapshot_succeeded_timestamp_seconds%22%7D&match%5B%5D=%7Bname%3D%22machine_cpu_cores%22%7D&match%5B%5D=%7Bname%3D%22machine_memory_bytes%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_available_hosted_clusters_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_available_hosted_control_planes_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_average_qps_based_hcp_capacity_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_deleted_hosted_clusters_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_high_qps_based_hcp_capacity_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_hosted_control_planes_status_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_hypershift_operator_degraded_bool%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_low_qps_based_hcp_capacity_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_medium_qps_based_hcp_capacity_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_qps_based_hcp_capacity_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_qps_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_request_based_hcp_capacity_current_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_request_based_hcp_capacity_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_total_hosted_control_planes_gauge%22%7D&match%5B%5D=%7Bname%3D%22mce_hs_addon_worker_node_resource_capacities_gauge%22%7D&match%5B%5D=%7Bname%3D%22mixin_pod_workload%22%7D&match%5B%5D=%7Bname%3D%22namespace%3Acontainer_memory_usage_bytes%3Asum%22%7D&match%5B%5D=%7Bname%3D%22namespace%3Akube_pod_container_resource_requests_cpu_cores%3Asum%22%7D&match%5B%5D=%7Bname%3D%22namespace_cpu%3Akube_pod_container_resource_requests%3Asum%22%7D&match%5B%5D=%7Bname%3D%22namespace_memory%3Akube_pod_container_resource_requests%3Asum%22%7D&match%5B%5D=%7Bname%3D%22namespace_workload_pod%3Akube_pod_owner%3Arelabel%22%7D&match%5B%5D=%7Bname%3D%22node_cpu_seconds_total%22%7D&match%5B%5D=%7Bname%3D%22node_cpu_seconds_total%22%7D&match%5B%5D=%7Bname%3D%22node_filesystem_avail_bytes%22%7D&match%5B%5D=%7Bname%3D%22no
de_filesystem_free_bytes%22%7D&match%5B%5D=%7Bname%3D%22node_filesystem_size_bytes%22%7D&match%5B%5D=%7Bname%3D%22node_memory_MemAvailable_bytes%22%7D&match%5B%5D=%7Bname%3D%22node_memory_MemTotal_bytes%22%7D&match%5B%5D=%7Bname%3D%22node_memory_MemTotal_bytes%22%7D&match%5B%5D=%7Bname%3D%22node_namespace_pod_container%3Acontainer_cpu_usage_seconds_total%3Asum_irate%22%7D&match%5B%5D=%7Bname%3D%22node_namespace_pod_container%3Acontainer_cpu_usage_seconds_total%3Asum_rate%22%7D&match%5B%5D=%7Bname%3D%22node_netstat_TcpExt_TCPSynRetrans%22%7D&match%5B%5D=%7Bname%3D%22node_netstat_Tcp_OutSegs%22%7D&match%5B%5D=%7Bname%3D%22node_netstat_Tcp_RetransSegs%22%7D&match%5B%5D=%7Bname%3D%22policy%3Apolicy_governance_info%3Apropagated_count%22%7D&match%5B%5D=%7Bname%3D%22policy%3Apolicy_governance_info%3Apropagated_noncompliant_count%22%7D&match%5B%5D=%7Bname%3D%22policyreport_info%22%7D&match%5B%5D=%7Bname%3D%22process_cpu_seconds_total%22%2Cjob%3D%22apiserver%22%7D&match%5B%5D=%7Bname%3D%22process_resident_memory_bytes%22%2Cjob%3D~%22apiserver%7Cetcd%22%7D&match%5B%5D=%7Bname%3D%22prometheus_operator_reconcile_errors_total%22%7D&match%5B%5D=%7Bname%3D%22prometheus_operator_reconcile_operations_total%22%7D&match%5B%5D=%7Bname%3D%22up%22%7D&match%5B%5D=%7Bname%3D%22workqueue_adds_total%22%2Cjob%3D%22apiserver%22%7D&match%5B%5D=%7Bname%3D%22workqueue_depth%22%2Cjob%3D%22apiserver%22%7D&match%5B%5D=%7Bname_%3D%22workqueue_queue_duration_seconds_bucket%22%2Cjob%3D%22apiserver%22%7D\": context deadline exceeded"
level=debug caller=logger.go:45 ts=2025-11-25T21:42:26.876809531Z component=forwarder component=metricsclient timeseriesnumber=3214
level=info caller=logger.go:50 ts=2025-11-25T21:42:27.013542172Z component=forwarder component=metricsclient msg="metrics pushed successfully"
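To quantify how large the failing request is, the URL from the log above can be parsed to count its match[] selectors and list any that appear more than once (node_cpu_seconds_total and node_memory_MemTotal_bytes both appear twice in the pasted URL). This is a minimal Python sketch; `summarize_matchers` is a hypothetical helper and the sample URL is shortened for illustration.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def summarize_matchers(url):
    """Return (total selector count, sorted list of duplicated selectors)."""
    query = urlsplit(url).query
    # parse_qs decodes "match%5B%5D" back to "match[]" and collects
    # repeated parameters into a list.
    selectors = parse_qs(query).get("match[]", [])
    counts = Counter(selectors)
    duplicates = sorted(s for s, n in counts.items() if n > 1)
    return len(selectors), duplicates


if __name__ == "__main__":
    sample = (
        "https://prometheus.example.invalid:9091/federate"
        "?match%5B%5D=%7Bname%3D%22up%22%7D"
        "&match%5B%5D=%7Bname%3D%22up%22%7D"
        "&match%5B%5D=%7Bname%3D%22ALERTS%22%7D"
    )
    total, duplicates = summarize_matchers(sample)
    print(total, duplicates)
```

Removing duplicated selectors will not fix the timeout on its own, but the count gives a baseline for comparing request size before and after any allowlist trimming.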