OpenShift Bugs / OCPBUGS-53393

[FLAKE] Test Missing metrics in kube-state-metrics fails randomly


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Affects Version: 4.19
    • Component: Test Framework
    • Quality / Stability / Reliability
    • Severity: Moderate

      Description of problem:

          The test "[sig-monitoring] Cluster_Observability parallel monitoring Author:tagao-LEVEL0-Medium-55767-Missing metrics in kube-state-metrics" is error-prone and fails randomly in CI.
          The reason: the test queries Thanos Querier for the list of available metrics and expects "kube_pod_container_status_terminated_reason" to be among them. However, this metric is only returned when the cluster has some Pods in the "Terminated" state.
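Whether a cluster currently satisfies this condition can be checked directly. A minimal sketch (the `terminated_pods` helper is hypothetical and not part of the test suite; `jq` is assumed to be available):

```shell
# Hypothetical helper: print pods that currently have at least one container
# in the Terminated state. When it prints nothing, kube-state-metrics has no
# kube_pod_container_status_terminated_reason series to expose, and the test fails.
terminated_pods() {
  # expects the output of `oc get pods -A -o json` on stdin
  jq -r '.items[]
         | select(any(.status.containerStatuses[]?; .state.terminated != null))
         | "\(.metadata.namespace)/\(.metadata.name)"'
}

# usage:
#   oc get pods -A -o json | terminated_pods
```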

      Version-Release number of selected component (if applicable):

          4.19

      How reproducible:

      Always

      Steps to Reproduce:

          1. Make sure the cluster doesn't have any Pods in Terminated state
          2. Run extended-platform-tests run-test "[sig-monitoring] Cluster_Observability parallel monitoring Author:tagao-LEVEL0-Medium-55767-Missing metrics in kube-state-metrics"
          3. The test fails with the result below.

      Actual results:

      The test fails with:

        {  fail [github.com/openshift/openshift-tests-private/test/extended/util/assert.go:30]: Unexpected error:
          <*errors.errorString | 0xc0017a43e0>: 
          case: [sig-monitoring] Cluster_Observability parallel monitoring Author:tagao-LEVEL0-Medium-55767-Missing metrics in kube-state-metrics
          error: The metrics {"status":"success","data":["ALERTS","ALERTS_FOR_STATE","aggregator_discovery_aggregation_count_total","aggregator_unavailable_apiservice","apiextensions_apiserver_validation_ratcheting_seconds_bucket","apiextensions_apiserver_validation_ratcheting_seconds_count","apiextensions_apiserver_validation_ratcheting_seconds_sum","apiextensions_openapi_v2_regeneration_count","apiextensions_openapi_v3_regeneration_count","apiserver_admission_controller_admission_duration_seconds_bucket","apiserver_admission_controller_admission_duration_seconds_count","apiserver_admission_controller_admission_duration_seconds_sum","apiserver_admission_match_condition_evaluation_seconds_bucket","apiserver_admission_match_condition_evaluation_seconds_count","apiserver_admission_match_condition_evaluation_seconds_sum","apiserver_admission_match_condition_exclusions_total","apiserver_admission_step_admission_duration_seconds_bucket","apiserver_admission_step_admission_duration_seconds_count","apiserver_admission_step_admission_duration_seconds_sum","apiserver_admission_step_admission_duration_seconds_summary","apiserver_admission_step_admission_duration_seconds_summary_count","apiserver_admission_step_admission_duration_seconds_summary_sum","apiserver_admission_webhook_admission_duration_seconds_bucket","apiserver_admission_webhook_admission_duration_seconds_count","apiserver_admission_webhook_admission_duration_seconds_sum","apiserver_admission_webhook_request_total","apiserver_audit_event_total","apiserver_audit_level_total","apiserver_audit_requests_rejected_total","apiserver_authorization_decisions_total","apiserver_cache_list_fetched_objects_total","apiserver_cache_list_returned_objects_total","apiserver_cache_list_total","apiserver_cel_compilation_duration_seconds_bucket","apiserver_cel_compilation_duration_seconds_count","apiserver_cel_compilation_duration_seconds_sum","apiserver_cel_evaluation_duration_seconds_bucket","apiserver_cel_evaluation_duration_second
s_count","apiserver_cel_evaluation_duration_seconds_sum","apiserver_certificates_registry_csr_honored_duration_total","apiserver_certificates_registry_csr_requested_duration_total","apiserver_client_certificate_expiration_seconds_bucket","apiserver_client_certificate_expiration_seconds_count","apiserver_client_certificate_expiration_seconds_sum","apiserver_clusterip_repair_ip_errors_total","apiserver_clusterip_repair_reconcile_errors_total","apiserver_conversion_webhook_duration_seconds_bucket","apiserver_conversion_webhook_duration_seconds_count","apiserver_conversion_webhook_duration_seconds_sum","apiserver_conversion_webhook_request_total","apiserver_current_inflight_requests","apiserver_current_inqueue_requests","apiserver_delegated_authn_request_duration_seconds_bucket","apiserver_delegated_authn_request_duration_seconds_count","apiserver_delegated_authn_request_duration_seconds_sum","apiserver_delegated_authn_request_total","apiserver_egress_dialer_dial_duration_seconds_bucket","apiserver_egress_dialer_dial_duration_seconds_count","apiserver_egress_dialer_dial_duration_seconds_sum","apiserver_egress_dialer_dial_start_total","apiserver_envelope_encryption_dek_cache_fill_percent","apiserver_flowcontrol_current_executing_requests","apiserver_flowcontrol_current_executing_seats","apiserver_flowcontrol_current_inqueue_requests","apiserver_flowcontrol_current_inqueue_seats","apiserver_flowcontrol_current_limit_seats","apiserver_flowcontrol_current_r","apiserver_flowcontrol_demand_seats_average","apiserver_flowcontrol_demand_seats_bucket","apiserver_flowcontrol_demand_seats_count","apiserver_flowcontrol_demand_seats_high_watermark","apiserver_flowcontrol_demand_seats_smoothed","apiserver_flowcontrol_demand_seats_stdev","apiserver_flowcontrol_demand_seats_sum","apiserver_flowcontrol_dispatch_r","apiserver_flowcontrol_dispatched_requests_total","apiserver_flowcontrol_latest_s","apiserver_flowcontrol_lower_limit_seats","apiserver_flowcontrol_next_discounted_s_bounds","a
piserver_flowcontrol_next_s_bounds","apiserver_flowcontrol_nominal_limit_seats","apiserver_flowcontrol_priority_level_request failed to contain "kube_pod_container_status_terminated_reason"
      
      On the other hand, the metric "kube_pod_container_status_last_terminated_reason" is present because it looks at all Pods and their history of terminated reasons, so it does not require any Pods to currently be in the Terminated state.
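This difference can be observed by counting each metric name in the Thanos Querier label-values response. A small sketch mirroring the queries in the verification steps (the `count_metric` helper and `$TOKEN` are assumptions, not part of the test):

```shell
# Hypothetical helper: count occurrences of a metric name in the JSON returned
# by /api/v1/label/__name__/values (expects the response on stdin).
count_metric() {
  jq --arg m "$1" '[.data[] | select(. == $m)] | length'
}

# usage (same endpoint as in the verification steps):
#   oc rsh -n openshift-monitoring prometheus-k8s-0 \
#     curl -G -k -s -H "Authorization: Bearer $TOKEN" \
#     https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values \
#   | count_metric kube_pod_container_status_last_terminated_reason
```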

      Expected results:

          The test always passes.

      Additional info:

          The failure was seen in this run: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/62770/rehearse-62770-periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-cilium-hypershift-guest-f7/1902276515924545536
      
         This was executed as part of CI checks for this pull request: https://github.com/openshift/release/pull/62770
         Source for the test can be found here: https://github.com/openshift/openshift-tests-private/blob/master/test/extended/monitoring/monitoring.go#L243
      
      Additional verification:
      
      1) Printing metrics from the kube-state-metrics Pod shows that the metric "kube_pod_container_status_terminated_reason" is registered, but no Pods populate it (it is empty):
      
      ᐅ oc exec -ti $(oc get pod -l app.kubernetes.io/name=kube-state-metrics -o name) -c kube-state-metrics -- curl http://localhost:8081/metrics
      
      ...
      # HELP kube_pod_container_status_terminated_reason Describes the reason the container is currently in terminated state.
      # TYPE kube_pod_container_status_terminated_reason gauge
      # HELP kube_pod_container_status_waiting [STABLE] Describes whether the container is currently in waiting state.
      # TYPE kube_pod_container_status_waiting gauge
      kube_pod_container_status_waiting{namespace="cilium",pod="cilium-z5www",uid="5ed5cfed-135e-4676-ae11-75289e942785",container="cilium-agent"} 0
      kube_pod_container_status_waiting{namespace="openshift-multus",pod="multus-smcdh",uid="a25e6f18-f849-42e3-8ef8-5ab9f8a702e6",container="kube-multus"} 0
      ...
      
      2) The same query the test runs does not return the metric:
      
      ᐅ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -G -k -s -H "Authorization: Bearer <bearer_token>" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' > results.txt
      
      ᐅ grep kube_pod_container_status_terminated_reason results.txt | wc -l
      0
      
      3) Deploy a job that runs once and then the pod is in Terminated state:
      ᐅ oc apply -f job-pi.yaml -n openshift-monitoring
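      The content of job-pi.yaml is not included in this report. A minimal manifest that would produce the observed Pod (job "pi", container "pi", completing once) might look like the following sketch, based on the standard Kubernetes pi example, not the actual file:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
```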
      
      4) Verify kube-state-metrics actually registers the Pod and the reason is "Completed"
      ᐅ oc exec -ti $(oc get pod -l app.kubernetes.io/name=kube-state-metrics -o name) -c kube-state-metrics -- curl http://localhost:8081/metrics
      
      ...
      # HELP kube_pod_container_status_terminated_reason Describes the reason the container is currently in terminated state.
      # TYPE kube_pod_container_status_terminated_reason gauge
      kube_pod_container_status_terminated_reason{namespace="openshift-monitoring",pod="pi-74jf4",uid="68a73b0c-d952-4b59-9b50-39c14a69e0a0",container="pi",reason="Completed"} 1
      ...
      
      
      5) Check that the metric is there when querying Prometheus:
      ᐅ oc rsh -n openshift-monitoring prometheus-k8s-0 sh -c 'curl -G -k -s -H "Authorization: Bearer <bearer_token>" https://thanos-querier.openshift-monitoring.svc:9091/api/v1/label/__name__/values' > results.txt
      
      ᐅ grep kube_pod_container_status_terminated_reason results.txt | wc -l
      1

       

              People: Tai Gao (tagao@redhat.com), Martin Gencur (mgencur@redhat.com)
              Votes: 0
              Watchers: 5