OCPBUGS-16615

Prometheus reporting telemetry test intermittent failures due to server side rate limiting

    • No
    • Rejected
    • False
    • N/A
    • Release Note Not Required

      Description of problem:

      The TRT ComponentReadiness tool shows what looks like a regression in the "[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster should report telemetry [Late] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]" test:
      
      https://sippy.dptools.openshift.org/sippy-ng/component_readiness/test_details?arch=amd64&baseEndTime=2023-05-16%2023%3A59%3A59&baseRelease=4.13&baseStartTime=2023-04-16%2000%3A00%3A00&capability=Other&component=Monitoring&confidence=95&environment=ovn%20no-upgrade%20amd64%20aws%20hypershift&excludeArches=heterogeneous%2Carm64%2Cppc64le%2Cs390x&groupBy=cloud%2Carch%2Cnetwork&ignoreDisruption=true&ignoreMissing=false&minFail=3&network=ovn&pity=5&platform=aws&sampleEndTime=2023-07-20%2023%3A59%3A59&sampleRelease=4.14&sampleStartTime=2023-07-13%2000%3A00%3A00&testId=openshift-tests%3A79898d2e28b78374d89e10b38f88107b&testName=%5Bsig-instrumentation%5D%20Prometheus%20%5Bapigroup%3Aimage.openshift.io%5D%20when%20installed%20on%20the%20cluster%20should%20report%20telemetry%20%5BLate%5D%20%5BSkipped%3ADisconnected%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D&upgrade=no-upgrade&variant=hypershift
      
      In the ComponentReadiness link above, you can see the sample runs (linked with red "F").
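
The comparison windows are easier to read once the link's query string is percent-decoded. The following is a minimal sketch that decodes a subset of the parameters copied verbatim from the link; the helper name `decode_params` is made up for illustration and is not part of any Sippy tooling.

```python
from urllib.parse import parse_qs

# Subset of the query string from the ComponentReadiness link,
# copied verbatim (still percent-encoded).
QUERY = (
    "baseRelease=4.13"
    "&baseStartTime=2023-04-16%2000%3A00%3A00"
    "&baseEndTime=2023-05-16%2023%3A59%3A59"
    "&sampleRelease=4.14"
    "&sampleStartTime=2023-07-13%2000%3A00%3A00"
    "&sampleEndTime=2023-07-20%2023%3A59%3A59"
    "&component=Monitoring&platform=aws&network=ovn&variant=hypershift"
    "&confidence=95&minFail=3&pity=5"
)


def decode_params(query: str) -> dict[str, str]:
    """Percent-decode a query string into a flat name -> value mapping.

    Hypothetical helper: keeps only the first value of each parameter,
    which is enough here because no parameter repeats.
    """
    return {name: values[0] for name, values in parse_qs(query).items()}


params = decode_params(QUERY)
for name in (
    "baseRelease", "baseStartTime", "baseEndTime",
    "sampleRelease", "sampleStartTime", "sampleEndTime",
    "variant",
):
    print(f"{name:16} {params[name]}")
```

Decoded this way, the base window is 4.13 from 2023-04-16 to 2023-05-16 and the sample window is 4.14 from 2023-07-13 to 2023-07-20, restricted to the hypershift variant.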

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      Intermittent: the pass rate is 100% in 4.13 vs. 81% in 4.14.

      Steps to Reproduce:

      1.  The query above focuses on "periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance" jobs and the specific test mentioned. You can see the failures by clicking on the red "F"s.

      Actual results:

      The failures look like:
      
      {  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:365]: Unexpected error:
          <errors.aggregate | len:2, cap:2>: 
          [promQL query returned unexpected results:
          metricsclient_request_send{client="federate_to",job="telemeter-client",status_code="200"} >= 1
          [], promQL query returned unexpected results:
          federate_samples{job="telemeter-client"} >= 10
          []]
          [
              <*errors.errorString | 0xc0017611b0>{
                  s: "promQL query returned unexpected results:\nmetricsclient_request_send{client=\"federate_to\",job=\"telemeter-client\",status_code=\"200\"} >= 1\n[]",
              },
              <*errors.errorString | 0xc00203d380>{
                  s: "promQL query returned unexpected results:\nfederate_samples{job=\"telemeter-client\"} >= 10\n[]",
              },
          ]

      Expected results:

      Both promQL queries should return non-empty results.
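
For anyone reproducing this by hand, the check can be sketched outside the test suite. In PromQL, a comparison such as `>= 1` filters the vector, so an empty result (`[]`) means no series met the threshold. The sketch below is an assumption-laden illustration, not the code in `prometheus.go`: it uses the standard Prometheus HTTP API endpoint `/api/v1/query`, while the `fetch` and `empty_queries` helpers, the base URL, and the bearer-token authentication are hypothetical choices for the example. The canned responses stand in for a live cluster.

```python
import json
import urllib.parse
import urllib.request

# The two queries from the failure output.
QUERIES = [
    'metricsclient_request_send{client="federate_to",job="telemeter-client",status_code="200"} >= 1',
    'federate_samples{job="telemeter-client"} >= 10',
]


def fetch(base_url: str, query: str, token: str) -> dict:
    """Run an instant query via GET /api/v1/query and return the parsed JSON.

    Assumption: the Prometheus endpoint accepts a bearer token; adjust
    the authentication to match how your cluster exposes Prometheus.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def empty_queries(responses: dict[str, dict]) -> list[str]:
    """Return the queries whose result vector is empty."""
    return [q for q, body in responses.items() if not body["data"]["result"]]


# Canned responses reproducing the failure: both result vectors are empty.
failing = {
    q: {"status": "success", "data": {"resultType": "vector", "result": []}}
    for q in QUERIES
}

# Canned responses for a healthy run: each query returns one sample.
# The sample values and timestamp are made up for illustration.
passing = {
    q: {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {"job": "telemeter-client"}, "value": [1689800000, "12"]}],
        },
    }
    for q in QUERIES
}

for name, responses in (("failing", failing), ("passing", passing)):
    print(name, "->", len(empty_queries(responses)), "empty queries")
```

Against a live cluster, `responses` would instead be built with `{q: fetch(base_url, q, token) for q in QUERIES}`.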

      Additional info:

      I set the severity to Major because this looks like a regression relative to the pass rate in the 5 weeks before 4.13 went GA.

            rh-ee-dmistry Deep Mistry
            dperique@redhat.com Dennis Periquet
            Junqi Zhao Junqi Zhao