OpenShift Bugs: OCPBUGS-36808

Metrics API disruption regressed during node reboot with aws ovn upgrade job

    • None
    • MON Sprint 258, MON Sprint 259
    • 2
    • Rejected
    • False

      Description of problem:

      TRT received an alert for a disruption regression on the metrics-api backend. The 95th percentile (P95) of disruption appears to be about 5s worse than the 4.16 GA data.
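
To make the comparison concrete, here is a minimal sketch of how a P95 regression against a baseline could be checked. It assumes a nearest-rank percentile, which may not match how the dashboard computes P95. The sample values, the 5s threshold argument, and the names `percentile` and `p95_regression` are hypothetical and are not taken from the CI data.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_regression(baseline, current, threshold_s):
    """Return (delta, regressed), where delta is current P95 minus baseline P95 in seconds."""
    delta = percentile(current, 95) - percentile(baseline, 95)
    return delta, delta >= threshold_s


# Hypothetical per-job disruption totals in seconds (illustrative only, not real job data).
baseline_416 = [0, 0, 1, 1, 2, 2, 2, 3, 3, 4]
current_417 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

delta, regressed = p95_regression(baseline_416, current_417, threshold_s=5)
print(f"P95 delta: {delta}s, regressed: {regressed}")
```

With real data, the two lists would be the per-job disruption seconds for the 4.16 GA baseline and the current 4.17 runs, filtered to the same platform, network, and upgrade type as the dashboard query.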
      
      Here is a dashboard where you can view the trend:
      
      https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-percentile=P95&var-platform=aws&var-backend=metrics-api-new-connections&var-upgrade_type=minor&var-master_nodes_updated=Y&var-architectures=amd64&var-topologies=ha&var-networks=ovn&var-releases=4.17
      
      You can also scroll down to the "Last 500 Job" panel to see the list of jobs with their disruption counts. Click any of them to open the corresponding Prow job and investigate further.
      
      Here is an example job: 
      
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1810404925734129664
      
      The disruption appears to occur consistently while one of the master nodes is being rebooted.
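
One way to check the reboot correlation is to test whether each disruption interval overlaps a master reboot window. The following is a rough sketch under the assumption that both sets of intervals have already been extracted from a job's artifacts as start/end timestamps. The node names, timestamps, and the function `disruptions_during_reboots` are hypothetical.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def disruptions_during_reboots(disruptions, reboots):
    """Pair each disruption interval with the first reboot window it overlaps, if any."""
    hits = []
    for d_start, d_end in disruptions:
        for node, r_start, r_end in reboots:
            if overlaps(d_start, d_end, r_start, r_end):
                hits.append(((d_start, d_end), node))
                break
    return hits


def ts(value):
    """Parse an ISO 8601 timestamp without a timezone suffix."""
    return datetime.fromisoformat(value)


# Hypothetical intervals (illustrative only, not taken from the example job).
reboots = [
    ("master-0", ts("2024-07-08T20:10:00"), ts("2024-07-08T20:14:00")),
    ("master-1", ts("2024-07-08T20:20:00"), ts("2024-07-08T20:24:00")),
]
disruptions = [
    (ts("2024-07-08T20:11:30"), ts("2024-07-08T20:11:35")),  # inside master-0 reboot
    (ts("2024-07-08T20:30:00"), ts("2024-07-08T20:30:02")),  # outside any reboot
]

hits = disruptions_during_reboots(disruptions, reboots)
print(f"{len(hits)} of {len(disruptions)} disruptions overlap a master reboot")
```

If nearly all disruption intervals across many jobs fall inside reboot windows, that would support the observation above. A low overlap ratio would point elsewhere.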
      
      
      
          

      How reproducible:

          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

          

              rh-ee-amrini Ayoub Mrini
              kenzhang@redhat.com Ken Zhang
              Junqi Zhao Junqi Zhao
              Votes: 0
              Watchers: 6

                Created:
                Updated:
                Resolved: