  1. OCP Technical Release Team
  2. TRT-1794

Write monitortest for catching disruption near the start of first e2e test


    • Type: Story
    • Resolution: Done
    • Priority: Major

      This relates to a component regression we found that no existing test clearly catches. Sample runs:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-kube-apiserver-rollout/1827763939853733888

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-ipv4/1826908352773361664

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.17-e2e-metal-ipi-ovn-dualstack/1828844069434953728

      All three runs show a pattern. The actual test failures look unpredictable: some tests pass while, at the same time, others fail to talk to the apiserver.

      The pattern we see is one or more tests failing right at the start of e2e testing, accompanied by disruption, etcd log messages indicating slowness, and etcd leadership state changes.

      Because the failing tests are unpredictable, we'd like a test that catches this symptom. We think the safest way to do this is to look for disruption within X minutes of the first e2e test.

      This would be implemented as a monitortest, likely somewhere around here: https://github.com/openshift/origin/blob/master/pkg/monitortests/kubeapiserver/legacykubeapiservermonitortests/monitortest.go

      It would also be reasonable to add a new monitortest in the parent package above this level.

      The test would need to do the following:

      • scan final intervals for the earliest interval with source=SourceE2ETest (constant in monitorapi/types.go), save its start time
      • scan final intervals for those with source=SourceDisruption, reason=DisruptionBegan, and a backend matching one of the apiservers (kube, openshift, oauth)
      • flake the test (return a failure junit result + a success junit result) if we see any such disruption intervals within X minutes of that first e2e test.
      • Choose X based on what we see in the above links.

              lmeyer@redhat.com Luke Meyer
              rhn-engineering-dgoodwin Devan Goodwin