Knative Serving / SRVKS-776

AutoscaleSustainingWithTBCTest failing during upgrade from 1.15 to 1.16


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version/s: 1.16.0
    • Fix Version/s: 1.16.0

      The test fails as follows:

      autoscaler.go:87: Error:  request success rate under SLO: total = 29212, errors = 53, rate = 0.998186, SLO = 0.999000
      
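      The arithmetic behind that assertion is simple; the sketch below (not the actual test code) just reproduces the numbers from the log line:

      package main

      import "fmt"

      // checkSLO reports the observed success rate and whether it meets the SLO.
      // Sketch only; the real check is the assertion at autoscaler.go:87.
      func checkSLO(total, errs int, slo float64) (float64, bool) {
          rate := float64(total-errs) / float64(total)
          return rate, rate >= slo
      }

      func main() {
          rate, ok := checkSLO(29212, 53, 0.999)
          fmt.Printf("rate = %f, meets SLO: %v\n", rate, ok) // rate = 0.998186, meets SLO: false
      }
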

      And errors like this one appear during the run:

      2021-06-30T05:18:23.015-0400	INFO	e2e/autoscale.go:164	Status = 502, want: 200
      2021-06-30T05:18:23.015-0400	INFO	e2e/autoscale.go:165	URL: http://serverless-upgrade-c-olrfphrl-serving-tests.apps.ocf-rollup-1-16-11-rolling-upgrade-1.16.openshift-aws.rhocf-dev.net?sleep=500 Start: 2021-06-30T05:18:22-04:00 End: 2021-06-30T05:18:23-04:00 Duration: 79.75504ms Error: 502 Bad Gateway Body:
      dial tcp 10.128.2.63:8012: connect: connection refused
      

      The full test log can be found in this job: https://master-jenkins-csb-serverless-qe.apps.ocp4.prod.psi.redhat.com/job/functional_tests/job/stream1_16/job/rolling-upgrade-1.16/11/console

      When upgrading from Serving 0.21 to 0.22, the queue-proxy container fails with:

      25m         Warning   Unhealthy                      pod/serverless-upgrade-c-olrfphrl-00001-deployment-c54b46dcd-g4fqx     Startup probe failed: flag provided but not defined: -probe-period
      Usage of /ko-app/queue:
        -probe-timeout duration
                  run startup probe with given timeout (default -1ns)
      
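      This is standard Go flag-parsing behavior: the queue binary in the pod defines only -probe-timeout, so the extra -probe-period argument supplied by the startup probe makes it exit. A minimal, self-contained sketch of that failure mode (illustrative only, not the actual queue-proxy main):

      package main

      import (
          "flag"
          "os"
          "time"
      )

      func main() {
          // Mimic a queue binary that knows -probe-timeout but not -probe-period.
          fs := flag.NewFlagSet("/ko-app/queue", flag.ContinueOnError)
          fs.Duration("probe-timeout", -1*time.Nanosecond, "run startup probe with given timeout")

          // The startup probe also passes -probe-period; parsing fails.
          if err := fs.Parse([]string{"-probe-timeout", "10m", "-probe-period", "1s"}); err != nil {
              // flag.Parse has already printed "flag provided but not defined: -probe-period"
              // plus the usage text, mirroring the event above; the probe command exits non-zero.
              os.Exit(1)
          }
      }
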

      The containers are restarted, but some of their endpoints linger for a while in the notReady set, and requests that reach those endpoints produce the errors above:

      {"severity":"DEBUG","timestamp":"2021-06-30T09:17:32.390327728Z","logger":"activator","caller":"net/revision_backends.go:346","message":"Revision state","knative.dev/controller":"activator","knative.dev/pod":"activator-6b498d855d-jk7lk","knative.dev/key":"serving-tests/serverless-upgrade-c-olrfphrl-00001","dests":{"ready":"10.131.0.67:8012,10.131.0.68:8012,10.131.0.69:8012,10.129.2.46:8012,10.130.2.52:8012,10.130.2.53:8012,10.130.2.54:8012,10.128.2.52:8012,10.128.2.53:8012,10.129.2.49:8012","notReady":"10.129.2.78:8012"}
      
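      A simplified model of the ready/notReady split in that log line (this is only a sketch to illustrate the symptom, not the activator's actual data structures):

      package main

      import "fmt"

      // reconcileDests is a hedged sketch of the behavior at play here: a
      // notReady address is re-probed until it either passes the probe or
      // disappears from the revision's endpoints, at which point it is dropped.
      // While a stale address from a restarted pod is still listed, requests
      // and probes to it fail with "connection refused", surfacing as the
      // 502 responses seen by the test.
      func reconcileDests(ready, notReady, current map[string]bool) {
          for dest := range notReady {
              if !current[dest] {
                  delete(notReady, dest) // pod is gone; stop probing it
              }
          }
          for dest := range ready {
              if !current[dest] {
                  delete(ready, dest)
              }
          }
      }

      func main() {
          ready := map[string]bool{"10.131.0.67:8012": true}
          notReady := map[string]bool{"10.129.2.78:8012": true}
          // The endpoints object no longer lists the stale address.
          current := map[string]bool{"10.131.0.67:8012": true}

          reconcileDests(ready, notReady, current)
          fmt.Println("ready:", ready, "notReady:", notReady)
      }
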

      In the end these endpoints are removed from the notReady list, but the test still fails because it enforces a success-rate SLO of 0.999.
      The activator pods log this warning:

      {"severity":"WARNING","timestamp":"2021-06-30T09:17:32.390292207Z","logger":"activator","caller":"net/revision_backends.go:286","message":"Failed probing pods","knative.dev/controller":"activator","knative.dev/pod":"activator-6b498d855d-jk7lk","knative.dev/key":"serving-tests/serverless-upgrade-c-olrfphrl-00001","curDests":{"ready":"10.130.2.54:8012,10.128.2.52:8012,10.128.2.53:8012,10.129.2.49:8012,10.131.0.67:8012,10.131.0.68:8012,10.131.0.69:8012,10.129.2.46:8012,10.130.2.52:8012,10.130.2.53:8012","notReady":"10.129.2.78:8012"},"error":"unexpected body: want \"queue\", got \"\""}
      
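      That warning comes from the activator probing the notReady destination and expecting a specific response body from the queue-proxy. The sketch below approximates that check (the actual probing logic in Knative uses its own request headers and retry handling; this is only an illustration):

      package main

      import (
          "fmt"
          "io"
          "net/http"
          "time"
      )

      // probeQueueProxy approximates the body check behind the warning: the
      // prober expects the queue-proxy on port 8012 to answer with the literal
      // body "queue". A refused connection or an empty body (for example from
      // a pod that is mid-restart) fails the probe, so the address stays in
      // the notReady set.
      func probeQueueProxy(dest string) error {
          client := &http.Client{Timeout: time.Second}
          resp, err := client.Get("http://" + dest)
          if err != nil {
              return err
          }
          defer resp.Body.Close()

          body, err := io.ReadAll(resp.Body)
          if err != nil {
              return err
          }
          if string(body) != "queue" {
              return fmt.Errorf("unexpected body: want %q, got %q", "queue", string(body))
          }
          return nil
      }

      func main() {
          if err := probeQueueProxy("10.129.2.78:8012"); err != nil {
              fmt.Println("probe failed:", err)
          }
      }
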

      Assignee: Markus Thömmes (Inactive)
      Reporter: Martin Gencur