  Multiple Architecture Enablement / MULTIARCH-967

[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured: Prometheus query error


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Versions: 4.7.z, 4.7
    • Component: Multi-Arch
    • Sprint 10: 03/22 - 04/10

      +++ This bug was initially created as a clone of Bug #1936488 +++

      test:
      [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured

      is failing frequently in CI, see search results:
      https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-instrumentation%5C%5D%5C%5BLate%5C%5D+Alerts+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured

      We are seeing a lot of failures with the queries the test uses, for example:
      https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896

      https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1368886803049746432

      The test grids show that these errors have been appearing for a long time:
      https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8

      https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt
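
      For context, this test essentially asks Prometheus whether anything other than Watchdog or AlertmanagerReceiversNotConfigured is in the firing state. A minimal sketch of that kind of check against the Prometheus query API is below; PROM_URL and PROM_TOKEN are illustrative placeholders, and the real test in openshift/origin is wired up differently.

      package main

      import (
          "encoding/json"
          "fmt"
          "io"
          "net/http"
          "net/url"
          "os"
      )

      // Sketch only: list alerts that are firing apart from Watchdog and
      // AlertmanagerReceiversNotConfigured. PROM_URL and PROM_TOKEN are illustrative
      // placeholders for the Prometheus endpoint and a bearer token.
      func main() {
          query := `ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}`
          endpoint := os.Getenv("PROM_URL") + "/api/v1/query?query=" + url.QueryEscape(query)

          req, err := http.NewRequest("GET", endpoint, nil)
          if err != nil {
              panic(err)
          }
          req.Header.Set("Authorization", "Bearer "+os.Getenv("PROM_TOKEN"))

          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          body, err := io.ReadAll(resp.Body)
          if err != nil {
              panic(err)
          }

          // The v1 query API returns {"status": "...", "data": {"result": [...]}}.
          var out struct {
              Data struct {
                  Result []struct {
                      Metric map[string]string `json:"metric"`
                  } `json:"result"`
              } `json:"data"`
          }
          if err := json.Unmarshal(body, &out); err != nil {
              panic(err)
          }
          for _, r := range out.Data.Result {
              fmt.Printf("firing: %s (namespace=%s)\n", r.Metric["alertname"], r.Metric["namespace"])
          }
      }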

      — Additional comment from Simon Pasquier on 2021-03-08 09:33:51 CST —

      Looking at release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8 [1], there's a problem with DNS resolution for Alertmanager pods [2] which leads to "AlertmanagerMembersInconsistent" being fired. It should be redirected to the teams dealing with libvirt and/or ppc64 platforms because it's not something we see in other environments.

      Looking at periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt [3], "TargetDown" is firing for the node-exporter and kubelet targets. It means that for some reason, Prometheus fails to scrape metrics and I can see some failures in node_exporter's kube-rbac-proxy [4][5][6] that would match. Again since it is specific to a given job, it would be best redirected to the folks in charge of this job.

      [1] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896
      [2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-remote-libvirt-image-ecosystem-ppc64le-4.8/1368381121678544896/artifacts/e2e-remote-libvirt/pods/openshift-monitoring_alertmanager-main-0_alertmanager.log
      [3] https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1368886803049746432
      [4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-dnw48_kube-rbac-proxy.log
      [5] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-5jz89_node-exporter.log
      [6] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-gcp-rt/1367656589833539584/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods/openshift-monitoring_node-exporter-grgw8_kube-rbac-proxy.log
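
      To illustrate the first symptom: AlertmanagerMembersInconsistent fires when the Alertmanager pods cannot see each other as cluster peers, which is exactly what a DNS resolution failure for the peer hostnames would cause. A small sketch of that check from inside the cluster follows; the hostnames assume the usual openshift-monitoring naming and are only illustrative.

      package main

      import (
          "fmt"
          "net"
      )

      // Sketch only: verify from inside the cluster that the Alertmanager peer hostnames
      // resolve. The names assume the usual alertmanager-main StatefulSet behind the
      // alertmanager-operated headless service in openshift-monitoring; adjust as needed.
      func main() {
          peers := []string{
              "alertmanager-main-0.alertmanager-operated.openshift-monitoring.svc",
              "alertmanager-main-1.alertmanager-operated.openshift-monitoring.svc",
              "alertmanager-main-2.alertmanager-operated.openshift-monitoring.svc",
          }
          for _, host := range peers {
              addrs, err := net.LookupHost(host)
              if err != nil {
                  // A failure here matches the DNS problem described above and would
                  // keep the Alertmanager cluster members from finding each other.
                  fmt.Printf("%s: lookup failed: %v\n", host, err)
                  continue
              }
              fmt.Printf("%s -> %v\n", host, addrs)
          }
      }

      The second symptom (TargetDown for the node-exporter and kubelet targets) can be confirmed with a query such as up{job=~"node-exporter|kubelet"} == 0, using the same API access pattern as the sketch in the description above.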

      — Additional comment from Prashanth Sundararaman on 2021-03-09 19:42:22 CST —

      I see this issue on any libvirt deploy on bare metal; it is not specific to multi-arch. The alert that fires is:

      33.33% of the machine-api-controllers/machine-api-controllers targets in the openshift-machine-api namespace are down.

      This is due to the kube-rbac-proxy-machine-mtrc container reporting these errors:

      I0309 17:29:57.278737 1 main.go:159] Reading config file: /etc/kube-rbac-proxy/config-file.yaml
      I0309 17:29:57.285997 1 main.go:190] Valid token audiences:
      I0309 17:29:57.286071 1 main.go:278] Reading certificate files
      I0309 17:29:57.286336 1 main.go:311] Starting TCP socket on 0.0.0.0:8441
      I0309 17:29:57.287013 1 main.go:318] Listening securely on 0.0.0.0:8441
      2021/03/09 17:43:04 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:43:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:43:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:43:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:44:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:44:12 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:44:33 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:44:42 http: proxy error: dial tcp [::1]:8081: connect: connection refused
      2021/03/09 17:45:03 http: proxy error: dial tcp [::1]:8081: connect: connection refused

      — Additional comment from Andy McCrae on 2021-03-10 04:13:11 CST —

      This looks to be happening because a change to metrics went into the machine-api-operator, and the required provider change was not applied to the libvirt provider:

      https://github.com/openshift/machine-api-operator/pull/609

      This will impact all branches going back to 4.6; I'm looking into applying the change to the libvirt provider now.
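
      In other words, the kube-rbac-proxy sidecar has no upstream to proxy to because the libvirt provider's controller does not serve metrics on port 8081. The sketch below only shows the shape of such a change, a /metrics listener where the sidecar expects one; it is not the content of the PRs linked in this bug.

      package main

      import (
          "log"
          "net/http"

          "github.com/prometheus/client_golang/prometheus/promhttp"
      )

      // Illustrative only: the controller process in the provider pod needs to expose
      // /metrics on port 8081 so that the kube-rbac-proxy sidecar (whose
      // "dial tcp [::1]:8081: connect: connection refused" errors appear above) has an
      // upstream to forward scrapes to. The actual wiring is whatever the linked PRs do.
      func main() {
          mux := http.NewServeMux()
          mux.Handle("/metrics", promhttp.Handler())
          log.Fatal(http.ListenAndServe(":8081", mux))
      }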

      — Additional comment from Dan Li on 2021-03-11 07:13:45 CST —

      Hi Andy or Prashanth, as part of bug triaging, can we provide a "Severity" for this bug?

      — Additional comment from Andy McCrae on 2021-03-11 10:24:10 CST —

      I've marked this as Medium. This should only impact CI since the libvirt provider isn't used in supported environments, but nonetheless it impairs our ability to perform accurate CI runs.

      We have a PR up for master, but we will need to shepherd this through to 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218

      — Additional comment from Yaakov Selkowitz on 2021-03-12 12:57:50 CST —

      (In reply to Andy McCrae from comment #5)
      > We have a PR up for master, but we will need to shepherd this through to
      > 4.6: https://github.com/openshift/cluster-api-provider-libvirt/pull/218

      Automatic cherry-pick failed, so we'll need manual PRs for 4.7 and 4.6.

            Assignee: Prashanth Sundararaman (Inactive)
            Reporter: Yaakov Selkowitz (yselkowi@redhat.com)