OpenShift SDN / SDN-3024

TargetDown alerts firing for multiple services


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • Sprint: SDN Sprint 219

      job link

      must-gather

      This is coming from the test case "Check if alerts are firing during or after upgrade success", and the test log snippet is here:

      {May  4 09:58:01.856: Unexpected alerts fired or pending during the upgrade:
      
      alert TargetDown fired for 30 seconds with labels: {job="dns-default", namespace="openshift-dns", service="dns-default", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="network-metrics-service", namespace="openshift-multus", service="network-metrics-service", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="ovnkube-node", namespace="openshift-ovn-kubernetes", service="ovn-kubernetes-node", severity="warning"}
      alert TargetDown fired for 90 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"} Failure May  4 09:58:01.856: Unexpected alerts fired or pending during the upgrade:
      
      alert TargetDown fired for 30 seconds with labels: {job="dns-default", namespace="openshift-dns", service="dns-default", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="network-metrics-service", namespace="openshift-multus", service="network-metrics-service", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="ovnkube-node", namespace="openshift-ovn-kubernetes", service="ovn-kubernetes-node", severity="warning"}
      alert TargetDown fired for 90 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}
      
      github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001790e10, 0xc0009c3110)
      	github.com/openshift/origin/test/extended/util/disruption/disruption.go:192 +0x32f
      k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
      	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6a
      created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
      	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0x8c}
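
      For context on what the test is asserting, here is a minimal sketch (not the origin test code, and not the exact TargetDown rule) that polls the Prometheus /api/v1/alerts endpoint and prints any TargetDown alerts that are pending or firing. PROM_URL and PROM_TOKEN are placeholders for the openshift-monitoring Prometheus route and a token that can reach it.

      package main

      import (
          "crypto/tls"
          "encoding/json"
          "fmt"
          "net/http"
          "os"
      )

      // alertsResponse models the subset of the Prometheus /api/v1/alerts
      // payload we care about: the label set and state of each active alert.
      type alertsResponse struct {
          Status string `json:"status"`
          Data   struct {
              Alerts []struct {
                  Labels map[string]string `json:"labels"`
                  State  string            `json:"state"`
              } `json:"alerts"`
          } `json:"data"`
      }

      func main() {
          // Placeholders (assumptions, not part of the job): the Prometheus route URL
          // and a bearer token, e.g. from `oc whoami -t`.
          promURL := os.Getenv("PROM_URL")
          token := os.Getenv("PROM_TOKEN")

          // Skipping TLS verification keeps the sketch short; acceptable only for a
          // quick ad-hoc check against a throwaway CI cluster.
          client := &http.Client{Transport: &http.Transport{
              TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
          }}

          req, err := http.NewRequest("GET", promURL+"/api/v1/alerts", nil)
          if err != nil {
              panic(err)
          }
          req.Header.Set("Authorization", "Bearer "+token)

          resp, err := client.Do(req)
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var out alertsResponse
          if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
              panic(err)
          }

          // Print only the TargetDown alerts, mirroring the labels quoted above.
          for _, a := range out.Data.Alerts {
              if a.Labels["alertname"] != "TargetDown" {
                  continue
              }
              fmt.Printf("TargetDown %s: job=%s namespace=%s service=%s\n",
                  a.State, a.Labels["job"], a.Labels["namespace"], a.Labels["service"])
          }
      }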
      

      The "oc get pods" output that is captured at the end of the job looks like it might show that the pods related to the
      above alerts have restart counts that we might not expect (a client-go sketch for pulling those counts is below). It appears that the alerts started firing
      toward the end of the upgrade, as the final node was coming back up from its upgrade reboot (FWIW).
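
      As a quick way to double-check that restart theory outside the job artifacts, a small client-go sketch along these lines (assuming a kubeconfig path in $KUBECONFIG; this is not part of the job tooling) lists the non-zero container restart counts for the namespaces named in the alerts:

      package main

      import (
          "context"
          "fmt"
          "os"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Assumption: KUBECONFIG points at a kubeconfig for the CI cluster.
          config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
          if err != nil {
              panic(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              panic(err)
          }

          // Namespaces taken from the TargetDown alert labels quoted above.
          namespaces := []string{
              "openshift-dns",
              "openshift-multus",
              "openshift-ovn-kubernetes",
              "openshift-monitoring",
          }

          for _, ns := range namespaces {
              pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{})
              if err != nil {
                  panic(err)
              }
              for _, pod := range pods.Items {
                  var restarts int32
                  for _, cs := range pod.Status.ContainerStatuses {
                      restarts += cs.RestartCount
                  }
                  // Only report pods that restarted at all.
                  if restarts > 0 {
                      fmt.Printf("%s/%s restarts=%d\n", ns, pod.Name, restarts)
                  }
              }
          }
      }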

      Link to this job's testgrid for reference.

       

            Assignee: Mohamed Mahmoud (mmahmoud@redhat.com)
            Reporter: Jamo Luhrsen (jluhrsen)
