OpenShift Core Networking / CORENET-2207

TargetDown alerts firing for multiple services


    • Sprint: SDN Sprint 219

      job link

      must-gather

      This comes from the test case "Check if alerts are firing during or after upgrade success"; the relevant snippet from the test log is:

      {May  4 09:58:01.856: Unexpected alerts fired or pending during the upgrade:
      
      alert TargetDown fired for 30 seconds with labels: {job="dns-default", namespace="openshift-dns", service="dns-default", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="network-metrics-service", namespace="openshift-multus", service="network-metrics-service", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="ovnkube-node", namespace="openshift-ovn-kubernetes", service="ovn-kubernetes-node", severity="warning"}
      alert TargetDown fired for 90 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"} Failure May  4 09:58:01.856: Unexpected alerts fired or pending during the upgrade:
      
      alert TargetDown fired for 30 seconds with labels: {job="dns-default", namespace="openshift-dns", service="dns-default", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="network-metrics-service", namespace="openshift-multus", service="network-metrics-service", severity="warning"}
      alert TargetDown fired for 60 seconds with labels: {job="ovnkube-node", namespace="openshift-ovn-kubernetes", service="ovn-kubernetes-node", severity="warning"}
      alert TargetDown fired for 90 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}
      
      github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001790e10, 0xc0009c3110)
      	github.com/openshift/origin/test/extended/util/disruption/disruption.go:192 +0x32f
      k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
      	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6a
      created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
      	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0x8c}
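
      For anyone digging into this on a live cluster, the signal behind TargetDown is the per-target "up" series, so a quick check is to ask the in-cluster query endpoint which of the scrape jobs named in the alert are reporting down. A minimal sketch, assuming the usual thanos-querier route in openshift-monitoring and bearer-token auth (route name and auth flow can vary by release; the job regex just reuses the labels from the alert above):

      TOKEN=$(oc whoami -t)
      HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
      # instant query: which of these scrape targets does Prometheus currently consider down?
      curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
        --data-urlencode 'query=up{job=~"dns-default|network-metrics-service|ovnkube-node|node-exporter"} == 0'

      For the upgrade window itself, the same expression can be run against /api/v1/query_range with explicit start/end timestamps to see when each target dropped out.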
      

      ["oc get pods"|] that is captured at the end of the job looks like it might show that the pods related to the
      above alerts have some restart counts that we might not expect. It appears that the alerts started firing
      toward the end of the final node was coming back up from it's upgrade reboot (FWIW).
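
      A minimal sketch of the restart-count check meant above, using only the namespaces from the alert labels (the output columns are an assumption about what is useful to eyeball, not what the job archives):

      # restart counts and node placement for the pods behind the alerting services
      for ns in openshift-dns openshift-multus openshift-ovn-kubernetes openshift-monitoring; do
        echo "== $ns =="
        oc get pods -n "$ns" \
          -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount,NODE:.spec.nodeName'
      done

      Cross-referencing the NODE column against the node reboot order should show whether the restarts line up with the final node's upgrade reboot mentioned above.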

      link to this job's testgrid for reference.

       

              Assignee: Mohamed Mahmoud (mmahmoud@redhat.com, Inactive)
              Reporter: Jamo Luhrsen (jluhrsen)