- Bug
- Resolution: Cannot Reproduce
- Critical
- None
- None
- None
job link
This is coming from the test case "Check if alerts are firing during or after upgrade success", and the test log snippet is here:
{noformat}
May 4 09:58:01.856: Unexpected alerts fired or pending during the upgrade:
alert TargetDown fired for 30 seconds with labels: {job="dns-default", namespace="openshift-dns", service="dns-default", severity="warning"}
alert TargetDown fired for 60 seconds with labels: {job="network-metrics-service", namespace="openshift-multus", service="network-metrics-service", severity="warning"}
alert TargetDown fired for 60 seconds with labels: {job="ovnkube-node", namespace="openshift-ovn-kubernetes", service="ovn-kubernetes-node", severity="warning"}
alert TargetDown fired for 90 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}

Failure May 4 09:58:01.856: Unexpected alerts fired or pending during the upgrade:
alert TargetDown fired for 30 seconds with labels: {job="dns-default", namespace="openshift-dns", service="dns-default", severity="warning"}
alert TargetDown fired for 60 seconds with labels: {job="network-metrics-service", namespace="openshift-multus", service="network-metrics-service", severity="warning"}
alert TargetDown fired for 60 seconds with labels: {job="ovnkube-node", namespace="openshift-ovn-kubernetes", service="ovn-kubernetes-node", severity="warning"}
alert TargetDown fired for 90 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}

github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001790e10, 0xc0009c3110)
	github.com/openshift/origin/test/extended/util/disruption/disruption.go:192 +0x32f
k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6a
created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0x8c
{noformat}
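TargetDown fires when the Prometheus `up` metric for a scrape target drops to 0. For anyone trying to reproduce this on a live cluster (this does not help with the finished CI job itself), a query along these lines can show which of the jobs named in the alert labels are currently failing their scrapes. This is a hedged sketch: it assumes the default openshift-monitoring stack (thanos-querier route, prometheus-k8s service account) and an oc new enough to support `oc create token`.

{code:bash}
# Sketch: ask the cluster's query endpoint which of the scrape targets from the
# alert labels above are down right now. Assumes the standard openshift-monitoring
# thanos-querier route and prometheus-k8s service account, and oc 4.11+ for `oc create token`.
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
TOKEN=$(oc -n openshift-monitoring create token prometheus-k8s)
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
  --data-urlencode 'query=up{namespace=~"openshift-dns|openshift-multus|openshift-ovn-kubernetes|openshift-monitoring",job=~"dns-default|network-metrics-service|ovnkube-node|node-exporter"} == 0'
{code}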
["oc get pods"|] that is captured at the end of the job looks like it might show that the pods related to the
above alerts have some restart counts that we might not expect. It appears that the alerts started firing
toward the end of the final node was coming back up from it's upgrade reboot (FWIW).
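To surface just those restart counts from a live cluster, something like the following works. This is a sketch, not part of the job's tooling; the namespace list is taken from the alert labels above, and the custom-columns expression is only one of several ways to get the same information.

{code:bash}
# List pods with their per-container restart counts in the namespaces from the alerts above.
for ns in openshift-dns openshift-multus openshift-ovn-kubernetes openshift-monitoring; do
  echo "== $ns =="
  oc -n "$ns" get pods \
    -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[*].restartCount'
done
{code}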
Link to this job's TestGrid for reference.