OpenShift SDN / SDN-3022

ClusterOperatorDown alert firing


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major

      job link

      must-gather

      e2e log

      This is coming from the test case "Check if alerts are firing during or after upgrade success", and the test log snippet is here:

      {May  4 09:57:07.721: Unexpected alerts fired or pending during the upgrade:
      
      alert ClusterOperatorDown fired for 1290 seconds with labels: {endpoint="metrics", instance="10.0.184.18:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-6f9db9dd74-bfnzr", service="cluster-version-operator", severity="critical", version="4.10.12"}
      alert KubePodNotReady fired for 1650 seconds with labels: {namespace="openshift-monitoring", pod="alertmanager-main-0", severity="warning"}
      alert KubeStatefulSetReplicasMismatch fired for 1170 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="alertmanager-main"} Failure May  4 09:57:07.721: Unexpected alerts fired or pending during the upgrade:
      
      alert ClusterOperatorDown fired for 1290 seconds with labels: {endpoint="metrics", instance="10.0.184.18:9099", job="cluster-version-operator", name="monitoring", namespace="openshift-cluster-version", pod="cluster-version-operator-6f9db9dd74-bfnzr", service="cluster-version-operator", severity="critical", version="4.10.12"}
      alert KubePodNotReady fired for 1650 seconds with labels: {namespace="openshift-monitoring", pod="alertmanager-main-0", severity="warning"}
      alert KubeStatefulSetReplicasMismatch fired for 1170 seconds with labels: {container="kube-rbac-proxy-main", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning", statefulset="alertmanager-main"}
      
      github.com/openshift/origin/test/extended/util/disruption.(*chaosMonkeyAdapter).Test(0xc001afb0e0, 0xc003111068)
      	github.com/openshift/origin/test/extended/util/disruption/disruption.go:192 +0x32f
      k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do.func1()
      	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:90 +0x6a
      created by k8s.io/kubernetes/test/e2e/chaosmonkey.(*Chaosmonkey).Do
      	k8s.io/kubernetes@v1.23.0/test/e2e/chaosmonkey/chaosmonkey.go:87 +0x8c}
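
      For reference, a firing alert like this can be cross-checked against the in-cluster Prometheus while the cluster (or a reproducer) is still up. The sketch below is not part of the origin test; it queries the built-in ALERTS series for ClusterOperatorDown through the Prometheus HTTP API. The PROM_URL and PROM_TOKEN environment variables are assumptions standing in for the thanos-querier route and a bearer token (e.g. from 'oc whoami -t').

      // checkalert.go - minimal sketch (not the origin test code): query the
      // in-cluster Prometheus for the ClusterOperatorDown alert series.
      // PROM_URL and PROM_TOKEN are assumed env vars pointing at the
      // thanos-querier route and a valid bearer token.
      package main

      import (
          "crypto/tls"
          "encoding/json"
          "fmt"
          "net/http"
          "net/url"
          "os"
      )

      type promResponse struct {
          Status string `json:"status"`
          Data   struct {
              Result []struct {
                  Metric map[string]string `json:"metric"`
              } `json:"result"`
          } `json:"data"`
      }

      func main() {
          promURL := os.Getenv("PROM_URL") // e.g. https://<thanos-querier route>
          token := os.Getenv("PROM_TOKEN") // e.g. output of `oc whoami -t`

          // ALERTS is a built-in Prometheus series with one sample per active alert.
          query := `ALERTS{alertname="ClusterOperatorDown",alertstate="firing"}`

          req, err := http.NewRequest("GET",
              promURL+"/api/v1/query?query="+url.QueryEscape(query), nil)
          if err != nil {
              panic(err)
          }
          req.Header.Set("Authorization", "Bearer "+token)

          // The router cert is usually not in the local trust store; skipping
          // verification is acceptable only for a quick manual check.
          client := &http.Client{Transport: &http.Transport{
              TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
          }}
          resp, err := client.Do(req)
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var pr promResponse
          if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
              panic(err)
          }
          if len(pr.Data.Result) == 0 {
              fmt.Println("ClusterOperatorDown is not firing")
              return
          }
          for _, r := range pr.Data.Result {
              fmt.Printf("firing: name=%s namespace=%s severity=%s\n",
                  r.Metric["name"], r.Metric["namespace"], r.Metric["severity"])
          }
      }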

      The pod mentioned above is the CVO pod, and you can see the period it was down in the
      intervals graph at the top of the Prow job page. It was down for ~20m, which was before
      the node-reboot part of the upgrade started. Looking at the 'oc get pods' output from the
      gather-extra artifacts, you can see the new (upgraded) CVO pod is up with no restarts.
      So something was probably breaking/broken during the initial upgrade process.
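
      For anyone checking this live on a reproducer (rather than digging through gather-extra), the rough equivalent of 'oc get pods -n openshift-cluster-version' can be scripted with client-go. This is only a sketch: the kubeconfig path handling is an assumption and there is no error handling beyond the basics.

      // cvopods.go - minimal sketch of what 'oc get pods -n openshift-cluster-version'
      // shows: pod phase and container restart counts for the CVO.
      // Assumes a kubeconfig via $KUBECONFIG or ~/.kube/config.
      package main

      import (
          "context"
          "fmt"
          "os"
          "path/filepath"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          kubeconfig := os.Getenv("KUBECONFIG")
          if kubeconfig == "" {
              home, _ := os.UserHomeDir()
              kubeconfig = filepath.Join(home, ".kube", "config")
          }

          cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
          if err != nil {
              panic(err)
          }
          client, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              panic(err)
          }

          pods, err := client.CoreV1().Pods("openshift-cluster-version").
              List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }

          // Sum restart counts across containers, mirroring the RESTARTS column.
          for _, pod := range pods.Items {
              restarts := int32(0)
              for _, cs := range pod.Status.ContainerStatuses {
                  restarts += cs.RestartCount
              }
              fmt.Printf("%s\tphase=%s\trestarts=%d\n", pod.Name, pod.Status.Phase, restarts)
          }
      }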

      Link to this job's TestGrid for reference.

       

              Assignee: Arkadeep Sen (rh-ee-arsen)
              Reporter: Jamo Luhrsen (jluhrsen)