OCP Technical Release Team / TRT-636

Investigate why alert shows about 1 hour difference in spyglass chart vs failure log

    • Type: Story
    • Resolution: Not a Bug

      In [this job|https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/1331/pull-ci-openshift-ovn-kubernetes-master-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade/1583427008199659520], the spyglass chart matches these intervals from the events file:

      $ cat e2e-events_20221021-130143.json |jq '.items[]|select(.locator|test("ExtremelyHighIndividualControlPlaneCPU"))'
      {
        "level": "Info",
        "locator": "alert/ExtremelyHighIndividualControlPlaneCPU node/ip-10-0-165-15.ec2.internal ns/openshift-kube-apiserver",
        "message": "ALERTS{alertname=\"ExtremelyHighIndividualControlPlaneCPU\", alertstate=\"pending\", instance=\"ip-10-0-165-15.ec2.internal\", namespace=\"openshift-kube-apiserver\", prometheus=\"openshift-monitoring/k8s\", severity=\"critical\"}",
        "from": "2022-10-21T13:21:33Z",
        "to": "2022-10-21T13:26:33Z"
      }
      {
        "level": "Warning",
        "locator": "alert/ExtremelyHighIndividualControlPlaneCPU node/ip-10-0-165-15.ec2.internal ns/openshift-kube-apiserver",
        "message": "ALERTS{alertname=\"ExtremelyHighIndividualControlPlaneCPU\", alertstate=\"firing\", instance=\"ip-10-0-165-15.ec2.internal\", namespace=\"openshift-kube-apiserver\", prometheus=\"openshift-monitoring/k8s\", severity=\"warning\"}",
        "from": "2022-10-21T13:26:33Z",
        "to": "2022-10-21T13:39:01Z"
      }
       

      yet the job shows:

      : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 1h23m28s {  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:197]: Oct 21 14:23:24.339: Unexpected alerts fired or pending during the upgrade:
      
      alert ExtremelyHighIndividualControlPlaneCPU fired for 750 seconds with labels: {instance="ip-10-0-165-15.ec2.internal", namespace="openshift-kube-apiserver", severity="warning"}
      Ginkgo exit error 1: exit with code 1} 

      That is, the junit XML puts the alert firing at 13:26:33, while the Prow output is stamped 14:23:24, roughly an hour later. I suspect 13:26:33 is closer to when the alert actually fired, because 14:23 falls at the very end of the chart.
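
The arithmetic behind that comparison can be checked with a short Python sketch. The timestamps are copied by hand from the jq output and the failure message above; the year and the UTC zone of the failure timestamp are assumptions, since the log line states neither.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse a timestamp like 2022-10-21T13:26:33Z as a UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# "from" and "to" of the interval with alertstate="firing" in the event JSON.
firing_from = parse_utc("2022-10-21T13:26:33Z")
firing_to = parse_utc("2022-10-21T13:39:01Z")

# Timestamp in the failure message ("Oct 21 14:23:24.339").
# Assumption: same year as the events and expressed in UTC.
failure_logged = datetime(2022, 10, 21, 14, 23, 24, 339000, tzinfo=timezone.utc)

# Length of the firing interval, to compare with "fired for 750 seconds".
firing_seconds = (firing_to - firing_from).total_seconds()

# Distance between the start of firing and the failure timestamp.
gap = failure_logged - firing_from

print(f"firing interval: {firing_seconds:.0f} s")
print(f"gap from firing start to failure timestamp: {gap}")
```

The firing interval comes out at 748 seconds, close to the 750 seconds in the failure message, and the gap is just under 57 minutes. One reading consistent with these numbers, though not confirmed by the logs shown here, is that 14:23:24 marks when the test reported its result rather than when the alert fired.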

              Assignee: Unassigned
              Reporter: Dennis Periquet (dperique@redhat.com)