OpenShift Bugs / OCPBUGS-8938

alert TargetDown fired for XXX seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}


    • Important
    • None
    • MCO Sprint 240, MCO Sprint 241
    • 2
    • Rejected
    • Unspecified

      From https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376:

      ```
      alert TargetDown fired for 13 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}
      ```

      Checking kubelet logs for all the nodes:
      ```
      Aug 07 10:11:49.788245 libvirt-ppc64le-1-1-9-kfv8v-master-0 crio[1244]: time="2021-08-07 10:11:49.788169211Z" level=info msg="Started container dd7e2473c51870c1894531af9a3935b907340a31216f85c32e391bddf22d7fd0: openshift-machine-config-operator/machine-config-daemon-7r2bb/machine-config-daemon" id=15456b41-39c9-41ce-8f10-71398df6dd26 name=/runtime.v1alpha2.RuntimeService/StartContainer
      Aug 07 10:11:49.265439 libvirt-ppc64le-1-1-9-kfv8v-master-1 crio[1242]: time="2021-08-07 10:11:49.264443242Z" level=info msg="Created container 0651d7904d63a3f2c1fa9177d2ccf890c8fc769e96c836074aa8cc28a8bd7e04: openshift-machine-config-operator/machine-config-daemon-pk29l/machine-config-daemon" id=a622e284-7d45-4b72-b271-c39081c2c77a name=/runtime.v1alpha2.RuntimeService/CreateContainer
      Aug 07 10:11:49.602420 libvirt-ppc64le-1-1-9-kfv8v-master-2 crio[1243]: time="2021-08-07 10:11:49.602359290Z" level=info msg="Started container 5a24f464210595cd394aacd4e98903a196d67762a53d764bd6f4a6010cc17acf: openshift-machine-config-operator/machine-config-daemon-69fw6/machine-config-daemon" id=89b0650c-741e-4c61-ab49-f68aa82cb302 name=/runtime.v1alpha2.RuntimeService/StartContainer
      Aug 07 10:15:54.666525 libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw crio[1252]: time="2021-08-07 10:15:54.666233168Z" level=info msg="Started container 8ba32989af629e00c35578c51e9b5612ca8ddcf97b32f2b500d777a6eb2ff2e1: openshift-machine-config-operator/machine-config-daemon-5tb88/machine-config-daemon" id=4fa0e2ba-54aa-41a8-ab7b-7a3b6f6a9998 name=/runtime.v1alpha2.RuntimeService/StartContainer
      Aug 07 10:16:14.170188 libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7 crio[1235]: time="2021-08-07 10:16:14.170137303Z" level=info msg="Started container 78d933af1e7100050332b1df62e67d1fc71ca735c7a7d3c060411f61f32a0c74: openshift-machine-config-operator/machine-config-daemon-k6l8w/machine-config-daemon" id=c344fd94-abeb-4393-87f3-5bcaba21d45f name=/runtime.v1alpha2.RuntimeService/StartContainer
      ```
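To compare these entries against the test start time mechanically, the log lines can be parsed. The following is a minimal sketch, assuming the CRI-O journal format shown above; the regular expression and the `parse_crio_line` helper are illustrative and not part of any OpenShift tooling.

```python
import re
from datetime import datetime, timezone

# Extracts the CRI-O timestamp, the event type, and the
# namespace/pod/container path from one journal line.
LINE_RE = re.compile(
    r'time="(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+Z".*'
    r'msg="(?P<event>Started|Created) container [0-9a-f]+: '
    r'(?P<namespace>[^/]+)/(?P<pod>[^/]+)/(?P<container>[^"]+)"'
)


def parse_crio_line(line):
    """Return a dict describing the container event, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # Sub-second precision is dropped; it is not needed for this comparison.
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "event": m["event"],
        "namespace": m["namespace"],
        "pod": m["pod"],
        "container": m["container"],
        "time": ts,
    }
```

Each parsed `time` can then be compared with `datetime(2021, 8, 7, 10, 28, tzinfo=timezone.utc)` to check that the container event precedes the test start.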

      All containers started before the test suite began (before 2021-08-07T10:28:00Z, see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/build-log.txt). Checking https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.9-ocp-jenkins-e2e-remote-libvirt-ppc64le/1423947091704549376/artifacts/ocp-jenkins-e2e-remote-libvirt-ppc64le/gather-libvirt/artifacts/pods.json:

      ```
      machine-config-daemon-5tb88_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-gddxw, 0 restarts, ready since 2021-08-07T10:16:07Z
      machine-config-daemon-k6l8w_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-worker-0-p76x7, 0 restarts, ready since 2021-08-07T10:16:14Z
      machine-config-daemon-69fw6_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-2, 0 restarts, ready since 2021-08-07T10:11:49Z
      machine-config-daemon-pk29l_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-1, 0 restarts, ready since 2021-08-07T10:11:49Z
      machine-config-daemon-7r2bb_machine-config-daemon.log: assigned to libvirt-ppc64le-1-1-9-kfv8v-master-0, 0 restarts, ready since 2021-08-07T10:11:49Z
      ```

      All containers have been running since they were created and have never restarted.
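A summary like the one above can be produced from `pods.json` with a short script. This is a sketch under two assumptions: the file is a standard Kubernetes `PodList`, and "ready since" corresponds to the `lastTransitionTime` of the pod's `Ready` condition. The helper names `summarize_pods` and `load_pod_list` are hypothetical.

```python
import json


def summarize_pods(pod_list, container_name="machine-config-daemon"):
    """Return node, restart count, and Ready time for each matching container."""
    rows = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        # Assumption: "ready since" is the Ready condition's last transition.
        ready_since = next(
            (c.get("lastTransitionTime")
             for c in status.get("conditions", [])
             if c.get("type") == "Ready" and c.get("status") == "True"),
            None,
        )
        for cs in status.get("containerStatuses", []):
            if cs.get("name") == container_name:
                rows.append({
                    "pod": pod.get("metadata", {}).get("name"),
                    "node": pod.get("spec", {}).get("nodeName"),
                    "restarts": cs.get("restartCount", 0),
                    "ready_since": ready_since,
                })
    return rows


def load_pod_list(path):
    """Load a PodList JSON document from a local file."""
    with open(path) as f:
        return json.load(f)
```

With the artifact downloaded locally, `summarize_pods(load_pod_list("pods.json"))` would yield one row per `machine-config-daemon` container.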

      The incident (alert TargetDown fired for 13 seconds) occurred at August 7, 2021 10:33:18 AM. The test suite finished 2021-08-07T10:33:40Z.

      Based on the TargetDown definition (see https://github.com/openshift/cluster-monitoring-operator/blob/001eccd81ff51af0ed7a9d463dd35bfa9b75d102/assets/cluster-monitoring-operator/prometheus-rule.yaml#L16-L28):
      ```
      - alert: TargetDown
        annotations:
          description: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service
            }} targets in {{ $labels.namespace }} namespace have been unreachable for
            more than 15 minutes. This may be a symptom of network connectivity issues,
            down nodes, or failures within these components. Assess the health of the
            infrastructure and nodes running these targets and then contact support.'
          summary: Some targets were not reachable from the monitoring server for an
            extended period of time.
        expr: |
          100 * (count(up == 0 unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service) /
            count(up unless on (node) max by (node) (kube_node_spec_unschedulable == 1)) BY (job, namespace, service)) > 10
        for: 15m
      ```
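To make the `expr` easier to reason about, here is a rough Python model of what it computes: the percentage of down targets per (job, namespace, service) group, ignoring targets on unschedulable nodes, and keeping only groups above 10%. It does not model the `for: 15m` clause, and the sample data in the usage note is hypothetical, since the report does not say which target was down.

```python
from collections import defaultdict


def target_down_percent(targets, unschedulable_nodes=frozenset(), threshold=10.0):
    """Approximate the TargetDown expression over a list of scrape targets.

    Each target is a dict with keys: job, namespace, service, node, up.
    Returns {(job, namespace, service): percent_down} for groups above threshold.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for t in targets:
        if t["node"] in unschedulable_nodes:
            continue  # mirrors `unless on (node) ... kube_node_spec_unschedulable == 1`
        key = (t["job"], t["namespace"], t["service"])
        total[key] += 1
        if t["up"] == 0:
            down[key] += 1
    result = {}
    for key, n_down in down.items():
        percent = 100.0 * n_down / total[key]
        if percent > threshold:
            result[key] = percent
    return result
```

For a cluster with five `machine-config-daemon` targets, as in this job, a single unreachable target gives 20%, which is above the 10% threshold.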

      The machine-config-daemon target was down for at least 15m13s. The test suite had been running for only ~5m18s when the incident occurred (10:33:18 - 10:28:00), so the target was already down before the test suite started.
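The timeline can be checked with a few lines of Python. This is a sketch that assumes the incident timestamp is in UTC and marks the end of the 13-second firing interval; if it instead marks the start of firing, only the 15-minute `for` duration should be subtracted. The function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# Timestamps taken from the report (incident time assumed to be UTC).
suite_start = datetime(2021, 8, 7, 10, 28, 0, tzinfo=UTC)
incident = datetime(2021, 8, 7, 10, 33, 18, tzinfo=UTC)

for_clause = timedelta(minutes=15)  # `for: 15m` in the rule
fired_for = timedelta(seconds=13)   # "fired for 13 seconds"


def latest_down_start(incident_time, pending, firing):
    """Latest moment the target can have gone down for the alert to fire."""
    return incident_time - pending - firing


down_since = latest_down_start(incident, for_clause, fired_for)
suite_elapsed = incident - suite_start   # how long the suite had run
down_before_suite = suite_start - down_since

print("target down no later than:", down_since.isoformat())
print("suite running at incident:", suite_elapsed)
print("down before suite start:  ", down_before_suite)
```

Under these assumptions the target went down no later than 10:18:05, roughly ten minutes before the suite began.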

      This pattern repeats in other jobs as well; see:
      https://search.ci.openshift.org/?search=alert+TargetDown+fired+for+.*+seconds+with+labels%3A+%5C%7Bjob%3D%22machine-config-daemon%22%2C+namespace%3D%22openshift-machine-config-operator%22%2C+service%3D%22machine-config-daemon%22%2C+severity%3D%22warning%22%5C%7D&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

              cdoern@redhat.com Charles Doern
              jchaloup@redhat.com Jan Chaloupka
              Zhanqi Zhao