OCP Technical Release Team / TRT-1143

Mass job failures due to KubeletHealthState alert firing in openshift-machine-config-operator ns


    • Type: Story
    • Resolution: Done
    • Priority: Blocker

      The failures began sometime today or late yesterday. They affect multiple clouds and are blocking payloads, with most jobs failing.

      Example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn/1678963706211340288

       [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
      Run #0: Failed (1m0s)
      {  "service": "machine-config-daemon",
                "severity": "warning"
              },
              "value": [
                1689134505.447,
                "1"
              ]
            },
            {
              "metric": {
                "__name__": "ALERTS",
                "alertname": "KubeletHealthState",
                "alertstate": "firing",
                "container": "oauth-proxy",
                "endpoint": "metrics",
                "instance": "10.0.0.8:9001",
                "job": "machine-config-daemon",
                "namespace": "openshift-machine-config-operator",
                "node": "ci-op-fbh5bhvb-ed2ea-xdsvq-master-1",
                "pod": "machine-config-daemon-f5wmt",
                "prometheus": "openshift-monitoring/k8s",
                "service": "machine-config-daemon",
                "severity": "warning"
              },
              "value": [
                1689134505.447,
                "1"
              ]
            },
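
For triage, the check this test performs can be approximated offline against the samples in the failure output. The sketch below is not the test's actual implementation; the function name `unexpected_firing_alerts` and the `SAMPLES` list are hypothetical. The first sample is trimmed from the output above and the Watchdog sample is invented to show the allow-list in action.

```python
import json

# Alerts the conformance test tolerates, taken from the test name above.
ALLOWED_ALERTS = {"Watchdog", "AlertmanagerReceiversNotConfigured"}


def unexpected_firing_alerts(samples):
    """Return (alertname, namespace, node) for each firing alert that is not allowed."""
    found = []
    for sample in samples:
        metric = sample.get("metric", {})
        if metric.get("__name__") != "ALERTS":
            continue
        if metric.get("alertstate") != "firing":
            continue
        if metric.get("alertname") in ALLOWED_ALERTS:
            continue
        found.append(
            (metric.get("alertname"), metric.get("namespace"), metric.get("node"))
        )
    return found


# First sample: trimmed copy of the failure output.
# Second sample: hypothetical, added only to exercise the allow-list.
SAMPLES = json.loads("""
[
  {"metric": {"__name__": "ALERTS", "alertname": "KubeletHealthState",
              "alertstate": "firing",
              "namespace": "openshift-machine-config-operator",
              "node": "ci-op-fbh5bhvb-ed2ea-xdsvq-master-1",
              "severity": "warning"},
   "value": [1689134505.447, "1"]},
  {"metric": {"__name__": "ALERTS", "alertname": "Watchdog",
              "alertstate": "firing",
              "namespace": "openshift-monitoring",
              "severity": "none"},
   "value": [1689134505.447, "1"]}
]
""")

for alertname, namespace, node in unexpected_firing_alerts(SAMPLES):
    print(f"{alertname} firing in {namespace} on {node}")
```

Running the same filter over the full output of several failed jobs would show whether KubeletHealthState is the only unexpected alert and which nodes report it.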
      

      Being discussed here: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1689169944865979

      At present we can find no changes in this payload that were not also in previous payloads that did not exhibit the issue.
      The new RHCOS version seems fine in CI payloads.
      Problem began surfacing for MCO in their presubmits a few days prior.

              rhn-engineering-dgoodwin Devan Goodwin