OpenShift Pipelines / SRVKP-7343

When a Task Step container is OOMKilled, the OOM failure is not mentioned in the Task status


    • Release Note Text: TaskRuns that fail due to Out of Memory (OOM) conditions will now show the termination reason in their failure message.
    • Release Note Type: Bug Fix
    • Sprint: KONFLUX-💚Green-S283, Pipelines Sprint Tekshift 25, Pipelines Sprint Tekshift 26, Pipelines Sprint Tekshift 28, Pipelines Sprint Tekshift 29, Pipelines Sprint Tekshift 30

      Description of problem:

      When a Step in a Task fails due to OOM, causing the Task to fail, the OOM condition is not included in the Task failure message.

      This is something we already check for explicitly, but it appears there is a bug, so that code path is never reached.

       

      Since OOMKilled containers have `"Reason": "OOMKilled"` in their termination state, simply include the termination reason when extracting the container termination message.
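
      A minimal Go sketch of that idea (a hypothetical helper, not Tekton's actual code; it assumes k8s.io/api is on the module path): append the kubelet-reported termination reason, when present, to the message built from the exit code.

      package main

      import (
          "fmt"

          corev1 "k8s.io/api/core/v1"
      )

      // stepFailureMessage builds the failure message for a terminated step
      // container, appending the kubelet-reported reason (e.g. "OOMKilled")
      // when one accompanies the exit code.
      func stepFailureMessage(stepName string, term *corev1.ContainerStateTerminated) string {
          msg := fmt.Sprintf("%q exited with code %d", stepName, term.ExitCode)
          if term.Reason != "" {
              msg = fmt.Sprintf("%s: %s", msg, term.Reason)
          }
          return msg
      }

      func main() {
          oom := &corev1.ContainerStateTerminated{ExitCode: 137, Reason: "OOMKilled"}
          // Prints: "step-unnamed-0" exited with code 137: OOMKilled
          fmt.Println(stepFailureMessage("step-unnamed-0", oom))
      }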

       

      As a user of OpenShift Pipelines, I may have a Task that does not have an appropriate memory request configuration. Currently, if a Task fails because one of its steps' containers is OOMKilled, the Task fails with the message "<step-name> exited with code 137". This only indicates that the step container was killed by an external SIGKILL signal (137 = 128 + 9). In other words, even a knowledgeable user can only ascertain that the pod was killed by the kubelet rather than erroring for internal reasons. Without Kubernetes access to view the pod (before it is cleaned up) or observability (o11y) tooling such as Grafana, a user can only speculate about what caused the pod to be evicted, or whether it was even the pod's fault.
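
      For illustration only: while the step pod still exists, its container statuses already carry the OOM reason, so a user with cluster access can confirm the cause directly. A minimal client-go sketch (the namespace "default" and the pod name are placeholders, not values from this issue):

      package main

      import (
          "context"
          "fmt"
          "log"
          "os"
          "path/filepath"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the local kubeconfig.
          kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
          cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              log.Fatal(err)
          }

          // Placeholder namespace and pod name: substitute the TaskRun's pod.
          pod, err := client.CoreV1().Pods("default").Get(context.Background(), "stress-test-xxxxx-pod", metav1.GetOptions{})
          if err != nil {
              log.Fatal(err)
          }

          // Each terminated step container reports its exit code and reason,
          // e.g. "step-unnamed-0: exit code 137, reason OOMKilled".
          for _, cs := range pod.Status.ContainerStatuses {
              if term := cs.State.Terminated; term != nil {
                  fmt.Printf("%s: exit code %d, reason %s\n", cs.Name, term.ExitCode, term.Reason)
              }
          }
      }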

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      apiVersion: tekton.dev/v1
      kind: TaskRun
      metadata:
        generateName: stress-test-
      spec:
        computeResources:
          requests:
            memory: 64Mi
          limits:
            memory: 64Mi
        taskSpec:
          steps:
          - image: mirror.gcr.io/ubuntu
            script: |
              #!/usr/bin/env bash
              apt-get update
              apt-get -y install stress
              stress --vm 4 --vm-bytes 256M --timeout 300

      • With Tekton running in a Kubernetes cluster of any kind, create the above TaskRun using `kubectl create`
      • Wait for the TaskRun to fail
      • Get the status of the TaskRun using `tkn taskrun describe <taskrun-name>` and `tkn taskrun describe <taskrun-name> -o yaml`

       

      Actual results:

      The TaskRun message will be `"step-unnamed-0" exited with code 137`, whereas the YAML shows `status.steps[0].terminated.reason: "OOMKilled"`.

      Expected results:

      The TaskRun message should be something more along the lines of `"step-unnamed-0" exited with code 137: OOMKilled`.

      Reproducibility (Always/Intermittent/Only Once):

      Acceptance criteria: 

       

      Definition of Done:

      Build Details:

      Additional info (Such as Logs, Screenshots, etc):

       


              Assignee: rh-ee-shubbhar Shubham Bhardwaj
              Reporter: rh-ee-athorp Andrew Thorp