• Alert for Exec Format Error
    • 0% To Do, 0% In Progress, 100% Done
    • Green

      [2025/07/25] <Green> Green for Release

      • The main feature is now almost dev complete.

      [2025/07/03] <GREEN> Green for release
      [2025/06/18] <GREEN> Green for release
      [2025/06/11] <GREEN> Green for release

      • Development is in progress and green.
    • L
    • None

      Epic Goal

      • To expose a cluster-level alert that reports any exec format error occurring on the nodes.

      Why is this important?

      • Exec format errors often arise in multi-arch environments when nodes of an architecture other than the primary one (historically x86) run pods from multi-arch images whose binaries were built for a different architecture than the one the manifest declares. These failures may put pods into CrashLoopBackOff, or they may cause partial failures that are harder to discover and debug, even though the images report support for the architecture they are running on.
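To make the failure mode concrete, the following minimal sketch reads the `e_machine` field of an ELF header and compares it with the architecture of the node. It is illustrative only and not part of the design: the function names and the architecture labels in `EM_NAMES` are assumptions chosen for this example, and only the header offsets and machine codes come from the ELF format.

```python
import struct
from typing import Optional

# e_machine codes defined by the ELF format; the labels are illustrative.
EM_NAMES = {21: "ppc64", 22: "s390", 62: "amd64", 183: "arm64"}


def elf_machine(header: bytes) -> Optional[str]:
    """Return the architecture encoded in an ELF header, or None if not ELF."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    # EI_DATA (byte 5) selects the byte order: 1 = little endian, 2 = big endian.
    fmt = "<H" if header[5] == 1 else ">H"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(fmt, header, 18)
    return EM_NAMES.get(machine, "unknown({})".format(machine))


def mismatches_node(header: bytes, node_arch: str) -> bool:
    """True when the binary is an ELF built for an architecture other than node_arch."""
    arch = elf_machine(header)
    return arch is not None and arch != node_arch


# Hypothetical 20-byte header of a 64-bit little-endian amd64 executable.
amd64_header = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<HH", 2, 62)
print(elf_machine(amd64_header), mismatches_node(amd64_header, "arm64"))
```

A binary for which `mismatches_node` returns true is the kind of file that would make `execve` fail with an exec format error on that node, unless an emulation layer is registered.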

      Design document

      https://github.com/openshift/multiarch-tuning-operator/blob/main/docs/enhancements/MTO-0004-enoexec-monitoring.md

      Scenarios
      1. As a cluster administrator migrating a cluster from x86 to multi-arch, I want to detect any "exec format error" failures caused by pods being scheduled on arm64 nodes, so that a mitigation can be applied to the workload until developers can be notified and fix the container file.

      Acceptance Criteria

      • An alert and a metric are raised at cluster level to notify about any exec format errors.
      • An event is recorded in the pod to inform about the occurrence of the exec format error.

      Dependencies (internal and external)
      1. Should this go in a core component or also in other operators?

      Previous Work (Optional):

      • The following command attaches a BPF probe to a node's kernel to detect exec format errors and print the PPID of the failing process. We can use BPF in a daemonset to detect the PID of the parent of the process that triggered an exec format error. The daemonset can then retrieve the container name and the pod, and produce a metric and an alert for the cluster-monitoring-operator.
      # bpftrace -e 'tracepoint:syscalls:sys_exit_execve /args->ret == -8/ { printf("Exec format error detected for PPID %d\n", curtask->real_parent->pid); }'
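The two steps that follow the probe can be sketched as below: recognising the `-8` return value as `ENOEXEC`, and mapping a PID to a container through the contents of its `/proc/<pid>/cgroup` file. This is a hedged sketch, not the daemonset's implementation. It assumes the container ID appears as a 64-character hexadecimal string in the cgroup path, which depends on the container runtime and cgroup driver, and the function names and the sample path are hypothetical.

```python
import errno
import re
from typing import Optional

# Assumption: the container ID shows up as 64 hex characters in the cgroup path.
CONTAINER_ID_RE = re.compile(r"([0-9a-f]{64})")


def is_exec_format_error(ret: int) -> bool:
    """True when an execve return value is -ENOEXEC (the -8 the probe filters on)."""
    return ret == -errno.ENOEXEC


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract the first container ID found in the text of /proc/<pid>/cgroup."""
    for line in cgroup_text.splitlines():
        match = CONTAINER_ID_RE.search(line)
        if match:
            return match.group(1)
    return None


# Hypothetical cgroup v2 entry for a process running inside a container.
sample = "0::/kubepods.slice/kubepods-pod1234.slice/crio-" + "ab" * 32 + ".scope\n"
print(is_exec_format_error(-errno.ENOEXEC), container_id_from_cgroup(sample))
```

With the container ID in hand, the pod and container name would still have to be resolved through the container runtime or the Kubernetes API; that lookup is out of scope for this sketch.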
      

      Open questions:
      1. Should this land in a core component or in other operators too?

      Done Checklist

      • CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
      • Release Enablement: <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
      • QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
      • QE - Automated tests merged: <link or reference to automated tests>
      • QE - QE to verify documentation when testing
      • DOC - Downstream documentation merged: <link to meaningful PR>
      • All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

              rhn-support-adistefa Alessandro Di Stefano