Type: Epic
Resolution: Unresolved
Priority: Normal
Fix Version/s: mto-1.2
Affects Version/s: None
Component/s: Multiarch-Tuning-Operator
Labels:
None

Epic Name:
Alert for Exec Format Error
Epic Status:
In Progress
Activity Type:
None
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:

None
Ready:
False
Color Status:
Green
Status Summary:
[2025/07/25] <Green> Green for Release

The main feature is now almost dev complete.

[2025/07/03] <GREEN> Green for release
[2025/06/18] <GREEN> Green for release
[2025/06/11] <GREEN> Green for release

Development is in progress and green.
Size:
L

Target Version:

mto-1.2
Release Blocker:
None

Epic Goal

Expose a cluster-level alert that reports any exec format error occurring on the cluster's nodes.

Why is this important?

Exec format errors often arise in multi-arch environments when nodes whose architecture differs from the main one (historically x86) run pods from multi-arch images that store binaries built for an architecture other than the one declared in the manifest. These failures may put pods into CrashLoopBackOff, or they may cause partial failures that are harder to discover and debug, even though the images report support for the architecture on which they are running.

Design document

https://github.com/openshift/multiarch-tuning-operator/blob/main/docs/enhancements/MTO-0004-enoexec-monitoring.md

Scenarios
1. As a cluster administrator migrating a cluster from x86 to multi-arch, I want to detect any "exec format error" failures caused by pods being scheduled on arm64 nodes, so that a mitigation can be applied to the workload until developers can be notified and fix the container file.

Acceptance Criteria

An alert and a metric are exposed at cluster level to report any exec format error.
An event is recorded in the pod to inform about the occurrence of the exec format error.

Dependencies (internal and external)
1. Should this go in a core component or also in other operators?

Previous Work (Optional):

The discussion in https://issues.redhat.com/browse/MULTIARCH-4253?focusedId=24848869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24848869 concluded that the multiarch-tuning-operator is a candidate for this kind of task, which aims at improving the UX for users onboarding multi-arch compute workers in OCP.

The following command attaches a BPF probe to a node's kernel to detect exec format errors (execve returning -8, that is, -ENOEXEC) and print the PID of the parent of the failing process. We can use BPF in a daemonset to detect the PID of the parent of the process that triggered an exec format error. The daemonset can then retrieve the container name and the pod, and produce a metric and an alert for the cluster-monitoring-operator.

# bpftrace -e 'tracepoint:syscalls:sys_exit_execve /args->ret == -8/ { printf("Exec format error detected for PPID %d\n", curtask->real_parent->pid); }'
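
As a rough illustration of the step that follows the probe, the sketch below parses the line printed by the one-liner above and extracts a container ID from the text of /proc/<pid>/cgroup. It is not the operator's implementation: the helper names (parse_ppid, container_id_from_cgroup, read_cgroup) are hypothetical, and the assumption that the container ID appears as a 64-character hex string in the cgroup path holds for common CRI-O and containerd layouts but may differ on other runtimes.

```python
import re
from typing import Optional

# Matches the message printed by the bpftrace one-liner above.
PPID_LINE = re.compile(r"Exec format error detected for PPID (\d+)")

# Assumption: the container ID is a 64-character hex string embedded in the
# cgroup path (for example "crio-<id>.scope"); the layout varies by runtime.
CONTAINER_ID = re.compile(r"([0-9a-f]{64})")


def parse_ppid(line: str) -> Optional[int]:
    """Return the PPID reported in a probe output line, or None if absent."""
    match = PPID_LINE.search(line)
    return int(match.group(1)) if match else None


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first container ID found in the content of a cgroup file."""
    for entry in cgroup_text.splitlines():
        match = CONTAINER_ID.search(entry)
        if match:
            return match.group(1)
    return None


def read_cgroup(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Read /proc/<pid>/cgroup; return None if the process already exited."""
    try:
        with open(f"{proc_root}/{pid}/cgroup", encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return None
```

A daemonset agent could chain these helpers (parse_ppid, then read_cgroup, then container_id_from_cgroup) and ask the container runtime or the API server for the pod owning that container ID before emitting the metric and the pod event. Because the parent process may exit before its cgroup file is read, read_cgroup returns None instead of raising.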

Open questions:
1. Should this land in a core component or in other operators too?

Done Checklist

CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
Release Enablement: <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
QE - Automated tests merged: <link or reference to automated tests>
QE - QE to verify documentation when testing
DOC - Downstream documentation merged: <link to meaningful PR>
All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.

links to

openshift/multiarch-tuning-operator#566: MULTIARCH-5010: enoexec monitoring

outrigger-project/multiarch-tuning-operator#643: Partially move Pod model in pkg/models

outrigger-project/multiarch-tuning-operator#669: MULTIARCH-5010: Minor fixes

Assignee:: Alessandro Di Stefano

Reporter:: Alessandro Di Stefano

Contributors:: None

QA Contact:: None

Doc Contact:: None

Votes:: 0

Watchers:: 4

Created:: 2024/09/19 1:57 PM

Updated:: 2025/10/08 5:45 AM

Target start:: 2025/04/13
