Epic
Resolution: Unresolved
Alert for Exec Format Error
Epic Goal
- Expose a cluster-level alert that reports any exec format error occurring on the nodes.
Why is this important?
- Exec format errors often arise in multi-arch environments when nodes whose architecture differs from the main one (historically x86) run pods from multi-arch images that store binaries built for a different architecture than the one declared in the manifest. These failures may cause pods to enter CrashLoopBackOff, or they may cause partial failures that are harder to discover and debug, even though the images report supporting the architecture of the node they run on.
Design document
Scenarios
1. As a cluster administrator migrating a cluster from x86 to multi-arch, I want to detect any "exec format error" failures caused by pods being scheduled on arm64 nodes, so that a mitigation can be applied to the workload until developers are notified and fix the container file.
Acceptance Criteria
- A metric and an alert are exposed at the cluster level to notify about any exec format error.
- An event is recorded in the pod to inform about the occurrence of the exec format error.
Dependencies (internal and external)
1. Should this go in a core component or also in other operators?
Previous Work (Optional):
- https://issues.redhat.com/browse/MULTIARCH-4253?focusedId=24848869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-24848869 concluded that the multiarch-tuning-operator is a candidate for this kind of task, which aims to improve the UX for users onboarding multi-arch compute workers in OCP.
- The following command attaches a BPF probe in a node's kernel to detect exec format errors (execve calls returning -8, i.e. -ENOEXEC) and print the PPID of the failing process. We can use BPF in a DaemonSet to detect the PID of the parent of the process that hit the exec format error. The DaemonSet can then retrieve the container name and the pod, and produce a metric and an alert for the cluster-monitoring-operator.
# bpftrace -e 'tracepoint:syscalls:sys_exit_execve /args->ret == -8/ { printf("Exec format error detected for PPID %d\n", curtask->real_parent->pid); }'
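The step from PPID to container could look roughly like the sketch below. It parses the line printed by the bpftrace one-liner above, then extracts a container ID from the parent's `/proc/<pid>/cgroup` entry. It assumes the container runtime embeds a 64-character hexadecimal container ID in the cgroup path (as in `crio-<id>.scope`); all function names are hypothetical, and the DaemonSet would need access to the host's `/proc`.

```python
import re
from typing import Optional

# Matches the message printed by the bpftrace one-liner.
PPID_LINE = re.compile(r"Exec format error detected for PPID (\d+)")

# Assumption: the runtime puts a 64-hex-character container ID in the cgroup path.
CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def parse_ppid(line: str) -> Optional[int]:
    """Return the PPID reported in a bpftrace output line, if any."""
    match = PPID_LINE.search(line)
    return int(match.group(1)) if match else None


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Extract a container ID from the contents of /proc/<pid>/cgroup."""
    for entry in cgroup_text.splitlines():
        # Each entry has the form hierarchy-ID:controller-list:cgroup-path.
        parts = entry.split(":", 2)
        if len(parts) != 3:
            continue
        ids = CONTAINER_ID.findall(parts[2])
        if ids:
            return ids[-1]  # the innermost path component names the container
    return None


def container_id_for_pid(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Look up the container ID of a PID; None if the process is gone or not containerized."""
    try:
        with open(f"{proc_root}/{pid}/cgroup") as cgroup_file:
            return container_id_from_cgroup(cgroup_file.read())
    except OSError:
        return None
```

Resolving the container ID to a container name and pod would still require a query to the container runtime or the Kubernetes API, which this sketch leaves out. The parent process may also have exited by the time the lookup runs, so a `None` result has to be tolerated.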
Open questions:
1. Should this land in a core component or in other operators too?
Done Checklist
- CI - For new features (non-enablement), existing Multi-Arch CI jobs are not broken by the Epic
- Release Enablement: <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - If the Epic is adding a new stream, downstream build attached to advisory: <link to errata>
- QE - Test plans in Test Plan tracking software (e.g. Polarion, RQM, etc.): <link or reference to the Test Plan>
- QE - Automated tests merged: <link or reference to automated tests>
- QE - QE to verify documentation when testing
- DOC - Downstream documentation merged: <link to meaningful PR>
- All the stories, tasks, sub-tasks and bugs that belong to this epic need to have been completed and indicated by a status of 'Done'.