Task
Resolution: Done
Major
KONFLUX-4155 - SRE Enablement
Release Note Not Required
Pipelines Sprint Pioneers 10, Pipelines Sprint Pioneers 11
There is no SOP linked to this alert, so one needs to be created.
Some of the existing SOPs also need to be improved. The main improvement points are:
- SOPs should have actionable steps that SREs can follow to avoid immediate escalation to engineering.
- Any referenced resources should be linked, or instructions on how to find them should be provided (in this case, Grafana dashboards and panels).
Please provide a step-by-step process for resolving the alert in the SOP. If we have no experience of this alert firing and therefore can't provide exact resolution steps, please provide guidance on debugging the issue. For example:
- Which namespace(s) should SRE look at?
- Which resource(s) should SRE look at (pods, deployments, services, etc.)?
- Which logs should SRE look at?
  - Provide information on which pod logs should be investigated. If necessary, provide instructions for querying Splunk for logs.
  - Also consider which logs and resources should be collected for further investigation. After the alert resolves, there may still be an investigation into the cause, for which SRE will reach out to engineering. What resources would engineering want to have to investigate after the incident? Consider the oc adm inspect command.
- Are there any components that can be safely restarted which might resolve the issue?
  - How? By deleting pods or by a rollout restart of deployments?
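As an illustration of the level of detail being asked for, the questions above could map to a short command sequence like the one sketched below. This is only a sketch: the namespace and deployment names are hypothetical placeholders, the one-hour log window and the output directory are arbitrary choices, and the script only prints the oc commands rather than running them. The real SOP should name the actual namespace, resources, and whether a restart is safe.

```python
import shlex

# Hypothetical placeholders: the real values depend on the alert and component.
NAMESPACE = "example-namespace"
DEPLOYMENT = "example-deployment"


def triage_commands(namespace, deployment, dest_dir="inspect-output"):
    """Return oc command lines (as argument lists) for a first-pass triage."""
    return [
        # Which resources exist in the namespace, and what state are they in?
        ["oc", "get", "pods,deployments,services", "-n", namespace],
        # Recent logs from one pod of the deployment (last hour, arbitrary window).
        ["oc", "logs", f"deployment/{deployment}", "-n", namespace, "--since=1h"],
        # Collect resources and logs for engineering before changing anything.
        ["oc", "adm", "inspect", f"ns/{namespace}", f"--dest-dir={dest_dir}"],
        # Restart only if the SOP marks this component as safe to restart.
        ["oc", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
    ]


if __name__ == "__main__":
    # Dry run: print the commands so they can be reviewed before use.
    for cmd in triage_commands(NAMESPACE, DEPLOYMENT):
        print(shlex.join(cmd))
```

Running oc adm inspect before the restart keeps the pre-restart state available for the later investigation with engineering.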
Please also provide links to Grafana dashboards and/or panels. If the relevant dashboards are on the dataplane cluster and therefore can't be linked (as there might be multiple dashboards), please provide steps for finding the Grafana instance, e.g. which namespace it is in.
If necessary, provide additional details about what we are trying to determine from the Grafana dashboard. Assume this is the SRE's first time seeing this dashboard.
For example some SOPs say:
View grafana panel 'some panel' to verify whether this is a general issue or specific to the user.
How would SRE verify this?