-
Task
-
Resolution: Done
-
EnVision Sprint 58, EnVision Sprint 59
Hi all,
ADR-046 was approved, Grafana dashboards were built, bonfire code changes were made, and we are now ready for app teams to make their necessary changes.
The goal of this effort is to define proper CPU/memory resource requests and limits when deploying app templates so that we can use Ephemeral cluster resources more efficiently. The audit and template changes you make here should also be helpful down the road for Stage and Production (we hope to run similar audits there in the future, because we tend to under-utilize CPU and memory in those environments too). Hopefully this lets us run fewer OpenShift nodes, which translates to money saved.
Once you implement this, bonfire will no longer strip resource configurations from your apps, which should lead to fewer surprise pod OOMs. At the same time, we'll no longer rely on the pesky "--no-remove-resources" CLI option as a workaround (it's a bad workaround because while it fixes a problem in one app component, it causes other components to request more CPU/memory than needed).
WHAT
We have created an ADR-046 Implementation Guide for teams to follow. The implementation guide provides all the in-depth details, but I'll reiterate the high-level overview here. This is what each app needs to do:
- Update your ClowdApp templates to use template parameters for all CPU/memory requests and limits. The parameter names need to match the naming format (examples are in the implementation guide).
- Open a temporary PR which deploys your app without resources stripped (i.e. using '--no-remove-resources') to invoke a smoke test that will "exercise" your app. Running the tests this way should help determine how much CPU/memory your apps realistically need during a PR check (examples are in the implementation guide).
- Use our ephemeral pod right-sizing Grafana dashboard to determine the proper resource values for your containers. The implementation guide explains how to use this dashboard. The goal is to observe how much CPU and memory your app actually consumes and to use those values when setting the parameters in your deployment config.
- Set the parameter values properly on your ephemeral deploy target in app-interface (examples are in the implementation guide).
- Remove usage of bonfire's --no-remove-resources option (if you were using it).
- After making your template changes and adding the parameters to your ephemeral deployment configuration, make sure that your PR checks continue to work as they did before.
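As a rough self-check for the first step above, the following sketch flags any CPU/memory request or limit in a template that is still hardcoded instead of referencing a template parameter. It is only an illustration under several assumptions: the template has already been parsed into a Python dict, resources live under spec.deployments[].podSpec.resources of each ClowdApp object, and a missing value is treated as a problem. The function name and the EXAMPLE_* parameter names are hypothetical; the required naming format is the one given in the implementation guide.

```python
import re

# OpenShift template parameter references have the form ${NAME} or ${{NAME}}.
PARAM_REF = re.compile(r"(\$\{\w+\}|\$\{\{\w+\}\})")


def find_hardcoded_resources(template):
    """Return (deployment, kind, resource) tuples that do not use a template parameter.

    Hypothetical helper; assumes resources sit under
    spec.deployments[].podSpec.resources of each ClowdApp object.
    """
    problems = []
    for obj in template.get("objects", []):
        if obj.get("kind") != "ClowdApp":
            continue
        for dep in obj.get("spec", {}).get("deployments", []):
            resources = dep.get("podSpec", {}).get("resources", {})
            for kind in ("requests", "limits"):
                for res in ("cpu", "memory"):
                    value = resources.get(kind, {}).get(res)
                    # Assumption: a missing value counts as a problem too.
                    if value is None or not PARAM_REF.fullmatch(str(value)):
                        problems.append((dep.get("name"), kind, res))
    return problems


# Illustrative template; parameter names are placeholders, not the real format.
example_template = {
    "kind": "Template",
    "objects": [{
        "kind": "ClowdApp",
        "spec": {"deployments": [{
            "name": "api",
            "podSpec": {"resources": {
                "requests": {"cpu": "${EXAMPLE_CPU_REQUEST}",
                             "memory": "${EXAMPLE_MEMORY_REQUEST}"},
                "limits": {"cpu": "${EXAMPLE_CPU_LIMIT}",
                           "memory": "512Mi"},  # still hardcoded
            }},
        }]},
    }],
}

print(find_hardcoded_resources(example_template))
# prints [('api', 'limits', 'memory')]
```

A check like this only confirms that values are parameterized; it does not validate the parameter names against the guide's naming format or the values set in app-interface.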
Note: Please update the status spreadsheet when you begin/finish work on this so we know how things are moving along across teams.
WHEN
We are asking for this to be completed by the end of Q3 (September 30, 2024). Please reach out if you do not think this is feasible.
WHO
Any app team that deploys to ephemeral environments. The status spreadsheet lists all components that currently have an ephemeral deployment config in app-interface. If you spot old/deprecated components please mark them as "Not Needed" and leave a comment.