Epic Goal
Policy currently supports configuring evaluationInterval separately for the compliant and non-compliant states. In deployments where CPU use must be minimized, it is typical to set the compliant evaluationInterval to a relatively large value (e.g. 10m or longer). At steady state this works well: it minimizes CPU use while maintaining reasonable compliance checking. However, during certain windows of time this longer evaluation interval can cause significant delays or errors in pushing out configuration changes, because the policy is not re-evaluated until the interval expires. Some examples where this is an issue:
- A system with a set of policies in a dependency relationship such that they are evaluated 1 --> 2 --> 3, where the compliance of policy 2 can be affected by changes made by policy 1. For example, policy 1 updates a CatalogSource and policy 2 watches the status of an operator subscription which reads from that catalog. When policy 1 makes the update, it can be up to 10m (the compliant evaluationInterval of policy 2) before policy 2 is marked non-compliant and remediated.
- A similar system with multiple policies (dependency not required) where a change made by one policy affects the compliance of another (the same CatalogSource/operator status example applies). During the timeframe where policy 1 has made the "spec" change but policy 2 has not yet been re-evaluated (up to 10m in this example), the system shows all 3 policies as compliant. This is a long period of falsely indicated compliance caused by the long evaluationInterval.
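The delay in these examples can be estimated with a small calculation. The following is a minimal sketch, not part of any existing tooling: the function names are hypothetical, and it assumes each downstream policy is re-evaluated only when its compliant evaluationInterval expires, with the change arriving just after the previous evaluation (the worst case). Remediation time is ignored.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(text: str) -> int:
    """Convert a duration such as '10m', '10s', or '1h30m' into seconds."""
    parts = re.findall(r"(\d+)([smh])", text)
    # Reject anything that is not purely a sequence of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def worst_case_delay(downstream_intervals: list[str]) -> int:
    """Upper bound, in seconds, for a change made by the first policy to
    surface through every downstream policy in the chain (assumption:
    evaluation happens only when each compliant interval expires)."""
    return sum(parse_interval(i) for i in downstream_intervals)


# Chain 1 --> 2 --> 3 where policies 2 and 3 share the same compliant interval.
slow = worst_case_delay(["10m", "10m"])
fast = worst_case_delay(["10s", "10s"])
print(f"10m intervals: up to {slow // 60} min; 10s intervals: up to {fast} s")
```

Under these assumptions the two downstream hops are bounded by 20 minutes with a 10m interval, versus 20 seconds with a 10s interval.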
When making this kind of change, the user is typically aware of the possibility of cascading impacts on policies and would like to temporarily lower the compliant evaluationInterval (e.g. to 10s) in order to bring the system into correct compliance faster.
In order to achieve faster compliance under the existing feature set the user must:
- Update/patch the policy/policies with a faster evaluationInterval
- Apply changes to the policy
- Wait for compliance
- Reset the policy/policies evaluationInterval to the slower value
In a system where policies are managed by a higher-level tool flow, e.g. GitOps, the first and last steps (changing and then resetting the interval) mean the user must work through those higher-level tools to make the temporary change to evaluationInterval. There is also a risk that the user forgets to reset the interval and incurs additional CPU use beyond the window of time in which changes are being made (a subtle, hard-to-detect error).
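To make the manual workflow concrete, the sketch below shows the change-and-reset steps as plain dictionary manipulation on a Policy manifest. It is illustrative only: the helper names and the policy and template names are hypothetical, it assumes the ConfigurationPolicy is embedded under spec.policy-templates[].objectDefinition, and it does not talk to a cluster. In a GitOps flow the patched manifest would still have to be committed and later reverted through the repository.

```python
import copy


def override_compliant_interval(policy: dict, interval: str) -> tuple[dict, list]:
    """Return a patched copy of the Policy plus the original
    evaluationInterval of each template, so it can be restored later."""
    patched = copy.deepcopy(policy)
    saved = []
    for tmpl in patched["spec"]["policy-templates"]:
        obj = tmpl["objectDefinition"]
        if obj.get("kind") != "ConfigurationPolicy":
            saved.append(None)  # placeholder keeps indexes aligned
            continue
        spec = obj.setdefault("spec", {})
        saved.append(copy.deepcopy(spec.get("evaluationInterval")))
        spec.setdefault("evaluationInterval", {})["compliant"] = interval
    return patched, saved


def restore_intervals(policy: dict, saved: list) -> dict:
    """Undo override_compliant_interval using the saved values."""
    restored = copy.deepcopy(policy)
    for tmpl, original in zip(restored["spec"]["policy-templates"], saved):
        obj = tmpl["objectDefinition"]
        if obj.get("kind") != "ConfigurationPolicy":
            continue
        spec = obj.setdefault("spec", {})
        if original is None:
            spec.pop("evaluationInterval", None)
        else:
            spec["evaluationInterval"] = original
    return restored


# Hypothetical Policy wrapping one ConfigurationPolicy.
policy = {
    "apiVersion": "policy.open-cluster-management.io/v1",
    "kind": "Policy",
    "metadata": {"name": "policy-2"},
    "spec": {"policy-templates": [{"objectDefinition": {
        "apiVersion": "policy.open-cluster-management.io/v1",
        "kind": "ConfigurationPolicy",
        "metadata": {"name": "watch-subscription"},
        "spec": {"evaluationInterval": {"compliant": "10m", "noncompliant": "10s"}},
    }}]},
}

fast, saved = override_compliant_interval(policy, "10s")
# ... apply `fast`, roll out the change, wait for compliance ...
slow_again = restore_intervals(fast, saved)
```

The restore step is the one that is easy to forget; keeping the saved values next to the patched manifest is one way to make the reset mechanical rather than manual.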
The request is for a mechanism to temporarily override the evaluationInterval to drive the system to compliance faster and close/reduce the windows where false compliance is indicated. Some considerations for the override within the use case we are working with:
- The ability to specify the override on a per-cluster basis, such that it works in conjunction with Selective Policy Enforcement, is preferred. This limits the impact of increased CPU to the cluster(s) being affected by policy enforcement.
- Modifying the policies on the hub cluster through editing/patching is not viable because they are maintained in a git repository and the changes would be overwritten.
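One possible shape for such an override is sketched below, purely to illustrate the requested semantics. Everything in it is an assumption: the IntervalOverride type, the expiry field, and the function name are hypothetical and do not correspond to an existing ACM API. The expiry is included only to show how a forgotten reset could be avoided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntervalOverride:
    """Hypothetical per-cluster override; not an existing ACM API."""
    compliant: str     # temporary compliant evaluationInterval, e.g. "10s"
    expires_at: float  # seconds since the epoch; ignored after this time


def effective_compliant_interval(configured: str, cluster: str,
                                 overrides: dict[str, IntervalOverride],
                                 now: float) -> str:
    """Pick the interval for one cluster: an unexpired override wins,
    otherwise the value configured in the policy applies unchanged."""
    override: Optional[IntervalOverride] = overrides.get(cluster)
    if override is not None and now < override.expires_at:
        return override.compliant
    return configured


# Only cluster-a is overridden, so other clusters keep the slower interval.
overrides = {"cluster-a": IntervalOverride(compliant="10s", expires_at=1_000.0)}
print(effective_compliant_interval("10m", "cluster-a", overrides, now=500.0))
print(effective_compliant_interval("10m", "cluster-a", overrides, now=1_500.0))
print(effective_compliant_interval("10m", "cluster-b", overrides, now=500.0))
```

Because the override lives outside the policy definition, the copy in git stays untouched and only the selected cluster pays the extra CPU cost.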
Why is this important?
In a large fleet of clusters, the time it takes to roll out a set of configuration policies must be minimized. With long evaluationIntervals and dependent policies, a change which takes 1 minute to do manually may take 20-30 minutes.
False indications of compliance may cause incorrect action to be taken by external automation/orchestrators, cluster admins, etc. For example, a config change is rolled out in several policies, and concurrent compliance of those policies is considered an indication of "done". The automation or admin will see this concurrent compliance and take action before the cluster is actually ready.
Scenarios
...
Acceptance Criteria
...
Dependencies (internal and external)
- ...
Previous Work (Optional):
- ...
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>
- is related to: ACM-8968 Event driven ConfigurationPolicy (Closed)