The Selective Policy Enforcement feature gives users additional control over how Configuration Policy is enforced across a large set (fleet) of ManagedClusters.
Configuration Policy allows a user to declare the desired state of configuration and bind that policy to a set of ManagedClusters. The policy gives the user both visibility into the compliance state of their clusters and a mechanism to drive clusters into compliance. When the remediation action is set to "enforce", the configuration is applied to all bound clusters immediately.
In some use cases, the immediate remediation behavior of an "enforce" policy across the fleet of clusters may be unacceptable. For example, if a set of clusters collectively meets a Service Level Agreement (SLA) uptime specification, a single change in a Configuration Policy may cause unacceptable service downtime because all of the clusters are updated simultaneously. In another example, a set of clusters may provide overlapping service coverage, and configuration changes need to be made in progressive waves to ensure continuous coverage. A third example is when the cluster operator wants to soak a change on a handful of clusters before rolling it out to the entire fleet.
Selective Policy Enforcement lets the user control when the "enforce" remediationAction takes effect on a selected subset of the bound clusters.
Motivation
Selective Policy Enforcement allows fine-grained control over the timing and application of a policy to clusters. It is intended to support this level of control natively within the policy framework. With this supported as a feature, users and higher-level controllers/orchestrators do not have to implement a procedural pattern of copying policies and manipulating bindings/labels to progressively enforce the policy on clusters.
Progressive Policy Rollout
One of the key results of this feature is that the impact of a Configuration Policy change can be "rolled out" to a fleet of clusters progressively rather than all at once. The timing and the choice of clusters to which the change is applied are typically use-case dependent, but this feature allows scenarios such as:
Application to a small set of "soak" clusters before rolling the change out to the entire fleet in progressively larger groups
Staggered updates to overlapping (geographical or logical) clusters to ensure continuous service coverage
Meeting Service Level Agreements for availability by limiting concurrent updates
Applying changes to groups of clusters with varying "maintenance windows" when SLAs allow changes to be made
Applying changes to one region while deferring other regions
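The first scenario above (a small soak group followed by progressively larger groups) can be sketched in a few lines of Python. This is only an illustration of the rollout pattern, not part of the proposed feature or any existing API: the helper names plan_waves and remediation_for_wave, the soak_size and growth parameters, and the choice to leave not-yet-reached clusters in "inform" are all assumptions made for the example.

```python
from typing import Dict, List


def plan_waves(clusters: List[str], soak_size: int = 1, growth: int = 2) -> List[List[str]]:
    """Split clusters into a soak wave followed by progressively larger waves.

    Hypothetical helper: each wave is `growth` times the size of the previous one.
    """
    if soak_size < 1 or growth < 1:
        raise ValueError("soak_size and growth must be at least 1")
    waves: List[List[str]] = []
    start, size = 0, soak_size
    while start < len(clusters):
        waves.append(clusters[start:start + size])
        start += size
        size *= growth
    return waves


def remediation_for_wave(waves: List[List[str]], active_wave: int) -> Dict[str, str]:
    """Map each cluster to "enforce" if its wave has been reached, else "inform".

    Assumption: clusters in later waves only report compliance until their turn.
    """
    actions: Dict[str, str] = {}
    for index, wave in enumerate(waves):
        for cluster in wave:
            actions[cluster] = "enforce" if index <= active_wave else "inform"
    return actions


if __name__ == "__main__":
    fleet = [f"cluster{i}" for i in range(1, 8)]
    waves = plan_waves(fleet, soak_size=1, growth=2)
    for step in range(len(waves)):
        enforced = [c for c, a in remediation_for_wave(waves, step).items() if a == "enforce"]
        print(f"wave {step}: {len(enforced)} cluster(s) enforced")
```

How the waves are chosen (by region, maintenance window, or SLA constraints) is use-case specific; the sketch simply grows each group geometrically to show the shape of a progressive rollout.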
Goals
The goals of this feature are to:
Enable users to select when a policy will be enforced
Enable users to select the subset of bound clusters on which the policy will be enforced
Support progressive policy rollouts as described above
Give users visibility into the effect of a Configuration Policy change prior to remediation
Reduce complexity and scalability issues with alternative approaches
Non-Goals
The following are out of scope for this feature:
A mechanism/feature to capture/define the timing of enforcement
A feature for managing progressive rollouts. The timing and cluster selection are typically use-case specific. There may be enough commonality to define a feature that manages progressive rollouts, but that is deferred to a separate discussion/enhancement.