Epic
Resolution: Unresolved
Major
rhel-10.1
IPA state control
FutureFeature
rhel-idm-ipa
Red Hat Enterprise Linux
All
Description
As a system administrator, I want the FreeIPA deployment to be highly available and operationally robust by implementing intelligent health awareness and automated recovery behaviors. Specifically, the system should fulfill the following goals:
- Be aware of the real-time health state of all IPA replicas
Continuously monitor and expose the operational status of each replica (healthy / degraded / unhealthy / maintenance / hidden).
- Automatically remove unhealthy or maintenance replicas from client traffic pools
Withdraw replicas from DNS SRV pools (e.g., via dynamic DNS updates, or health-check-based removal from LDAP/Kerberos service records) when they enter an unhealthy state or are placed in maintenance mode.
- Automatically reintroduce healed replicas into service
Re-add replicas to client-facing pools once they return to a healthy state (automatic reintroduction once self-diagnosed issues are resolved).
- Implement dependency-aware health checks
Tie a replica's reported health to the availability and correct functioning of its critical dependencies.
Example: a KDC should be marked unhealthy and removed from rotation if its local LDAP backend is unavailable or responding incorrectly. This lets clients fail over automatically to healthy replicas instead of being stuck retrying a broken instance.
- Support extensible / pluggable health evaluation logic
Provide an architectural framework that makes it easy to add new health triggers and conditions in the future without major refactoring.
Examples of future extensions:
- React to self-state changes (e.g., CA certificate list change, shared certificate change, replica list change, replication lag exceeding threshold)
- Ideally, integrate external signals (e.g., monitoring alerts, or node-level symptoms such as memory leaks or network issues)
- Possibility of custom scripts for site-specific checks
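The health-state taxonomy from the first goal (healthy / degraded / unhealthy / maintenance / hidden) could be modeled roughly as follows. This is an illustrative sketch, not FreeIPA's actual API; the names `HealthState` and `ReplicaHealth` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class HealthState(Enum):
    """Operational states a replica can report (from the goals above)."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"
    MAINTENANCE = "maintenance"
    HIDDEN = "hidden"


@dataclass
class ReplicaHealth:
    """Point-in-time health snapshot for one replica."""
    fqdn: str
    state: HealthState

    @property
    def serves_clients(self) -> bool:
        # Only healthy (and, at most, degraded) replicas stay in
        # client-facing pools; the other states are withdrawn.
        return self.state in (HealthState.HEALTHY, HealthState.DEGRADED)
```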
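The withdraw-and-reintroduce behavior for SRV pools could be reduced to a reconciliation step: given each replica's current health, compute which targets to remove from and add back to a client-facing record set (e.g., _ldap._tcp). A minimal sketch, with a hypothetical `reconcile_srv_pool` function and FQDNs as plain strings:

```python
from typing import Dict, List, Set

# States in which a replica should receive client traffic (assumption).
SERVING_STATES = {"healthy", "degraded"}


def reconcile_srv_pool(current_pool: Set[str],
                       health: Dict[str, str]) -> Dict[str, List[str]]:
    """Return the SRV targets to withdraw and to reintroduce."""
    should_serve = {fqdn for fqdn, state in health.items()
                    if state in SERVING_STATES}
    return {
        # Withdraw replicas that are in the pool but no longer serving.
        "remove": sorted(current_pool - should_serve),
        # Reintroduce healed replicas missing from the pool.
        "add": sorted(should_serve - current_pool),
    }
```

The output of such a step would then drive the actual record changes (e.g., dynamic DNS updates), so the same logic covers both removal and automatic re-addition.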
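The dependency-aware goal (the KDC/LDAP example) amounts to capping a service's effective health by its critical dependencies. A sketch under assumed names; the probe callables stand in for real connectivity checks:

```python
from typing import Callable, Dict


def effective_health(service_ok: bool,
                     dependency_probes: Dict[str, Callable[[], bool]]) -> str:
    """A service is unhealthy if it, or any critical dependency, fails."""
    if not service_ok:
        return "unhealthy"
    for name, probe in dependency_probes.items():
        if not probe():
            # e.g., the KDC process is up, but its local LDAP backend
            # is not: report unhealthy so the replica is withdrawn and
            # clients fail over to healthy replicas.
            return "unhealthy"
    return "healthy"
```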
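The extensibility goal could be served by a simple check registry: new triggers (replication lag, certificate changes, site-specific scripts) register themselves, and the evaluator runs whatever is registered, so additions need no core refactoring. A hypothetical sketch; the registered checks here are placeholders:

```python
from typing import Callable, Dict, List

HealthCheck = Callable[[], bool]  # True = check passed
_REGISTRY: Dict[str, HealthCheck] = {}


def register_check(name: str):
    """Decorator that adds a check to the registry under a name."""
    def wrap(fn: HealthCheck) -> HealthCheck:
        _REGISTRY[name] = fn
        return fn
    return wrap


def run_checks() -> List[str]:
    """Return the names of all failing checks."""
    return sorted(name for name, fn in _REGISTRY.items() if not fn())


@register_check("replication-lag")
def replication_lag_ok() -> bool:
    # Placeholder: would compare measured lag against a threshold.
    return True


@register_check("custom-site-check")
def site_check() -> bool:
    # Placeholder for an operator-supplied, site-specific check.
    return True
```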
These capabilities should work together to achieve the following outcomes:
- Minimize client-perceived downtime during replica failures or maintenance
- Reduce manual intervention for common failure modes
- Improve overall cluster resilience and observability
Which SSTs and Layered Product teams should review this?
FreeIPA dev team.