Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: rhos-17.1.z
Affects Version/s: rhos-17.1.z
Component/s: openstack-tripleo-heat-templates
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Bugzilla Bug: RHBZ: 2323714
AssignedTeam: rhos-ops-platform-services-security
Regression: None
Intelligence Requested:
Market:
PX Impact Score:
Severity: Important

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:
Two RHOSP 16.2 customers recently reported the same problem: at some point, control plane services became unavailable and stopped responding to API calls. The two deployments showed slightly different symptoms but shared the same root cause: certmonger did not trigger the necessary service restarts after new certificates were issued.

It looks like we intentionally do not trigger restarts for pacemaker-controlled services, because we cannot predict how pacemaker will react to local container restarts. This makes sense from a cluster stability perspective (imagine galera being blocked after the certmonger automation restarts containers locally), but the customer still has to restart these services at some point (they usually have around 30 days to do so) and schedule a maintenance window for it. It probably makes sense to update our documentation for RHOSP 16.2 and 17.1 accordingly, but first I want to ask engineering to double-check whether a documentation change is the best approach and to let me know if a bug should be reported.
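
To help plan the maintenance window mentioned above, here is a minimal Python sketch that computes how many whole days remain before a certificate expires. It assumes the notAfter value has already been read from the certificate that the service still uses (for example, from the output of `openssl x509 -noout -enddate`); the function name `days_until_expiry` is hypothetical and not part of any RHOSP tooling.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the whole days left before a certificate's notAfter time.

    not_after uses the OpenSSL text format, e.g. "Jan  1 00:00:00 2030 GMT".
    now is a Unix timestamp; it defaults to the current time.
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires_at - now) // SECONDS_PER_DAY)
```

A result close to zero for a pacemaker-controlled service would suggest that the manual restart should not be postponed any further.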

As for the other services (mostly API endpoints), it looks like they are not restarted properly: the USR1 signal that pkill sends to all httpd processes either does not reach the containerized httpd processes or does not trigger a proper restart. Without a manual restart, the services continue to process requests using the old certificate bundle.
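
One way to confirm that an endpoint is still using the old certificate is to compare the certificate it presents over TLS with the renewed certificate on disk. The Python sketch below does that by SHA-256 fingerprint. It is only an illustration: the function names are hypothetical, the host, port, and file path are supplied by the caller, and it assumes the first certificate in the file is the service certificate.

```python
import hashlib
import re
import ssl

# Matches one PEM-encoded certificate, ignoring any text around it.
_PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def fingerprint_from_pem(pem_text):
    """Return the SHA-256 hex digest of the first certificate in pem_text."""
    match = _PEM_BLOCK.search(pem_text)
    if match is None:
        raise ValueError("no PEM certificate found")
    der = ssl.PEM_cert_to_DER_cert(match.group(0))
    return hashlib.sha256(der).hexdigest()


def serves_stale_cert(host, port, cert_path):
    """Return True if the endpoint presents a different certificate
    than the one stored at cert_path (i.e. it was not reloaded)."""
    # get_server_certificate does not validate the chain by default,
    # which is fine here because only the fingerprints are compared.
    served = ssl.get_server_certificate((host, port))
    with open(cert_path) as handle:
        on_disk = handle.read()
    return fingerprint_from_pem(served) != fingerprint_from_pem(on_disk)
```

If `serves_stale_cert` returns True after certmonger has renewed the certificate, the service has not picked up the new file and most likely still needs a manual restart.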

Version-Release number of selected component (if applicable): reported for RHOSP 16.2; RHOSP 17.1 is likely affected as well

How reproducible: do not restart control plane containers for 30 days after certmonger automatically renews the certificates, then watch the control plane services fail

Actual results: customers are unaware of the requirement to restart some containers, and the automation does not handle certificate renewals properly

Expected results: customers are able to properly handle expired certificates, and the automation handles most of the tasks

is duplicated by

OSPRH-12160 BZ#2323714 [TLS-E] Certmonger doesn't trigger service restarts after updating expiring certificates