-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
None
-
None
-
3
-
False
-
-
False
-
?
-
rhos-ops-day1day2-hardprov
-
None
-
-
-
-
HardProv Sprint 23
-
1
-
Moderate
The bmo01-rhel9-rhoso18.0 job fails during the post-infrastructure PCP metrics collection hook when attempting to gather performance metrics from OCP master nodes. All three OCP master nodes (ocp-master-0, ocp-master-1, ocp-master-2) become unreachable via SSH during the "pcp_metrics : Populate service facts" task, causing the job to fail with "NO MORE HOSTS LEFT" error.
During the execution of the pre_logs_pcp_metrics hook at the end of the job run, the playbook attempts to collect PCP (Performance Co-Pilot) metrics from all hosts including the OCP master nodes. However, SSH connections to all three OCP master nodes timeout after ~20 seconds during Python interpreter discovery.
The OCP master nodes were successfully configured earlier in the job (around 06:44-06:45) and SSH access was working properly at that time. The nodes only become unreachable during the PCP metrics collection phase near the end of the job execution.
Expected Behavior:
- PCP metrics should be collected successfully from all OCP master nodes
- SSH connections should remain stable throughout the job execution
- Job should complete successfully
Actual Behavior:
- SSH connections to master-0.utility, master-1.utility, and master-2.utility timeout
- Python interpreter discovery fails for all OCP master nodes
- Job fails with "NO MORE HOSTS LEFT" error
- PLAY RECAP shows: unreachable=1 for each OCP master node
Failure Location:
- Playbook: pcp-metrics-post.yml (pre_logs_pcp_metrics hook)
- Role: pcp_metrics
- Task: Populate service facts
Possible Cause:
1. Network connectivity issues between controller and OCP master nodes after extended job runtime
2. SSH daemon issues on OCP master nodes (resource exhaustion, max connections)
3. Firewall or network routing changes during job execution
4. OCP master nodes becoming unresponsive due to resource constraints
5. SSH jumper/proxy configuration timing out or becoming stale
- Log: ci_script_000_run_hook_without_retry_pcp.log
- Timestamp: 2026-02-18 07:02:41 UTC
Additional Context:
- controller-0 and hypervisor hosts remain reachable and responsive
- Connection timeout is exactly at SSH level (port 22)
- All three master nodes fail simultaneously, suggesting a systemic issue rather than individual node problems
- The failure occurs consistently at the same phase (PCP metrics collection)