Loading...

XML

Word

Printable

Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: internal
Labels:
None

Story Points:
3
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Docs Approval:
?
AssignedTeam:
rhos-ops-day1day2-hardprov
Regression:
None
Intelligence Requested:
Market:
PX Impact Score:

Sprint:
HardProv Sprint 23
sprint_count:
1
Severity:
Moderate

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

The bmo01-rhel9-rhoso18.0 job fails during the post-infrastructure PCP metrics collection hook when attempting to gather performance metrics from OCP master nodes. All three OCP master nodes (ocp-master-0, ocp-master-1, ocp-master-2) become unreachable via SSH during the "pcp_metrics : Populate service facts" task, causing the job to fail with "NO MORE HOSTS LEFT" error.

During the execution of the pre_logs_pcp_metrics hook at the end of the job run, the playbook attempts to collect PCP (Performance Co-Pilot) metrics from all hosts including the OCP master nodes. However, SSH connections to all three OCP master nodes timeout after ~20 seconds during Python interpreter discovery.

The OCP master nodes were successfully configured earlier in the job (around 06:44-06:45) and SSH access was working properly at that time. The nodes only become unreachable during the PCP metrics collection phase near the end of the job execution.

Expected Behavior:
- PCP metrics should be collected successfully from all OCP master nodes
- SSH connections should remain stable throughout the job execution
- Job should complete successfully

Actual Behavior:
- SSH connections to master-0.utility, master-1.utility, and master-2.utility timeout
- Python interpreter discovery fails for all OCP master nodes
- Job fails with "NO MORE HOSTS LEFT" error
- PLAY RECAP shows: unreachable=1 for each OCP master node

Failure Location:
- Playbook: pcp-metrics-post.yml (pre_logs_pcp_metrics hook)
- Role: pcp_metrics
- Task: Populate service facts

Possible Cause:
1. Network connectivity issues between controller and OCP master nodes after extended job runtime
2. SSH daemon issues on OCP master nodes (resource exhaustion, max connections)
3. Firewall or network routing changes during job execution
4. OCP master nodes becoming unresponsive due to resource constraints
5. SSH jumper/proxy configuration timing out or becoming stale
- Log: ci_script_000_run_hook_without_retry_pcp.log
- Timestamp: 2026-02-18 07:02:41 UTC

Additional Context:
- controller-0 and hypervisor hosts remain reachable and responsive
- Connection timeout is exactly at SSH level (port 22)
- All three master nodes fail simultaneously, suggesting a systemic issue rather than individual node problems
- The failure occurs consistently at the same phase (PCP metrics collection)

Assignee:: Abhishek Bongale

Reporter:: Abhishek Bongale

Team:: rhos-dfg-hardprov

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Created:: 2026/02/23 1:14 PM

Updated:: 2026/02/26 11:57 AM

Details

Description

Attachments

Easy Agile Planning Poker

Activity

People

Dates

PagerDuty