Bug
Resolution: Done-Errata
Major
rhos-18.0 FR 2 (Mar 2025)
3
False
False
?
openstack-ansible-ee-container-1.0.11-8
rhos-ops-day1day2-edpm
None
EDPM Sprint 2, EDPM Sprint 3, EDPM Sprint 4, EDPM Sprint 5
4
Important
This is kind of an ugly issue and I'm not sure of a fix other than re-deploying the EDPM nodes (or DB surgery). Hopefully there is an easier solution.
New RHOSO 18 environment - deploys successfully and runs VM instances fine. However, live migration fails with the following:
2025-05-05T14:40:01.747623000Z libvirt.libvirtError: operation failed: job 'migration out' failed: address resolution failed for compute-node5:61152: Name or service not known
2025-05-05T14:40:01.810586000Z 2025-05-05 14:40:01.810 2 DEBUG nova.virt.libvirt.driver [None req-6ea75b73-61e0-4068-80e1-3aca268504b5 121d3d39bd634def98c0f3e824b52570 7df579c77c2e4141acd10edca9045c97 - - default default] [instance: 7c874c2e-200d-4c94-94f1-a191330a41ff] Live migration monitoring is all done _live_migration /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:10947
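The libvirt error above is at bottom just a failed DNS lookup on the short hostname. A minimal sketch of the same failure mode (hypothetical helper; `compute-node5.invalid` stands in for a name that cannot resolve, since `.invalid` is reserved by RFC 2606):

```python
import socket

def can_resolve(host: str, port: int) -> bool:
    """Return True if host:port resolves; libvirt performs a lookup
    like this before opening the live-migration connection."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Corresponds to libvirt's "Name or service not known"
        return False

# A short name with no usable search domain behaves like this
# reserved name, which is guaranteed not to resolve:
print(can_resolve("compute-node5.invalid", 61152))  # False
```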
Upon further inspection, the computes are registered with short names where normally they would be <name>.ctlplane.<domain>.
$ oc rsh openstackclient openstack hypervisor list
+--------------------------------------+---------------------+-----------------+---------+-------+
| ID                                   | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+--------------------------------------+---------------------+-----------------+---------+-------+
| 27e040bd-7907-4475-9fbb-89852bd27ccc | compute-node1       | QEMU            | x.x.x.x | up    |
| 813323dd-dacc-42c9-b04e-367fb46ac0e6 | compute-node2       | QEMU            | x.x.x.x | down  |
| efce87fb-c458-40fc-acda-b49d5c4ac800 | compute-node4       | QEMU            | x.x.x.x | down  |
| 41b167fc-988f-4660-8f09-d77ef635476d | compute-node6       | QEMU            | x.x.x.x | down  |
| 2ed12ccc-3301-4d05-888f-61a9330419ac | compute-node5       | QEMU            | x.x.x.x | up    |
| 88a64875-6de1-4ed5-b6d3-dca33369ae9b | compute-node3       | QEMU            | x.x.x.x | up    |
| 00b64641-4951-4c69-accf-3ae21b20de6f | compute-node7       | QEMU            | x.x.x.x | down  |
+--------------------------------------+---------------------+-----------------+---------+-------+
The resolv.conf has no DNS domain and the system has no FQDN (hostname -f):
[compute-node6]$ cat etc/resolv.conf
# Generated by NetworkManager
nameserver x.x.x.x
$ cat sos_commands/host/hostname_-f
compute-node6
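The manual check above can be automated. A small hedged sketch (not part of any RHOSO tooling) that flags a resolv.conf with no search/domain entry:

```python
import re

def has_dns_domain(resolv_conf: str) -> bool:
    """Return True if a resolv.conf body configures a DNS search
    domain via a 'search' or 'domain' directive."""
    return any(
        re.match(r"\s*(search|domain)\s+\S+", line)
        for line in resolv_conf.splitlines()
    )

# The failing EDPM nodes look like this: a nameserver but no domain.
broken = "# Generated by NetworkManager\nnameserver x.x.x.x\n"
print(has_dns_domain(broken))  # False
```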
The underlying cause here is the missing DNS domain config in the OpenStackDataPlaneNodeSet network_config (compare the example in step 9 of this procedure: https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/deploying_red_hat_openstack_services_on_openshift/assembly_creating-the-data-plane#proc_creating-an-OpenStackDataPlaneNodeSet-CR-with-preprovisioned-nodes_dataplane):
network_config:
- type: ovs_bridge
name: {{ neutron_physical_bridge_name }}
mtu: {{ min_viable_mtu }}
use_dhcp: false
dns_servers: {{ ctlplane_dns_nameservers }}
domain: {{ dns_search_domains }} #<<<<<<MISSING
After adding this config back, nova_compute fails to restart with this error:
May 8 09:21:12 compute-node6 nova_compute[94947]: 2025-05-08 09:21:12.462 2 ERROR oslo_service.service nova.exception.InvalidConfiguration: My compute node 41b167fc-988f-4660-8f09-d77ef635476d has hypervisor_hostname compute-node6 but virt driver reports it should be compute-node6.ctlplane.<DOMAIN>. Possible rename detected, refusing to start!
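The refusal comes from a startup rename guard in nova-compute. A simplified sketch (not nova's actual code) of the comparison that raises this error:

```python
class InvalidConfiguration(Exception):
    """Stand-in for nova.exception.InvalidConfiguration."""

def check_rename_guard(db_hostname: str, driver_hostname: str) -> None:
    """Refuse to start if the hostname recorded in the nova DB no
    longer matches what the virt driver reports (simplified)."""
    if db_hostname != driver_hostname:
        raise InvalidConfiguration(
            f"My compute node has hypervisor_hostname {db_hostname} "
            f"but virt driver reports it should be {driver_hostname}. "
            "Possible rename detected, refusing to start!"
        )
```

Once the FQDN is restored on the host, the DB still holds the short name, so e.g. `check_rename_guard("compute-node6", "compute-node6.ctlplane.example.com")` raises.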
Removing and re-adding the compute node in nova does not work either; there appear to be protections against hostname changes and against re-registering a compute with the same UUID.
1. Is there a way to fix this without redeploying the EDPM node or a manual DB update?
2. A user customizing the network_config may not realize how important this setting is, given that it cannot be fixed post-deployment. The missing config should result in a deployment failure, IMO.
Thanks for looking at this issue.
- links to: RHBA-2025:152103 Release of containers for RHOSO OpenStack EDPM images