Bug
Resolution: Done-Errata
Major
rhos-18.0 FR 2 (Mar 2025)
3
False
False
?
openstack-ansible-ee-container-1.0.11-8
rhos-ops-day1day2-edpm
None
EDPM Sprint 2, EDPM Sprint 3, EDPM Sprint 4, EDPM Sprint 5
4
Important
This is kind of an ugly issue and I'm not sure of a fix other than re-deploying the EDPM nodes (or DB surgery). Hopefully there is an easier solution.
New RHOSO 18 environment - deploys successfully and runs VM instances fine. However, live migration fails with the following:
2025-05-05T14:40:01.747623000Z libvirt.libvirtError: operation failed: job 'migration out' failed: address resolution failed for compute-node5:61152: Name or service not known
2025-05-05T14:40:01.810586000Z 2025-05-05 14:40:01.810 2 DEBUG nova.virt.libvirt.driver [None req-6ea75b73-61e0-4068-80e1-3aca268504b5 121d3d39bd634def98c0f3e824b52570 7df579c77c2e4141acd10edca9045c97 - - default default] [instance: 7c874c2e-200d-4c94-94f1-a191330a41ff] Live migration monitoring is all done _live_migration /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:10947
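The libvirt error above is at bottom just a failed DNS lookup on the short hostname. A minimal sketch of the same failure mode (hypothetical helper; `compute-node5.invalid` stands in for a name that cannot resolve, since `.invalid` is reserved by RFC 2606):

```python
import socket

def can_resolve(host: str, port: int) -> bool:
    """Return True if host:port resolves; libvirt performs a lookup
    like this before opening the live-migration connection."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Corresponds to libvirt's "Name or service not known"
        return False

# A short name with no usable search domain behaves like this
# reserved name, which is guaranteed not to resolve:
print(can_resolve("compute-node5.invalid", 61152))  # False
```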
Upon further inspection, the computes are registered with short names where normally they would be <name>.ctlplane.<domain>.
$ oc rsh openstackclient openstack hypervisor list
+--------------------------------------+---------------------+-----------------+---------+-------+
| ID                                   | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+--------------------------------------+---------------------+-----------------+---------+-------+
| 27e040bd-7907-4475-9fbb-89852bd27ccc | compute-node1       | QEMU            | x.x.x.x | up    |
| 813323dd-dacc-42c9-b04e-367fb46ac0e6 | compute-node2       | QEMU            | x.x.x.x | down  |
| efce87fb-c458-40fc-acda-b49d5c4ac800 | compute-node4       | QEMU            | x.x.x.x | down  |
| 41b167fc-988f-4660-8f09-d77ef635476d | compute-node6       | QEMU            | x.x.x.x | down  |
| 2ed12ccc-3301-4d05-888f-61a9330419ac | compute-node5       | QEMU            | x.x.x.x | up    |
| 88a64875-6de1-4ed5-b6d3-dca33369ae9b | compute-node3       | QEMU            | x.x.x.x | up    |
| 00b64641-4951-4c69-accf-3ae21b20de6f | compute-node7       | QEMU            | x.x.x.x | down  |
+--------------------------------------+---------------------+-----------------+---------+-------+
The resolv.conf has no DNS domain and the system has no FQDN (hostname -f):
[compute-node6]$ cat etc/resolv.conf
# Generated by NetworkManager
nameserver x.x.x.x
$ cat sos_commands/host/hostname_-f
compute-node6
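The manual check above can be automated. A small hedged sketch (not part of any RHOSO tooling) that flags a resolv.conf with no search/domain entry:

```python
import re

def has_dns_domain(resolv_conf: str) -> bool:
    """Return True if a resolv.conf body configures a DNS search
    domain via a 'search' or 'domain' directive."""
    return any(
        re.match(r"\s*(search|domain)\s+\S+", line)
        for line in resolv_conf.splitlines()
    )

# The failing EDPM nodes look like this: a nameserver but no domain.
broken = "# Generated by NetworkManager\nnameserver x.x.x.x\n"
print(has_dns_domain(broken))  # False
```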
The underlying cause here is the missing DNS domain config in the OpenStackDataPlaneNodeSet network_config (compare the example in step 9 of this procedure: https://docs.redhat.com/en/documentation/red_hat_openstack_services_on_openshift/18.0/html/deploying_red_hat_openstack_services_on_openshift/assembly_creating-the-data-plane#proc_creating-an-OpenStackDataPlaneNodeSet-CR-with-preprovisioned-nodes_dataplane):
network_config:
- type: ovs_bridge
name: {{ neutron_physical_bridge_name }}
mtu: {{ min_viable_mtu }}
use_dhcp: false
dns_servers: {{ ctlplane_dns_nameservers }}
domain: {{ dns_search_domains }} #<<<<<<MISSING
After adding this config back, nova_compute fails to restart with this error:
May 8 09:21:12 compute-node6 nova_compute[94947]: 2025-05-08 09:21:12.462 2 ERROR oslo_service.service nova.exception.InvalidConfiguration: My compute node 41b167fc-988f-4660-8f09-d77ef635476d has hypervisor_hostname compute-node6 but virt driver reports it should be compute-node6.ctlplane.<DOMAIN>. Possible rename detected, refusing to start!
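The refusal comes from a startup rename guard in nova-compute. A simplified sketch (not nova's actual code) of the comparison that raises this error:

```python
class InvalidConfiguration(Exception):
    """Stand-in for nova.exception.InvalidConfiguration."""

def check_rename_guard(db_hostname: str, driver_hostname: str) -> None:
    """Refuse to start if the hostname recorded in the nova DB no
    longer matches what the virt driver reports (simplified)."""
    if db_hostname != driver_hostname:
        raise InvalidConfiguration(
            f"My compute node has hypervisor_hostname {db_hostname} "
            f"but virt driver reports it should be {driver_hostname}. "
            "Possible rename detected, refusing to start!"
        )
```

Once the FQDN is restored on the host, the DB still holds the short name, so e.g. `check_rename_guard("compute-node6", "compute-node6.ctlplane.example.com")` raises.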
Removing and re-adding the compute node in nova does not work either; there appear to be protections against hostname changes and against re-registering a compute with the same UUID.
1. Is there a way to fix this without redeploying the EDPM node or a manual DB update?
2. A user customizing the network_config may not realize how important this setting is, given that it cannot be fixed post-deployment. The missing config should result in a deployment failure, IMO.
Thanks for looking at this issue.
- links to: RHBA-2025:152103 Release of containers for RHOSO OpenStack EDPM images