-
Bug
-
Resolution: Unresolved
-
Normal
Moderate
When the "StorageMgmt" network is defined in the NodeSet, the EDPM node is deployed with canonical_hostname pointing to the ctlplane network, while the DNS search order lists the storagemgmt domain first. As a result, hostname -f returns the storagemgmt FQDN, which leads to certificate validation errors during live migration.
Apr 30 08:31:03 compute-2 virtqemud[106408]: QEMU_MONITOR_RECV_REPLY: mon=0x7fdd50008180 reply={"return": {"status": "failed", "error-desc": "Certificate does not match the hostname compute-1.storagemgmt.example.com"}, "id": "libvirt-438"}
Apr 30 08:31:03 compute-2 virtqemud[106408]: operation failed: job 'migration out' failed: Certificate does not match the hostname compute-1.storagemgmt.example.com
Apr 30 08:31:04 compute-2 virtqemud[106408]: internal error: QEMU unexpectedly closed the monitor (vm='instance-00000091'): 2024-04-30T08:31:03.714360Z qemu-kvm: Cannot read from TLS channel: Input/output error
2024-04-30T08:31:03.714746Z qemu-kvm: Cannot read from TLS channel: Input/output error
2024-04-30T08:31:03.714863Z qemu-kvm: Cannot read from TLS channel: Input/output error
2024-04-30T08:31:03.715030Z qemu-kvm: Not a migration stream
2024-04-30T08:31:03.715205Z qemu-kvm: load of migration failed: Invalid argument

[root@compute-2 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search storagemgmt.example.com ctlplane.example.com internalapi.example.com storage.example.com tenant.example.com ocp.openstack.lab
nameserver 192.168.122.80

[root@compute-2 ~]# hostname -f
compute-2.storagemgmt.example.com

[zuul@controller-0 ~]$ oc get secret dataplanenodeset-openstack-edpm -o json|jq -r '.data["inventory"]'|base64 -d|grep can
canonical_hostname: compute-0.ctlplane.example.com
canonical_hostname: compute-1.ctlplane.example.com
canonical_hostname: compute-2.ctlplane.example.com

[root@compute-2 ~]# cat /etc/os-net-config/config.yaml |grep domain
domain: ['storagemgmt.example.com', 'ctlplane.example.com', 'internalapi.example.com', 'storage.example.com', 'tenant.example.com']

[zuul@controller-0 architecture]$ oc get secret dataplanenodeset-openstack-edpm -o json|jq -r '.data["inventory"]'|base64 -d|grep dns_search_domains: -A5
dns_search_domains:
- storagemgmt.example.com
- ctlplane.example.com
- internalapi.example.com
- storage.example.com
- tenant.example.com
--
dns_search_domains:
- storagemgmt.example.com
- ctlplane.example.com
- internalapi.example.com
- storage.example.com
- tenant.example.com
--
dns_search_domains:
- storagemgmt.example.com
- ctlplane.example.com
- internalapi.example.com
- storage.example.com
- tenant.example.com

[zuul@controller-0 architecture]$ oc get ipset compute-2 -o json|jq .status.reservations
[
  {
    "address": "172.20.0.103",
    "cidr": "172.20.0.0/24",
    "dnsDomain": "storagemgmt.example.com",
    "mtu": 1500,
    "network": "StorageMgmt",
    "subnet": "subnet1",
    "vlan": 23
  },
  {
    "address": "192.168.122.102",
    "cidr": "192.168.122.0/24",
    "dnsDomain": "ctlplane.example.com",
    "gateway": "192.168.122.1",
    "mtu": 1500,
    "network": "ctlplane",
    "routes": [
      {
        "destination": "0.0.0.0/0",
        "nexthop": "192.168.122.1"
      }
    ],
    "subnet": "subnet1"
  },
  {
    "address": "172.17.0.103",
    "cidr": "172.17.0.0/24",
    "dnsDomain": "internalapi.example.com",
    "mtu": 1496,
    "network": "internalapi",
    "subnet": "subnet1",
    "vlan": 20
  },
  {
    "address": "172.18.0.103",
    "cidr": "172.18.0.0/24",
    "dnsDomain": "storage.example.com",
    "mtu": 1496,
    "network": "storage",
    "subnet": "subnet1",
    "vlan": 21
  },
  {
    "address": "172.19.0.103",
    "cidr": "172.19.0.0/24",
    "dnsDomain": "tenant.example.com",
    "mtu": 1496,
    "network": "tenant",
    "subnet": "subnet1",
    "vlan": 22
  }
]
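The mismatch visible in the output above (first search domain vs. the domain of canonical_hostname) could be checked with something like the following. This is only a minimal sketch: firstSearchDomain and domainOf are hypothetical helper names, not part of any operator, and the resolv.conf text is a shortened copy of the one shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// firstSearchDomain returns the first domain on the "search" line of a
// resolv.conf style text, or "" if there is no such line.
// Hypothetical helper, for illustration only.
func firstSearchDomain(resolvConf string) string {
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "search" {
			return fields[1]
		}
	}
	return ""
}

// domainOf returns everything after the first dot of an FQDN.
func domainOf(fqdn string) string {
	_, domain, _ := strings.Cut(fqdn, ".")
	return domain
}

func main() {
	resolvConf := "# Generated by NetworkManager\n" +
		"search storagemgmt.example.com ctlplane.example.com\n" +
		"nameserver 192.168.122.80\n"
	canonical := "compute-2.ctlplane.example.com"

	first := firstSearchDomain(resolvConf)
	fmt.Println("first search domain:", first)
	fmt.Println("matches canonical_hostname domain:", first == domainOf(canonical))
}
```

With the values from this report the comparison is false, which is the condition that precedes the certificate mismatch.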
This is caused by infra-operator sorting the network names lexicographically: Go compares strings byte by byte, so the uppercase "S" in StorageMgmt sorts before the lowercase "c" in ctlplane.
https://github.com/openstack-k8s-operators/infra-operator/blame/main/controllers/network/ipset_controller.go#L278-L281
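The ordering effect can be reproduced in isolation. The snippet below is a standalone illustration using the network names from this report and the standard library's sort.Strings; it is not the infra-operator code, and sortedNetworkNames is a name made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedNetworkNames returns a lexicographically sorted copy of names,
// leaving the input slice untouched. Illustrative helper only.
func sortedNetworkNames(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

func main() {
	names := []string{"ctlplane", "internalapi", "storage", "tenant", "StorageMgmt"}

	// Prints: [StorageMgmt ctlplane internalapi storage tenant]
	fmt.Println(sortedNetworkNames(names))

	// Go string comparison is bytewise: 'S' (0x53) < 'c' (0x63).
	// Prints: true
	fmt.Println("StorageMgmt" < "ctlplane")
}
```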
As a workaround, use only lower-case network names in the NodeSet, and choose them so that ctlplane still comes first after Go's sorting.
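Whether a given set of network names satisfies the workaround can be tested quickly. The following is a rough sketch, assuming the sort behaves like sort.Strings; ctlplaneSortsFirst is a hypothetical helper, and the network name "api" is invented to show that lower-casing alone is not sufficient.

```go
package main

import (
	"fmt"
	"sort"
)

// ctlplaneSortsFirst reports whether "ctlplane" is the first name after
// a lexicographic sort of names. Hypothetical helper for illustration.
func ctlplaneSortsFirst(names []string) bool {
	if len(names) == 0 {
		return false
	}
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	return sorted[0] == "ctlplane"
}

func main() {
	// Lower-cased variant of the networks in this report: ctlplane is first.
	fmt.Println(ctlplaneSortsFirst([]string{"storagemgmt", "ctlplane", "internalapi", "storage", "tenant"}))

	// Original mixed-case name: StorageMgmt sorts ahead of ctlplane.
	fmt.Println(ctlplaneSortsFirst([]string{"StorageMgmt", "ctlplane", "internalapi"}))

	// Lower case is not enough on its own: "api" (made-up name) sorts first.
	fmt.Println(ctlplaneSortsFirst([]string{"api", "ctlplane"}))
}
```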
The real solution could be to drop the lexicographical ordering of IP reservations: the infra-operator should keep the reservations in the order given, without reordering. The dataplane-operator could additionally implement a validation webhook that ensures the first network in the NodeSet is always ctlplane.
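The check such a webhook would perform might look roughly like the sketch below. It is not dataplane-operator code: validateFirstNetwork and the plain []string input are assumptions made for illustration, and a real webhook would read the network list from the NodeSet spec.

```go
package main

import "fmt"

// ctlplaneNetwork is the network name expected in first position.
const ctlplaneNetwork = "ctlplane"

// validateFirstNetwork returns an error unless the first entry of
// networks is the ctlplane network. Hypothetical validation helper.
func validateFirstNetwork(networks []string) error {
	if len(networks) == 0 {
		return fmt.Errorf("no networks defined, expected %q first", ctlplaneNetwork)
	}
	if networks[0] != ctlplaneNetwork {
		return fmt.Errorf("first network is %q, expected %q", networks[0], ctlplaneNetwork)
	}
	return nil
}

func main() {
	// Accepted: ctlplane is listed first.
	fmt.Println(validateFirstNetwork([]string{"ctlplane", "StorageMgmt", "internalapi"}))

	// Rejected: another network precedes ctlplane.
	fmt.Println(validateFirstNetwork([]string{"StorageMgmt", "ctlplane", "internalapi"}))
}
```

A check like this only helps if the reservation order is preserved downstream, which is why it is paired with dropping the sort in the infra-operator.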
See the slack discussion as well https://redhat-internal.slack.com/archives/CQXJFGMK6/p1714474966545229
Split to: OSPRH-9455 Live migration fails with TLS cert error when ctlplane network is not listed as the first network (Closed)