Type: Bug
Resolution: Unresolved
Priority: Normal
Severity: Moderate
Target release: rhos-17.1.z
Version: openstack-tripleo-validations-14.3.2-17.1.20250120160809.2b526f8.el9osttrunk
In support case https://access.redhat.com/support/cases/#/case/04021756 (see also [1]), a customer got stuck upgrading from RHCSv4 to RHCSv5. The workaround was for them to `podman pull` the ceph container image on their undercloud.
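For context, that workaround amounts to pre-pulling the image onto the undercloud host before re-running the upgrade step. A minimal sketch, where the registry, image name, and tag are placeholders (the real values come from the environment's container image configuration):

(undercloud) [stack@lab-sp-director ~]$ sudo podman pull satellite.example.com/rhceph/rhceph-5-rhel8:latest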
In step 3 of the following chapter, a specific validation playbook task fails like this:
(undercloud) [stack@lab-sp-director ~]$ openstack overcloud external-upgrade run --skip-tags ceph_ansible_remote_tmp --stack overcloud --tags cephadm_adopt 2>&1 -y
### output omitted ###
2024-12-03 18:01:48.581362 | 005056a3-3ff7-c9b1-f534-0000000000b3 | TASK | Check for valid ceph version during FFU
2024-12-03 18:01:48.669910 | 005056a3-3ff7-c9b1-f534-0000000000b3 | FATAL | Check for valid ceph version during FFU | undercloud | error={"changed": false, "msg": "Target ceph version cannot be for FFU."}
### output omitted ###
That Ansible task was introduced by this change:
https://code.engineering.redhat.com/gerrit/c/openstack-tripleo-validations/+/450495
It was put in place to avoid this scenario:
https://bugzilla.redhat.com/show_bug.cgi?id=2259286
The tasks file `roles/ceph/tasks/ceph-upgrade-version-check.yaml` assumes that the first task below will always succeed:
- name: Get Ceph version
  shell: "{{ container_client | default('podman') }} run --rm --entrypoint=ceph {{ ceph_container }} -v | awk '{print $5}'"
  register: ceph_version
  become: true
  vars:
    ceph_container: "{{ tripleo_cephadm_container_ns }}/{{ tripleo_cephadm_container_image }}:{{ tripleo_cephadm_container_tag }}"

- name: Check for valid ceph version during FFU
  fail:
    msg: "Target ceph version cannot be {{ ceph_version.stdout }} for FFU."
  when:
    - ceph_version.stdout != 'pacific'
The code above was tested, but only in a scenario where the ceph container image is already present on the undercloud, which is our default scenario. However, if the customer is using Satellite to host the image, the first task above does not populate ceph_version.stdout with a Ceph version string: the `podman run` produces no version output, and because it is piped into awk the pipeline still exits 0, so the task "succeeds" with an empty stdout.
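A quick way to confirm that scenario is to check on the undercloud whether the image the validation expects is actually present locally; a hedged example, with the image reference as a placeholder (exit status 0 means the image is local, 1 means it is not):

(undercloud) [stack@lab-sp-director ~]$ sudo podman image exists <namespace>/<image>:<tag>; echo $?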
This tasks file should be able to handle that case: specifically, it should handle the case where ceph_version.stdout, with all whitespace removed, is empty. If it had at least output an error like
"Unable to determine ceph version by running $CMD on $HOST",
that alone would have given the customer insight into why the validation had failed.
It's up to the person fixing this bug how to handle that. Perhaps start with the above, but also add something which tries to `podman pull` the container (see the sketch below)?
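A minimal sketch of what that could look like, reusing the variables from the existing tasks file; the task names, the `failed_when: false` pull attempt, and the wording of the error message are only illustrative assumptions, not the actual fix:

# Best-effort attempt to pull the image so the version check below has something to run against
- name: Try to pull the ceph container image onto the undercloud
  shell: "{{ container_client | default('podman') }} pull {{ ceph_container }}"
  register: ceph_pull
  failed_when: false
  become: true
  vars:
    ceph_container: "{{ tripleo_cephadm_container_ns }}/{{ tripleo_cephadm_container_image }}:{{ tripleo_cephadm_container_tag }}"

# ... existing "Get Ceph version" task runs here, unchanged ...

# Fail loudly when no version string could be determined, instead of printing an empty version
- name: Fail with a clear message when no ceph version could be determined
  fail:
    msg: >-
      Unable to determine ceph version by running
      '{{ container_client | default('podman') }} run --rm --entrypoint=ceph {{ ceph_container }} -v'
      on {{ inventory_hostname }}. Make sure the ceph container image is available on the undercloud.
  when: ceph_version.stdout | trim | length == 0
  vars:
    ceph_container: "{{ tripleo_cephadm_container_ns }}/{{ tripleo_cephadm_container_image }}:{{ tripleo_cephadm_container_tag }}"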
Also, the docs impact of this bug is that the documentation referenced above should point out that when Satellite is used to host the container image, a copy of the ceph container still needs to be on the undercloud.
[1] https://groups.google.com/a/redhat.com/g/rhos-tech/c/NaAG9OKL4f4/m/G-QK8mDOBwAJ
(rhos-tech@ thread with subject "RHOSP 16.2 to 17.1 upgrade with Ceph deployed through Director failing")