
Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: rhos-16.2.z, rhos-16.2.async1
Affects Version/s: None
Component/s: ansible-collections-openstack
Labels:
- Triaged

Blocked:
False
Blocked Reason:

None
Ready:
False
Bugzilla Bug:
RHBZ: 2314933
Dev Approval:
Committed
PM Approval:
Committed
QE Approval:
Committed
Regression:
None
Intelligence Requested:
Market:
Errata Link:
https://errata.engineering.redhat.com/advisory/139413

Severity:
Important

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

+++ This bug was initially created as a clone of Bug #2295401 +++

Description of problem:
This happens because earlier releases created the role 'Member', while OSP 17.1 expects the role 'member' in lowercase. The two names conflict.

Version-Release number of selected component (if applicable):
16.2 > 17.1

How reproducible:
100% if you have OSP since 13 and have upgraded

Steps to Reproduce:
1. # openstack overcloud upgrade run --yes --stack openstack05 --debug --limit allovercloud,undercloud --playbook all

Actual results:
2024-06-15 16:45:36.450387 | e0071b6a-fbb0-5953-39aa-00000008e05e | FATAL | Check Keystone role status | undercloud | item=member | error={"ansible_job_id": "765084055976.244733", "ansible_loop_var": "tripleo_keystone_resources_role_async_result_item", "attempts": 1, "changed": false, "extra_data": {"data": null, "details": "Conflict occurred attempting to store role - Duplicate entry found with name member.", "response": "{\"error\":{\"code\":409,\"message\":\"Conflict occurred attempting to store role - Duplicate entry found with name member.\",\"title\":\"Conflict\"}}\n"}, "finished": 1, "msg": "Failed to create role member: Client Error for url: https://oscar23.tc.lab.corp:13000/v3/roles, Conflict occurred attempting to store role - Duplicate entry found with name member.", "tripleo_keystone_resources_role_async_result_item": {"ansible_job_id": "765084055976.244733", "ansible_loop_var": "tripleo_keystone_resources_role", "changed": true, "failed": false, "finished": 0, "results_file": "/root/.ansible_async/765084055976.244733", "started": 1, "tripleo_keystone_resources_role": "member"}}

Expected results:
No errors

Additional info:
Can be fixed with:

openstack role delete Member

— Additional comment from Grzegorz Grasza on 2024-08-26 13:33:50 UTC —

I'm closing this, since we are very close to the last 17.1 release and I won't be able to have a complete solution on time.

The quick workaround is running:

openstack role delete Member

The issue with running this indiscriminately during an upgrade to 17.1 is that we don't know if the role was in any way modified between the upgrades. It might be best to leave running this command to the end user, in the hope that they know what they are doing (i.e. that they didn't do any changes to the Member role).
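Before deleting anything, it may help to confirm that an existing role really differs from the expected name only by case. The following is a minimal sketch, assuming the role names have already been collected (for example from the output of `openstack role list`). The helper name `case_variants` is hypothetical and not part of any OpenStack tool.

```python
def case_variants(existing_names, wanted):
    """Return existing names that equal `wanted` only when case is ignored."""
    key = wanted.casefold()
    return [
        name for name in existing_names
        if name != wanted and name.casefold() == key
    ]


if __name__ == "__main__":
    # Hypothetical role list from an environment first deployed on OSP 13.
    roles = ["admin", "Member", "ResellerAdmin", "swiftoperator"]
    print(case_variants(roles, "member"))
```

A non-empty result only shows that a name clash exists. It says nothing about whether the role was modified or is referenced elsewhere, which is the concern raised above.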

— Additional comment from Kenny Tordeurs on 2024-08-26 13:53:24 UTC —

(In reply to Grzegorz Grasza from comment #1)
> I'm closing this, since we are very close to the last 17.1 release and I
> won't be able to have a complete solution on time.
>
> The quick workaround is running:
>
> # openstack role delete Member
>
> The issue with running this indiscriminately during an upgrade to 17.1 is
> that we don't know if the role was in any way modified between the upgrades.
> It might be best to leave running this command to the end user, in the hope
> that they know what they are doing (i.e. that they didn't do any changes to
> the Member role).

Can we add this to known issues into the documentation?
Thank you

— Additional comment from Alex Stupnikov on 2024-09-04 11:25:39 UTC —

Workaround likely triggered bug #2309586 in Heat

— Additional comment from Kenny Tordeurs on 2024-09-05 08:57:33 UTC —

Raising urgency because of 03916927

Proximus is planning the upgrade of their biggest production cluster on Friday 6 September in the afternoon.

We recently hit an issue that turned out to be caused by the actions taken in https://bugzilla.redhat.com/show_bug.cgi?id=2295401. We absolutely need to avoid that issue, so we really need a hotfix or workaround to avoid this problem during the large production cluster upgrade.

If we cannot provide a proper fix to avoid this problem, we will have to put the upgrade on hold. This will have a significant impact on the customer: upper management is following up closely, and they cannot plan another maintenance window this year, which means everything would have to be put on hold until 2025.

The deadline cannot be missed, as it will impact not only vendors such as Nokia but also the voice application that is used across Belgium.
This issue has become particularly prominent at the customer management level. Red Hat is already in scope, as the customer has committed to spending many millions of dollars on Azure and is considering migrating Red Hat workloads to Azure AKS. Belgacom is a major customer.

— Additional comment from David Hill on 2024-09-16 18:43:32 UTC —

The workaround "works" in the sense that we fail later on now with:
~~~
2024-09-08 00:43:35.948587 | 9440c985-b930-3826-01ed-00000000262d | FATAL | Check Keystone user assignment to roles status | undercloud | item=swift | error={"ansible_job_id": "98192925416.91408", "ansible_loop_var": "tripleo_keystone_resources_user_role_async_result_item", "attempts": 2, "changed": false, "finished": 1, "msg": "Role member is not valid", "results_file": "/root/.ansible_async/98192925416.91408", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": [], "tripleo_keystone_resources_user_role_async_result_item": {"ansible_job_id": "98192925416.91408", "ansible_loop_var": "tripleo_keystone_resources_data_user", "changed": true, "failed": 0, "finished": 0, "results_file": "/root/.ansible_async/98192925416.91408", "started": 1, "tripleo_keystone_resources_data_user": {"swift": {"project": "service"}}}}
~~~

Perhaps the customer should just tweak the following:
~~~
keystone_resources:
  swift:
    endpoints:
      public: {get_param: [EndpointMap, CephRgwPublic, uri]}
      internal: {get_param: [EndpointMap, CephRgwInternal, uri]}
      admin: {get_param: [EndpointMap, CephRgwAdmin, uri]}
    users:
      swift:
        password: {get_param: SwiftPassword}
        roles:
          - admin
          - member
    region: {get_param: KeystoneRegion}
    service: 'object-store'
    roles:
      - member
      - ResellerAdmin
      - swiftoperator
~~~
to:
~~~
keystone_resources:
  swift:
    endpoints:
      public: {get_param: [EndpointMap, CephRgwPublic, uri]}
      internal: {get_param: [EndpointMap, CephRgwInternal, uri]}
      admin: {get_param: [EndpointMap, CephRgwAdmin, uri]}
    users:
      swift:
        password: {get_param: SwiftPassword}
        roles:
          - admin
          - member
    region: {get_param: KeystoneRegion}
    service: 'object-store'
    roles:
      - Member    # <===================================================================
      - ResellerAdmin
      - swiftoperator
~~~

Unless we can make the ansible module case insensitive? I've tried reproducing this with the CLI and I can't; I can only reproduce this issue with ansible. When I try to assign Admin instead of admin to a user in my 17.1 lab, it works, but somehow ansible just doesn't like this.

— Additional comment from David Hill on 2024-09-16 18:53:56 UTC —

~~~
r = self.conn.identity.find_role(role)
if r is None:
    self.fail_json(msg="Role %s is not valid" % role)
filters['role'] = r['id']
~~~

Why is this case sensitive? The issue we have is that somehow ansible, or the openstack plugin called by ansible, is case sensitive when it should NOT be, because keystone is NOT case sensitive.
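The symptom can be illustrated with a small stand-in that does not call the SDK at all. This is a sketch under stated assumptions: the role list, the IDs, and both helper functions are hypothetical, and the exact-match helper only mimics the behaviour observed above rather than reproducing the SDK's actual lookup code.

```python
def find_role_exact(roles, name):
    """Return the first role whose name matches exactly, else None."""
    return next((r for r in roles if r["name"] == name), None)


def find_role_ignoring_case(roles, name):
    """Return the first role whose name matches when case is ignored."""
    wanted = name.casefold()
    return next((r for r in roles if r["name"].casefold() == wanted), None)


# Hypothetical roles in an environment that still carries the old name.
roles = [
    {"id": "r1", "name": "admin"},
    {"id": "r2", "name": "Member"},
]

if find_role_exact(roles, "member") is None:
    # Same outcome as the "Role member is not valid" failure.
    print("Role member is not valid")

print(find_role_ignoring_case(roles, "member"))
```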

— Additional comment from David Hill on 2024-09-16 18:56:37 UTC —

Or maybe it's just this part that is case sensitive? That would probably be a keystone bug then.

~~~
def find_role(self, name_or_id, ignore_missing=True):
    """Find a single role

    :param name_or_id: The name or ID of a role.
    :param bool ignore_missing: When set to ``False``
        :class:`~openstack.exceptions.ResourceNotFound` will be
        raised when the role does not exist.
        When set to ``True``, None will be returned when
        attempting to find a nonexistent role.
    :returns: One :class:`~openstack.identity.v3.role.Role` or None
    """
    return self._find(_role.Role, name_or_id,
                      ignore_missing=ignore_missing)
~~~

— Additional comment from Denise Hughes on 2024-09-17 17:39:32 UTC —

TRAC team has reviewed and approved this blocker request https://issues.redhat.com/browse/OSP-32797

— Additional comment from Ade Lee on 2024-09-20 06:35:36 UTC —

Here are a couple notes to explain the suggested changes to solve this issue.
I'm adding them here because it's a little too much to add to a commit message.

The issue here is that the ansible module identity_role in ansible-collections-openstack
does not understand that keystone will treat the role_name in a case insensitive manner -
and will throw a 409 exception when trying to add two roles with different casing ("member" vs. "Member").

We need to fix the ansible module to treat role names in a case insensitive manner.

Now, the ansible module uses python calls in the openstacksdk to make calls to keystone. So, the question
becomes where to make the relevant changes. It turns out that we need to do it in both places.

Case-insensitivity is a hot mess in keystone. Essentially, whether case-insensitivity is enforced depends on
the back-end. [1] What this means is that it is very difficult to make any changes in the SDK to standardize
this behavior without changing the behavior for some backend upstream. This will need agreement and deprecation
etc.

Instead, what we can do is provide an option for the SDK to treat roles in a case insensitive manner.
The patch [2] does exactly that, allowing the SDK to treat the role name as case insensitive when the kwarg
case_insensitive=True is passed in. This parameter defaults to False so that we do not change the SDK
behavior by default, which means that this patch can be ported upstream.

Now, when we know that the keystone backend will treat the role name with case-insensitivity, we can
call the SDK with this kwarg set. We absolutely do know that this is the case for RHOS 17.1 customers, so
we can set this kwarg in 17.1 in ansible-collections-openstack. That's what patch [3] does.

Note that because the module checks for the existence of the role first, we should never run into the
409 Conflict exception scenario.
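The check-then-create flow described above can be sketched with a toy in-memory backend. Everything here is an assumption made for illustration: `FakeIdentity`, `Conflict`, and `ensure_role` are hypothetical names, the backend is hard-coded to enforce case-insensitive uniqueness, and only the `case_insensitive` keyword mirrors the kwarg described in this comment. This is not the code from patches [2] or [3].

```python
class Conflict(Exception):
    """Stand-in for an HTTP 409 response from the identity service."""


class FakeIdentity:
    """Toy role store that enforces case-insensitive name uniqueness."""

    def __init__(self, names=()):
        self._roles = {}  # casefolded name -> name as stored
        for name in names:
            self.create_role(name)

    def create_role(self, name):
        key = name.casefold()
        if key in self._roles:
            raise Conflict("Duplicate entry found with name %s." % name)
        self._roles[key] = name
        return name

    def find_role(self, name, case_insensitive=False):
        stored = self._roles.get(name.casefold())
        if stored is None:
            return None
        # Without the flag, only an exact spelling counts as a match.
        if case_insensitive or stored == name:
            return stored
        return None


def ensure_role(identity, name, case_insensitive=False):
    """Create the role only when the lookup does not find it."""
    found = identity.find_role(name, case_insensitive=case_insensitive)
    if found is not None:
        return found
    return identity.create_role(name)


if __name__ == "__main__":
    identity = FakeIdentity(["Member"])
    try:
        ensure_role(identity, "member")
    except Conflict as exc:
        print("without the flag:", exc)
    print("with the flag:", ensure_role(identity, "member", case_insensitive=True))
```

With the flag off, the lookup misses the existing role and the create call hits the conflict. With the flag on, the existing role is returned and no create call is made.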

There was some discussion about possibly changing the create_role logic in the SDK as in [4]. I think this
would be incorrect. The reason is because keystone will return a 409 even if the casing is not different.
That is, if a role "foobar" exists, a subsequent SDK call to create_role("foobar") will also generate a 409
from keystone (for any backend). Right now, the expectation is that if you use create_role, you should
either check for the role first (like the ansible module does) or catch the exception. If we change the logic
here, then we change that contract - and that's also something that needs upstream discussion/deprecation etc.

Fortunately, there is no need to do this.

[1] https://docs.openstack.org/keystone/wallaby/admin/case-insensitive.html
[2] https://code.engineering.redhat.com/gerrit/c/python-openstacksdk/+/453499
[3] https://code.engineering.redhat.com/gerrit/c/ansible-collections-openstack/+/453538
[4] https://code.engineering.redhat.com/gerrit/c/python-openstacksdk/+/453352/5/openstack/cloud/_identity.py

— Additional comment from Rabi Mishra on 2024-09-20 06:53:22 UTC —

> That is, if a role "foobar" exists, a subsequent SDK call to create_role("foobar") will also generate a 409
> from keystone (for any backend).

Right, but don't we fetch the existing role in that case and return it in [4], so that no new role would in effect be created?

> [2] https://code.engineering.redhat.com/gerrit/c/python-openstacksdk/+/453499

This needs to be proposed upstream, else you would have to carry it forward downstream like what was argued earlier for [4].

— Additional comment from Ade Lee on 2024-09-20 14:47:10 UTC —

Rabi,

The concern I have with [4], (other than the fact that it changes the contract between caller and SDK without upstream agreement)
is that it doesn't work well for case sensitive backends.

For a case-sensitive backend, consider the case where a role "Member" exists. A call to create_role("member") will succeed, and now we will have two roles
("Member" and "member"). A subsequent call to create_role("member") will result in a 409 exception.

With the new code in [4], we'd catch that exception and do a case-insensitive search for the role. Whether we end up returning "member" or "Member"
depends on the order of the search results - and may not be correct.
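The ordering problem can be shown with plain lists. This is a sketch only: the two listings are hypothetical orderings of the same case-sensitive backend, and `search_roles_ignoring_case` is an illustrative helper, not the code in [4].

```python
def search_roles_ignoring_case(role_names, wanted):
    """Return every stored name equal to `wanted` when case is ignored."""
    key = wanted.casefold()
    return [name for name in role_names if name.casefold() == key]


# The same two roles, as two hypothetical listings in different orders.
listing_a = ["Member", "member"]
listing_b = ["member", "Member"]

# Taking the first hit gives a different role for each ordering.
first_a = search_roles_ignoring_case(listing_a, "member")[0]
first_b = search_roles_ignoring_case(listing_b, "member")[0]
print(first_a, first_b)
```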

Agreed on proposing something similar to [2] upstream. Will work on that today or next week.

— Additional comment from Rabi Mishra on 2024-09-24 04:21:42 UTC —

> is that it doesn't work well for case sensitive backends.

Unless we propose the change upstream, this argument is invalid. It seems we moved this to MODIFIED without an upstream proposal for openstacksdk, which does not look right to me unless we're tracking it elsewhere.

— Additional comment from Mike Burns on 2024-09-25 13:23:43 UTC —

fast track exception+ for delivered hotfix

— Additional comment from Dave Wilde on 2024-09-25 18:26:51 UTC —

There is some confusion regarding this issue so I'm adding this note to hopefully clarify what is going on with the Duplicate Role issue affecting some customers. Currently there are two bugs, [1] and [2], that are open for this issue and there are a total of three patches: [3], [4], and [5] that will be released with 17.1.4 to resolve the issue going forward. There are also two customer cases [6] and [7] one of which is closed and the other open.

To address the customer concerns regarding the latest upgrade on their production environment we read this request:

>"My strong preference is that we stick with the new member definition all over the place, so our environments don't divert from each other.
>If we now just workaround this in code, the chance is also pretty high we will just be hit again by this issue in a subsequent upgrade.
>We only need to be 100% that fixing that trust is the only thing we need to do."
>>just to confirm my understanding, you want to go with the same fix you already applied on 6 environments, right?
>"Well that would be my preference if it doesn't have downside according to RH.
>Otherwise we end up with 2 different setups, which will just fail next time when we do a next upgrade.
>As Member is not something RH expects to be there."

The customer would like to perform the same procedure of removing the offending Member role and updating the trusts in the database manually, as described in this comment [8], so that their environments all match. This course of action does not require the hotfixes provided by the Security DFG. The Security DFG engineers did not perform or advise on this procedure, but based on the context of the ticket, the procedure has been performed on 6 environments already and presents a relatively low risk.

The fixes that will go into 17.1.4 should ensure that we do not encounter this issue going forward: [3] modifies the SDK so that it can ignore case for role names, and [4] uses that ability to have the openstack module ignore case. This allows us to revert, in [5], the partial fix we attempted in tripleo-ansible. The effect of running with these patches is that the existing Member role will not cause any issues, and there should be no disruption in the upgrade due to this issue. It will mean, though, that the configuration will be different from the previously upgraded deployments.

Hopefully this clarifies where we are with both the 17.1.4 fixes for the duplicate role member problem as well as the path forward for the customer. Please reach out to the Security DFG with any questions, comments, or concerns.

/Dave

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2295401
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2309586
[3] https://code.engineering.redhat.com/gerrit/c/python-openstacksdk/+/453546
[4] https://code.engineering.redhat.com/gerrit/c/ansible-collections-openstack/+/453547
[5] https://code.engineering.redhat.com/gerrit/c/tripleo-ansible/+/453563
[6] https://access.redhat.com/support/cases/#/case/03916927
[7] https://access.redhat.com/support/cases/#/case/03925954
[8] https://bugzilla.redhat.com/show_bug.cgi?id=2309586#c19

— Additional comment from Ade Lee on 2024-09-26 16:05:41 UTC —

As per @ktordeur, the customer would like to use these fixes in an upgrade that is occurring this weekend. They will test it out and if it doesn't work, they'll fall back to deleting the member role.
Note that these are the builds that are going into 17.1.4 and as such, they have not yet gone through our QE testing. We'd definitely like the customer feedback.

The builds have been completed, tagged as hotfix and signed. They can be retrieved at the following locations:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=3302229 (tripleo-ansible)
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=3302230 (python-openstacksdk)
https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=3302228 (ansible-collections-openstack)