Sub-task
Resolution: Done
sat-proton
Sprint 132, Sprint 133, Sprint 134, Sprint 135, Sprint 136, Sprint 137, Sprint 138, Sprint 139, Sprint 140, Sprint 141, Sprint 142
+++ This bug was initially created as a clone of Bug #2184151 +++
Description of problem:
When the katello:clean_backend_objects rake script is running on a system with >1500 candlepin consumers and a consumer updates its facts at the same time, the rake script can wrongly detect that a system is missing in candlepin, like:
Host 16080 hostname.example.com ff0a6edf-311c-4924-a7b9-b72707931c7b is partially missing subscription information. Un-registering
The reason is that katello, within the ::Katello::Resources::Candlepin::Consumer.all_uuids call, queries candlepin in pages:
2023-03-23 07:46:34,701 [thread=https-jsse-nio-127.0.0.1-23443-exec-9] [req=849f90be-9f62-40e9-9541-341e3055916b, org=, csid=] INFO org.candlepin.servlet.filter.logging.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/?owner=MyOrganization&include=uuid&per_page=1500&page=1
2023-03-23 07:48:50,123 [thread=https-jsse-nio-127.0.0.1-23443-exec-75] [req=a5d01c04-bf19-45e2-a42a-9b23493598c3, org=, csid=] INFO org.candlepin.servlet.filter.logging.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/?owner=MyOrganization&include=uuid&per_page=1500&page=2
2023-03-23 07:50:21,081 [thread=https-jsse-nio-127.0.0.1-23443-exec-26] [req=40f37098-fbed-40a6-972e-f70510000ced, org=, csid=391be221-c85d-4224-953b-d7df9143650b] INFO org.candlepin.servlet.filter.logging.LoggingFilter - Request: verb=PUT, uri=/candlepin/consumers/c8a22c7e-d959-42ef-b6a2-1eecbffd2459
2023-03-23 07:50:41,851 [thread=https-jsse-nio-127.0.0.1-23443-exec-20] [req=944317ac-69f3-45c5-9467-92b78734b276, org=, csid=e68d1403-cc68-4c79-a6a3-b8bd329fe9da] INFO org.candlepin.servlet.filter.logging.LoggingFilter - Request: verb=PUT, uri=/candlepin/consumers/8484b642-fbb1-43ff-800a-26e792c37dfc
2023-03-23 07:51:11,228 [thread=https-jsse-nio-127.0.0.1-23443-exec-63] [req=6b6a49be-eaff-4a16-9626-2ecc58bfd97e, org=, csid=] INFO org.candlepin.servlet.filter.logging.LoggingFilter - Request: verb=GET, uri=/candlepin/consumers/?owner=MyOrganization&include=uuid&per_page=1500&page=3
The problem occurs when a client updates consumer facts between the "get me next 1.5k consumers" requests - like the PUT requests above. This shuffles the ordering of the consumers in a response, causing some UUIDs to be skipped and others to appear twice across the overall all_uuids call (we can easily demonstrate this).
Those skipped consumers are then wrongly marked as "partially missing subscription information".
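For illustration, here is a minimal toy sketch in Ruby (made-up data, not Katello code) of the race: when rows are implicitly ordered by their 'updated' timestamp, a facts update between two page fetches moves a row across a page boundary, so one UUID is skipped and another is returned twice:
~~~
# Toy model: 20 consumers, ordered newest-first by 'updated', fetched in pages of 7.
consumers = (1..20).map { |i| { uuid: format("uuid-%02d", i), updated: i } }

page_size = 7
fetch_page = ->(page) do
  consumers.sort_by { |c| -c[:updated] }
           .slice((page - 1) * page_size, page_size) || []
end

seen = fetch_page.call(1).map { |c| c[:uuid] }
# A PUT updating facts bumps 'updated', moving that row to the front of the list:
consumers.find { |c| c[:uuid] == "uuid-03" }[:updated] = 99
seen += fetch_page.call(2).map { |c| c[:uuid] }
seen += fetch_page.call(3).map { |c| c[:uuid] }

all = consumers.map { |c| c[:uuid] }
puts "skipped:    #{(all - seen).inspect}"                             # ["uuid-03"]
puts "duplicated: #{seen.tally.select { |_, n| n > 1 }.keys.inspect}"  # ["uuid-14"]
~~~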
Version-Release number of selected component (if applicable):
Sat 6.12.2 / any older as well
How reproducible:
very reliably
Steps to Reproduce:
1. Have a Satellite with >1.5k Content Hosts
2. Invoke "foreman-rake katello:clean_backend_objects"
3. While it is running, run "subscription-manager facts --update" on a few hosts
4. Check the rake script output
Actual results:
4. With some (high) probability, the rake script will wrongly detect a missing candlepin consumer
Expected results:
4. No such false alarms
Additional info:
We can easily demonstrate the "PUTting consumer facts shuffles consumer ordering" effect on a system with >10 hosts as well. Just enable the candlepin API from the cmdline per https://access.redhat.com/solutions/2955931 , and replace the "get me clients in 1.5k batches" requests with batches of, say, 7:
for i in $(seq 1 4); do curl -sk -H "Content-Type:application/json" -u admin:admin "https://localhost:23443/candlepin/consumers/?owner=YOUR_ORGANIZATION&include=uuid&per_page=7&page=${i}" | json_reformat; sleep 1; echo "uuid"; done | grep uuid
Normally, it will print something like:
"uuid": "08513930-d0f7-4882-8129-77673dc721c2"
"uuid": "cbc30fb8-9939-467d-9273-052eaf791eb4"
"uuid": "7ba62210-3c30-4a4c-ae0a-fb31df3bc33d"
"uuid": "f83e11cc-c0c3-479d-83ae-573fad2090bc"
"uuid": "9c8bbc55-ac77-4e2c-853d-a2a6fc24862a"
"uuid": "f72296a5-0b56-48b3-8712-f2ff4813e778"
"uuid": "762a8c10-2793-489e-a53f-a3ffb2e3acb0"
uuid
"uuid": "aac2882d-0bf6-4336-9d41-b30c94c452de"
"uuid": "60787875-b416-4aaf-bf1e-bd365639f4a8"
"uuid": "adb8a530-03a9-4c90-95b6-845b742adeb4"
"uuid": "f8bffcba-cd62-4a0f-b86d-f02a2f27e14f"
"uuid": "b8787ab1-58aa-4a14-b1f6-68eae436954e"
"uuid": "8f7d4e4e-6b7f-4498-9019-30dad9b8a62a"
"uuid": "f8677b63-d1b3-4495-a86e-00d415995bfd"
uuid
"uuid": "62d3e041-00c2-4adf-8135-055815954203"
"uuid": "f72296a5-0b56-48b3-8712-f2ff4813e778"
"uuid": "0e7893a7-fc9c-4b5b-97d7-b44c8b120071"
"uuid": "8f7d4e4e-6b7f-4498-9019-30dad9b8a62a"
"uuid": "663af0ad-4206-4294-a947-83fd5cfa612d"
"uuid": "adb8a530-03a9-4c90-95b6-845b742adeb4"
"uuid": "f8677b63-d1b3-4495-a86e-00d415995bfd"
uuid
"uuid": "7ba62210-3c30-4a4c-ae0a-fb31df3bc33d"
"uuid": "b8787ab1-58aa-4a14-b1f6-68eae436954e"
"uuid": "aac2882d-0bf6-4336-9d41-b30c94c452de"
"uuid": "82480bc1-3737-429c-8947-d1de745e9e27"
"uuid": "762a8c10-2793-489e-a53f-a3ffb2e3acb0"
uuid
Every run prints the same sequence. Now, while the script is executing (play with the sleep time there), run "subscription-manager facts --update" on a client. The output will then be e.g.:
"uuid": "08513930-d0f7-4882-8129-77673dc721c2"
"uuid": "cbc30fb8-9939-467d-9273-052eaf791eb4"
"uuid": "7ba62210-3c30-4a4c-ae0a-fb31df3bc33d"
"uuid": "f83e11cc-c0c3-479d-83ae-573fad2090bc"
"uuid": "9c8bbc55-ac77-4e2c-853d-a2a6fc24862a"
"uuid": "f72296a5-0b56-48b3-8712-f2ff4813e778"
"uuid": "762a8c10-2793-489e-a53f-a3ffb2e3acb0"
uuid
"uuid": "aac2882d-0bf6-4336-9d41-b30c94c452de"
"uuid": "60787875-b416-4aaf-bf1e-bd365639f4a8"
"uuid": "adb8a530-03a9-4c90-95b6-845b742adeb4"
"uuid": "f8bffcba-cd62-4a0f-b86d-f02a2f27e14f"
"uuid": "b8787ab1-58aa-4a14-b1f6-68eae436954e"
"uuid": "8f7d4e4e-6b7f-4498-9019-30dad9b8a62a"
"uuid": "f8677b63-d1b3-4495-a86e-00d415995bfd"
uuid
"uuid": "f83e11cc-c0c3-479d-83ae-573fad2090bc"
"uuid": "f72296a5-0b56-48b3-8712-f2ff4813e778"
"uuid": "0e7893a7-fc9c-4b5b-97d7-b44c8b120071"
"uuid": "8f7d4e4e-6b7f-4498-9019-30dad9b8a62a"
"uuid": "663af0ad-4206-4294-a947-83fd5cfa612d"
"uuid": "adb8a530-03a9-4c90-95b6-845b742adeb4"
"uuid": "f8677b63-d1b3-4495-a86e-00d415995bfd"
uuid
"uuid": "7ba62210-3c30-4a4c-ae0a-fb31df3bc33d"
"uuid": "b8787ab1-58aa-4a14-b1f6-68eae436954e"
"uuid": "aac2882d-0bf6-4336-9d41-b30c94c452de"
"uuid": "82480bc1-3737-429c-8947-d1de745e9e27"
"uuid": "762a8c10-2793-489e-a53f-a3ffb2e3acb0"
uuid
Note that 62d3e041-.. has been replaced by a second occurrence of f83e11cc-.., so the rake script never receives UUID 62d3e041 during this run.
— Additional comment on 2023-04-05T21:21:40Z
Nikos,
Should component be changed to Candlepin, since that's where the ordering of consumers comes from?
— Additional comment on 2023-04-11T11:42:58Z
(In reply to Jeremy Lenz from comment #1)
> Nikos,
>
> Should component be changed to Candlepin, since that's where the ordering of
> consumers comes from?
Hi Jeremy, unfortunately, the skipped/duplicate item problem is fundamental to offset-based pagination (which candlepin uses). The 'proper' way to fix this would be to move to cursor-based pagination, which does not have these problems, but that would be a big undertaking requiring a proper design, etc. Feel free to create an RFE for it and we'll try to get to it when we can.
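To make the distinction concrete, here is a rough keyset (cursor) pagination sketch in Ruby — the fetch lambda is a hypothetical stand-in, not the candlepin API. Because each page is anchored on the last key already returned rather than on a row offset, a row whose sort key changes mid-scan can no longer be skipped or duplicated:
~~~
# Conceptually: SELECT uuid WHERE uuid > :last ORDER BY uuid LIMIT :batch_size
def all_uuids_keyset(fetch_batch, batch_size: 1500)
  uuids = []
  last  = nil
  loop do
    batch = fetch_batch.call(after: last, limit: batch_size)
    break if batch.empty?
    uuids.concat(batch)
    last = batch.last   # the cursor: the next page starts strictly after this key
  end
  uuids
end

# Tiny in-memory demo of the helper:
data  = Array.new(10) { |i| format("uuid-%02d", i) }
fetch = ->(after:, limit:) { data.select { |u| after.nil? || u > after }.first(limit) }
p all_uuids_keyset(fetch, batch_size: 3) == data   # => true
~~~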
For a more immediate resolution, there may be a way to fix the problem, or at least make it less likely, on your end by adjusting the query to use order_by, or by filtering the results it will return. I don't understand exactly what the clean_backend_objects script is trying to do, but I'll try to help.
Some questions/suggestions that may help:
- I suspect that because order_by is not used on the query (which would tell candlepin which consumer field to use for ordering the results), the 'updated' date is used by default. The PUT request that updates facts also updates the 'updated' date, making that row 'jump up/down' in the list of rows in the db, which future pagination requests will then just not find. If you set &order_by=<field> for a field that does not change frequently (or at all; perhaps uuid?), this would help. However, I don't think this will be foolproof, because even though you avoid this happening for consumers that just got updated, you'll still hit it for consumers that just got created whose uuid happens to be ordered earlier in the list.
- The root of the problem is that you are using an open-ended query, so as long as that does not change you'll always have these problems. Does the script really need to fetch all consumers in the org? If you already know the list of suspected uuids you want to check for existence in candlepin, you can pass them in using the &uuid= query param, and this problem won't happen.
- If that doesn't work and you still need to fetch all consumers in the org, you could do something like this (sketched after this list):
- Use pagination to fetch all consumers of the org.
- Once you're done, and you see that a handful of consumers are 'missing' from the fetched results, double-check that they don't exist by querying for this list of uuids specifically (again, by specifying the &uuid= query param that accepts a list of uuids).
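A minimal Ruby sketch of that double-check (get_uuids here is a hypothetical helper wrapping a GET on /candlepin/consumers/ with the &uuid= param; not real Katello code):
~~~
require 'set'

# Returns only the uuids that are still missing after a second, targeted query.
def confirmed_missing(katello_uuids, fetched_uuids, get_uuids)
  suspects = katello_uuids.to_set - fetched_uuids.to_set
  return [] if suspects.empty?
  # Anything candlepin still returns here was only a pagination artifact,
  # not a genuinely orphaned consumer.
  still_present = get_uuids.call(suspects.to_a).to_set
  (suspects - still_present).to_a
end
~~~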
I hope this helps!
— Additional comment on 2023-04-11T12:15:18Z
Nikos,
Thanks so much for the thoughtful and thorough response! Based on this I will keep the component as Hosts - Content as it seems feasible to use one of those workarounds.
— Additional comment on 2023-04-29T00:04:10Z
Adding the sort_by parameter to the candlepin API request fixed this issue. We also decreased the per_page API attribute (bulk_load_size) from 1500 to 125, which reduced the script's runtime roughly 8x (from 25 minutes to 3 minutes):
~~~
https://github.com/Katello/katello/compare/master...math3matical:katello:bug2184151
~~~
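The gist of the linked patch, paraphrased as a sketch (see the compare URL above for the actual diff; get_consumer_page below is a hypothetical stand-in for the Katello Candlepin resource call, and the option names mirror the curl examples in this bug rather than verified Katello internals):
~~~
def all_uuids(owner_key)
  uuids = []
  page  = 0
  loop do
    page += 1
    batch = get_consumer_page(owner: owner_key, include: 'uuid',
                              sort_by: 'uuid',  # stable ordering: fact updates no longer move rows
                              per_page: 125,    # bulk_load_size dropped from 1500; also much faster
                              page: page)
    break if batch.empty?
    uuids.concat(batch.map { |c| c['uuid'] })
  end
  uuids
end
~~~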
The customer had 8 orgs, with a total of 17,368 hosts. The host count per org is:
~~~
641
9859
129
200
260
115
6065
99
~~~
The issue only existed with the org with count 6,065, not the one with 9,859. Every time the katello clean_backend_objects script was run, it tried to remove about 1000 hosts from the org with 6,065 (with or without maintenance mode enabled).
Testing the candlepin API directly with per_page values of 125, 250, 500, 1000, 1500, 2000, and 4000, we see the number of host uuids retrieved per second drop drastically as the "per_page" value increases:
~~~
Per page is 125
real 0m1.536s <== 81 per second
Per page is 250
real 0m4.007s <== 62 per second
Per page is 500
real 0m14.050s <== 36 per second
Per page is 1000
real 0m56.440s <== 18 per second
Per page is 1500
real 2m15.842s <== 11 per second
Per page is 2000
real 4m11.618s <== 8 per second
Per page is 4000
real 19m38.112s <== 3 per second
~~~
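A sketch of how such a measurement can be reproduced (assumptions: the candlepin API is enabled per the KCS linked earlier, admin:admin credentials, and YOUR_ORGANIZATION is a placeholder; timings will obviously differ per deployment):
~~~
require 'benchmark'

[125, 250, 500, 1000, 1500, 2000, 4000].each do |per_page|
  secs = Benchmark.realtime do
    # Time a single page fetch at this page size, discarding the body.
    system("curl -sk -u admin:admin " \
           "'https://localhost:23443/candlepin/consumers/" \
           "?owner=YOUR_ORGANIZATION&include=uuid&per_page=#{per_page}&page=1' " \
           "> /dev/null")
  end
  puts format('per_page=%-5d %7.1fs  => %d uuids/s', per_page, secs, per_page / secs)
end
~~~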
— Additional comment on 2023-08-09T14:56:58Z
One of our biggest Satellite EMEA customers (Belastingdienst) encountered this issue during a Satellite 6.11 --> 6.12 upgrade.
The attached KCS [0] suggests running clean_backend_objects in maintenance mode as a workaround. That, however, doesn't always seem to help, as both Belastingdienst (case #03563869) and Aetna Inc. (case #03480134) hit this issue during upgrade.
As you can see below, the clean_backend_objects script is called during upgrade after maintenance mode is enabled, and maintenance mode is lifted only after the rake script finishes:
- /var/log/foreman-maintain/foreman-maintain.log
~~~
I, 2023-07-12 18:37:50+0200 #3818459 INFO -- : --- Execution step 'Add maintenance_mode tables/chain to nftables/iptables' [enable-maintenance-mode] started ---
I, 2023-07-12 18:37:50+0200 #3818459 INFO -- : --- Execution step 'Add maintenance_mode tables/chain to nftables/iptables' finished ---
...
...
...
I, 2023-07-12 18:59:05+0200 #3818459 INFO -- : --- Execution step 'Procedures::Installer::Upgrade' finished ---
I, 2023-07-12 18:59:05+0200 #3818459 INFO -- : --- Execution step 'Execute upgrade:run rake task' [installer-upgrade-rake-task] started ---
D, 2023-07-12 18:59:05+0200 #3818459 DEBUG -- : Running command foreman-rake upgrade:run with stdin nil
D, 2023-07-12 19:40:10+0200 #3818459 DEBUG -- : output of the command:
=============================================
Upgrade Step 1/3: katello:correct_repositories. This may take a long while.
Processing Repository 1/5870: Red Hat Enterprise Linux 7 Server RPMs x86_64 7Server (9)
...
...
...
Processing Repository 5869/5870: mon-tooling-RHEL6-agent (30522)
Processing Repository 5870/5870: mon-tooling-RHEL9-agent (30523)
=============================================
Upgrade Step 2/3: katello:clean_backend_objects. This may take a long while.
Host 59121 onap0079.ont.belastingdienst.nl 6ffc6b97-38b2-44d7-b899-fe8f357c422f is partially missing subscription information. Un-registering
Failed to delete host#<RuntimeError: The Dynflow world was not initialized yet. If your plugin uses it, make sure to call Rails.application.dynflow.require! in some initializer>Host 32582 ac7p1055.acc.belastingdienst.nl af9b5da0-bcf9-477c-b3de-8382e0ecb4c2 is partially missing subscription information. Un-registering
Failed to delete host#<RuntimeError: The Dynflow world was not initialized yet. If your plugin uses it, make sure to call Rails.application.dynflow.require! in some initializer>Host 66033 ac1p6070.acc.belastingdienst.nl c1391997-dc6d-4f82-b73d-96ace4aa5f5e is partially missing subscription information. Un-registering
...
...
...
I, 2023-07-12 19:40:21+0200 #3818459 INFO -- : --- Execution step 'Remove maintenance mode table/chain from nftables/iptables' [disable-maintenance-mode] started ---
I, 2023-07-12 19:40:21+0200 #3818459 INFO -- : --- Execution step 'Remove maintenance mode table/chain from nftables/iptables' finished ---
~~~
In case #03563869, clean_backend_objects deleted/un-registered 800+ hosts. That's the customer's estimate; the production.log actually suggests more. It's full of destroy events such as:
- production.log
~~~
2023-07-12T19:40:05 [I|aud|] Host::Base (72620) destroy event on name awsoel19191.acc.belastingdienst.nl
2023-07-12T19:40:06 [I|aud|] Host::Base (22389) destroy event on name ppascpsl05.prod.belastingdienst.nl
2023-07-12T19:40:06 [I|aud|] Host::Base (72938) destroy event on name pwivadmbl19.prod.belastingdienst.nl
2023-07-12T19:40:06 [I|aud|] Host::Base (76587) destroy event on name owsoel30734.ont.belastingdienst.nl
2023-07-12T19:40:06 [I|aud|] Host::Base (35614) destroy event on name pmbskubl01.prod.belastingdienst.nl
2023-07-12T19:40:06 [I|aud|] Host::Base (72621) destroy event on name awivadmbl108.acc.belastingdienst.nl
2023-07-12T19:40:07 [I|aud|] Host::Base (71840) destroy event on name oqsoel13292.ont.belastingdienst.nl
2023-07-12T19:40:07 [I|aud|] Host::Base (72937) destroy event on name pwivadmbl15.prod.belastingdienst.nl
2023-07-12T19:40:07 [I|aud|] Host::Base (37351) destroy event on name pqeisdevl04.prod.belastingdienst.nl
2023-07-12T19:40:07 [I|aud|] Host::Base (33238) destroy event on name pnetnmol02.prod.bna.belastingdienst.nl
2023-07-12T19:40:08 [I|aud|] Host::Base (37360) destroy event on name pweisjobl01.prod.belastingdienst.nl
2023-07-12T19:40:08 [I|aud|] Host::Base (34839) destroy event on name pooscisl16.ont.belastingdienst.nl
2023-07-12T19:40:08 [I|aud|] Host::Base (37353) destroy event on name pqeisdevl07.prod.belastingdienst.nl
~~~
Link to production.log:
https://access.redhat.com/support/cases/#/case/03563869/discussion?attachmentId=a096R000037XYqIQAW
Link to foreman-maintain.log:
https://access.redhat.com/support/cases/#/case/03563869/discussion?attachmentId=a096R000037LPVaQAO
[0] https://access.redhat.com/solutions/7005998
— Additional comment on 2024-03-08T18:28:27Z
Bryan,
Any chance we can get this bug addressed in a current release? The patch I wrote is very simple and not only fixes the initial issue of removing hosts but also speeds up the process significantly.
The customer I wrote the patch for continues to hit this issue every time they upgrade, and the file I wrote the patch for gets overwritten.
Again, this always happens to the customer, with or without maintenance mode enabled. The issue for this customer is directly related to how Candlepin retrieves the uuids. It doesn't retrieve them in order, which creates a discrepancy of about 1000 hosts between katello and candlepin, so Satellite removes 1000 hosts every time (and it isn't always the same hosts).
Kind regards,
— Additional comment on 2024-03-25T13:51:47Z
Aligning to 6.15.0 based upon offline discussion.
— Additional comment on 2024-03-27T09:01:26Z
Hi Michael and Jeremy,
Rather a double-check confirmation: the code fix in ':sort_by => "uuid"' will also help when virt-who updates its mapping between the paginated "get me consumers" requests, am I right?
(We had a very unlucky customer where a 6.12->6.13 upgrade ran the automatic clean_backend_objects script with COMMIT=true, and at exactly that "bad" time, virt-who sent its mapping - this ended up with 600 hosts unregistered during the upgrade.)
Also, when considering the impacts of the sorting: shouldn't it rather sort by the `created` timestamp (in ascending order)? What if a new host is registered just during the clean_backend_objects execution - its new uuid can break the linearity/ordering of the paginated "get me consumers" responses.
— Additional comment on 2024-03-27T13:13:46Z
> shouldn't it rather sort by the `created` timestamp (in ascending order)? What if a new host is registered just during the clean_backend_objects execution - its new uuid can break the linearity/ordering of the paginated "get me consumers" responses.
Hey Pavel
I thought about this as well. If we still have problems we can change it again to sort by 'created', but the reason we didn't do this from the start is that the "all_uuids" method we altered is currently restricted to the "uuid" field only and does not include the "created" field in the response.
This change should help when more consumers are added during the rake task, no matter if they're added by virt-who mapping or another method.
— Additional comment on 2024-03-27T16:07:40Z
Hi Pavel/Jeremy,
The issue I discovered was that candlepin was returning duplicate ids instead of a unique list. It happened with an organization with 6,065 host ids. Candlepin was returning 6,065 host uuids, but they were not all unique; there were duplicate hosts in its output. So candlepin really had roughly 5,000 unique ids, while katello had 6,065 unique ids. Since the counts differed, hosts got removed.
I don't think there is any linearity/ordering issue. The issue was just about getting a unique list of host ids. So if one is added out of order, it shouldn't matter (unless katello doesn't have that id in its list). I could have missed something, but that was my understanding of the code.
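A quick way to confirm this symptom from a foreman-rake console (a small sketch using the all_uuids call shown later in this bug) is to compare the fetched count against the unique count:
~~~
uuids = Katello::Resources::Candlepin::Consumer.all_uuids
dups  = uuids.tally.select { |_, n| n > 1 }   # uuid => occurrence count, repeats only
puts "fetched #{uuids.size}, unique #{uuids.uniq.size}, duplicated #{dups.size}"
~~~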
Kind regards,
— Additional comment on 2024-03-28T03:44:31Z
Verified in 6.15.0 snap 13
the "::Katllo::Resource::Candlepin::Consumer.get" call is now sorted by uuids
Steps to Reproduce:
1. Set up Satellite with the candlepin API enabled, as referenced here: https://access.redhat.com/solutions/2955931
2. Register >10 hosts on the system
3. Execute the script (play with the sleep time there) and run "subscription-manager facts --update" on a client at the same time
4. Check the sorting by calling ::Katello::Resources::Candlepin::Consumer.get and by running the script with sort_by=uuid
Both sorted by UUID:
foreman-rake console:
irb(main):002:0> Katello::Resources::Candlepin::Consumer.all_uuids
=> ["de205151-1e40-4d77-bcbd-896759f97c1f", "cab1b43d-cf94-462f-a6d6-aac8f55d5c25", "c0b18ed0-fff9-4462-9774-bf9a4fd14019", "bd451fda-cbd8-4628-a1ff-e29bd3f6805b", "b853fe9b-9ffb-42dc-abb7-1ac33107a496", "95439ac8-a54f-4810-815f-662c7432340a", "52048e3b-9b99-4e83-b913-f4f333347380", "4c48f841-b650-4fc5-9eb1-55cc36662612", "2930e3b9-ad72-4ffa-acdb-f5fdb152c6d9", "13d25474-5496-478e-b296-4ecd657a6c88"]
for i in $(seq 1 10); do curl -sk -H "Content-Type:application/json" -u admin:admin "https://localhost:23443/candlepin/consumers/?owner=lMwLfSW&sort_by=uuid&include=uuid&include=updated&include=created&per_page=1&page=${i}" | json_reformat; sleep 1; echo "uuid"; done
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
[
]
uuid
Additional info:
I have verified that the code change is in Satellite 6.15 but was not able to recreate the issue. After speaking with Development and Management, we have decided to verify this BZ since the issue was fixed, but to create an RFE for potentially modifying this to sort using the "created" field.
— Additional comment on 2024-03-28T18:27:23Z
Link to RFE https://bugzilla.redhat.com/show_bug.cgi?id=2272088
QE Tracker for https://issues.redhat.com/browse/SAT-24286
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2272113