  Satellite / SAT-39217

Busy Satellite with no candlepin restart hits "database is not accepting commands" fatal error


    • candlepin-4.6.4-1,candlepin-4.5.2-2,candlepin-4.4.23-1
    • sat-proton

      Description of problem:
      On a stable Satellite that runs for a long time with no restart of candlepin and serves a very large number of requests, PostgreSQL will eventually stop accepting new commands to all databases, as a protection against (transaction ID) wraparound data loss.

      The reason is that candlepin startup leaves the following transaction open forever:

      SELECT * FROM public.databasechangelog ORDER BY DATEEXECUTED ASC, ORDEREXECUTED ASC
      

      Postgres assigns a transaction ID (XID) to it, and the pool of transaction IDs is shared by all databases. Since the XID is a 32-bit integer with a range of roughly 2 billion transactions, generating many requests that each require a transaction eventually hits the transaction wraparound problem (see e.g. https://www.postgresql.fastware.com/blog/how-to-fix-transaction-wraparound-in-postgresql for a description of the problem).
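
      To watch how close each database is getting to the limit, the age of `datfrozenxid` per database can be checked with a standard postgres catalog query; a monitoring sketch, not part of the original reproducer:

      su - postgres -c "psql -c \"SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC\""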

      TL;DR: on a busy Satellite with no tomcat/candlepin restart, your postgres will eventually stop accepting commands to all databases. Candlepin, pulp and foreman are completely blocked.

      A preventive workaround is trivial: restart candlepin, say, once per week. But we should avoid the hung transaction against the candlepin DB, as the consequences of the issue are fatal.
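
      For illustration, the preventive workaround can be a weekly cron entry; a sketch that assumes candlepin runs inside the `tomcat` systemd service (as on a default Satellite install):

      # /etc/cron.d/restart-candlepin -- illustrative only
      # restart tomcat (and thus candlepin) every Sunday at 03:00 to close the hung transaction
      0 3 * * 0 root /usr/bin/systemctl restart tomcat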

      How reproducible:
      100% after some time
       

      Is this issue a regression from an earlier version:
      It seems to be a regression starting in Sat 6.15; 6.14 or older don't have the hung transaction.
       

      Steps to Reproduce:

      1. Have Satellite running and check open transactions:

      su - postgres -c "psql -c \"COPY (SELECT pid,usesysid,datname,client_port,backend_start,xact_start,state,backend_xid,backend_xmin,substring(query for 60) FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active')) TO STDOUT DELIMITER ';'\""
      

      2. Generate A LOT OF load and wait a few weeks. Or mimic the load by mimicking the DB transactions as follows: create a new DB with 50 tables and run dummy updates over the records there:

      su - postgres -c "psql -c \"CREATE DATABASE manytrans\""
      for i in $(seq 1 50); do echo "CREATE TABLE table${i} (count int);"; done | su - postgres -c "psql manytrans"
      for i in $(seq 1 50); do echo "INSERT INTO table${i} (count) VALUES (0);"; done | su - postgres -c "psql manytrans"
      

      Then start generating the transactions:

      for i in $(seq 1 50); do
               echo "( while true; do echo \"BEGIN; UPDATE table${i} SET count=0; END;\"; done | su - postgres -c \"psql manytrans\" > /dev/null; ) &"
      done > kick_off_transactions_in_loop.sh
      chmod a+x kick_off_transactions_in_loop.sh
      nohup ./kick_off_transactions_in_loop.sh
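
      To confirm the generator really burns through transaction IDs, you can sample the XID counter twice a few seconds apart (txid_current() is a stock postgres function; note that calling it consumes an XID itself):

      su - postgres -c "psql -c \"SELECT txid_current()\""; sleep 10; su - postgres -c "psql -c \"SELECT txid_current()\""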
      

      As this will produce many dead rows, run vacuum periodically to keep performance reasonable:

      while true; do
          su - postgres -c "psql manytrans -c \"VACUUM FULL\""
          sleep 60
      done
      

      Optionally, generate some load triggering transactions over candlepin/foreman/pulp - e.g. re-register systems, or directly run some dummy "BEGIN; UPDATE table SET ..; END" transactions against the foreman/candlepin/pulpcore DB, as in the sketch below.
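
      For example, a dummy-update loop analogous to the manytrans one, reusing the statement visible in the error log below (a sketch for a test box only - it rewrites real data and assumes the foreman DB is named `foreman`):

      ( while true; do echo "BEGIN; UPDATE katello_cdn_configurations SET custom_cdn_auth_enabled=FALSE; END;"; done | su - postgres -c "psql foreman" > /dev/null ) &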

      3. Monitor the open transactions with the command from step 1, and check the postgres logs.

      4. Depending on the HW (32 CPUs recommended) and how much you scale the test, wait a few days (it took me almost precisely 2 days).

      Confirmation that you are on a good track: after a few hours, you will start seeing warnings in the postgres logs:

      2025-07-25 16:27:35 CEST WARNING:  oldest xmin is far in the past
      

      Actual behavior:

      The open-transactions query will always show:

      70576;16384;candlepin;35454;2025-07-25 11:37:50.06021+02;2025-07-25 11:37:50.228839+02;idle in transaction;\N;320065724;SELECT * FROM public.databasechangelog ORDER BY DATEEXECUTED
      

      where `320065724` is the "blocking" transaction ID. Pending transactions will look like:

      342328;10;manytrans;-1;2025-07-25 18:02:08.210974+02;2025-07-27 12:59:55.39222+02;idle in transaction;2466414943;\N;UPDATE table7 SET count=0\;
      

      where `2466414943` is the currently active transaction ID - in this case higher by 2146349219 than the blocking transaction ID. At this point postgres/Satellite starts to be blocked, and you will start seeing postgres errors:

      2025-07-27 13:00:03 CEST HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
      	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
      2025-07-27 13:00:03 CEST WARNING:  database "postgres" must be vacuumed within 1000003 transactions
      2025-07-27 13:00:03 CEST HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
      	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
      2025-07-27 13:00:03 CEST ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
      2025-07-27 13:00:03 CEST HINT:  Stop the postmaster and vacuum that database in single-user mode.
      	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
      2025-07-27 13:00:03 CEST STATEMENT:  UPDATE katello_cdn_configurations SET custom_cdn_auth_enabled=FALSE;
      2025-07-27 13:00:03 CEST ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
      2025-07-27 13:00:03 CEST HINT:  Stop the postmaster and vacuum that database in single-user mode.
      

      or foreman errors:

      2025-07-27T13:00:17 [E|dyn|] PersistenceError in executor
      2025-07-27T13:00:17 [E|dyn|] caused by Sequel::DatabaseError: PG::ProgramLimitExceeded: ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
       | HINT:  Stop the postmaster and vacuum that database in single-user mode.
       | You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
       |  (Dynflow::Errors::PersistenceError)
       | /usr/share/gems/gems/sequel-5.83.1/lib/sequel/adapters/postgres.rb:171:in `exec'
      ..
      

      (or candlepin or pulp errors as well)
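
      Recovery on a system that already hit the error is essentially what the HINT lines above say: terminate the hung candlepin backend and run a database-wide vacuum. A rough sketch, taking the pid of the hung `databasechangelog` transaction from the step-1 query (70576 in the example above); if the hard limit was already reached, postgres may insist on single-user mode as per the HINT:

      su - postgres -c "psql -c \"SELECT pg_terminate_backend(70576)\""
      su - postgres -c "vacuumdb --all"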

      Expected behavior:
      No such pending transaction open for days. No such fatal postgres error at the end.

      Business Impact / Additional info:
      Business impact: The issue is a dirty ticking bomb. People don't check the xmin warnings, so they suddenly hit the halted postgres out of nowhere. And the busiest Satellites, where the impact on the customer environment is biggest, are the most likely to be affected.

      Hint to developers: DuckDuckGo suggests the transaction was seen in Liquibase, a Java library used by candlepin (the `databasechangelog` table in the hung query is Liquibase's own changelog table). We noticed the transaction on most 6.15+ Satellites but on no 6.14 or older ones. BUT the Liquibase library has the same version on 6.14 and 6.15 - both Satellites have `/var/lib/tomcat/webapps/candlepin/WEB-INF/lib/liquibase-core-4.23.2.redhat-00001.jar` - so..?
