  Satellite / SAT-39217

Busy Satellite with no candlepin restart hits "database is not accepting commands" fatal error


    • candlepin-4.6.4-1,candlepin-4.5.2-2,candlepin-4.4.23-1
    • sat-proton

      Description of problem:
      On a stable Satellite that runs for a long time with no restart of candlepin and serves a very large number of requests, PostgreSQL will eventually stop accepting new commands to all databases, as a protection against (transaction ID) wraparound data loss.

      The reason is that candlepin startup leaves the following transaction open forever:

      SELECT * FROM public.databasechangelog ORDER BY DATEEXECUTED ASC, ORDEREXECUTED ASC
      

      Postgres assigns a transaction ID (XID) to it, and the pool of transaction IDs is shared by all databases. Since the XID is a 32-bit integer with a range of roughly 2 billion transactions, generating many requests that each require a transaction eventually hits the transaction wraparound problem (see e.g. https://www.postgresql.fastware.com/blog/how-to-fix-transaction-wraparound-in-postgresql for a description of the problem).
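
      To watch how close each database is getting to the limit, the age of `datfrozenxid` per database can be checked with a standard postgres catalog query; a monitoring sketch, not part of the original reproducer:

      su - postgres -c "psql -c \"SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC\""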

      TL;DR: on a busy Satellite with no tomcat/candlepin restart, your postgres will eventually stop accepting commands to all databases. Candlepin, pulp and foreman are completely blocked.

      A preventive workaround is trivial: restart candlepin, say, once per week. But we should avoid the hung transaction against the candlepin DB, as the consequences of the issue are fatal.
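
      For illustration, the preventive workaround can be a weekly cron entry; a sketch that assumes candlepin runs inside the `tomcat` systemd service (as on a default Satellite install):

      # /etc/cron.d/restart-candlepin -- illustrative only
      # restart tomcat (and thus candlepin) every Sunday at 03:00 to close the hung transaction
      0 3 * * 0 root /usr/bin/systemctl restart tomcat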

      How reproducible:
      100% after some time
       

      Is this issue a regression from an earlier version:
      It seems to be a regression starting in Sat 6.15; 6.14 or older don't have the hung transaction.
       

      Steps to Reproduce:

      1. Have Satellite running and check open transactions:

      su - postgres -c "psql -c \"COPY (SELECT pid,usesysid,datname,client_port,backend_start,xact_start,state,backend_xid,backend_xmin,substring(query for 60) FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active')) TO STDOUT DELIMITER ';'\""
      

      2. Generate A LOT OF load and wait a few weeks. Or mimic the load by mimicking the DB transactions as follows: create a new DB with 50 tables and run dummy updates over the records there:

      su - postgres -c "psql -c \"CREATE DATABASE manytrans\""
      for i in $(seq 1 50); do echo "CREATE TABLE table${i} (count int);"; done | su - postgres -c "psql manytrans"
      for i in $(seq 1 50); do echo "INSERT INTO table${i} (count) VALUES (0);"; done | su - postgres -c "psql manytrans"
      

      Then start generating the transactions:

      for i in $(seq 1 50); do
               echo "( while true; do echo \"BEGIN; UPDATE table${i} SET count=0; END;\"; done | su - postgres -c \"psql manytrans\" > /dev/null; ) &"
      done > kick_off_transactions_in_loop.sh
      chmod a+x kick_off_transactions_in_loop.sh
      nohup ./kick_off_transactions_in_loop.sh
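
      To confirm the generator really burns through transaction IDs, you can sample the XID counter twice a few seconds apart (txid_current() is a stock postgres function; note that calling it consumes an XID itself):

      su - postgres -c "psql -c \"SELECT txid_current()\""; sleep 10; su - postgres -c "psql -c \"SELECT txid_current()\""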
      

      As this will produce many dead rows, run vacuum periodically to keep performance reasonable:

      while true; do
          su - postgres -c "psql manytrans -c \"VACUUM FULL\""
          sleep 60
      done
      

      Optionally, generate some load triggering transactions over candlepin/foreman/pulp - e.g. re-register systems, or directly run some dummy "BEGIN; UPDATE table SET ..; END" transactions against the foreman/candlepin/pulpcore DB, as in the sketch below.
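
      For example, a dummy-update loop analogous to the manytrans one, reusing the statement visible in the error log below (a sketch for a test box only - it rewrites real data and assumes the foreman DB is named `foreman`):

      ( while true; do echo "BEGIN; UPDATE katello_cdn_configurations SET custom_cdn_auth_enabled=FALSE; END;"; done | su - postgres -c "psql foreman" > /dev/null ) &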

      3. Monitor the open transactions with the command from step 1, and check the postgres logs.

      4. Depending on the HW (32 CPUs recommended) and how much you scale the test, wait a few days (it took me almost precisely 2 days).

      Confirmation that you are on a good track: after a few hours, you will start seeing warnings in the postgres logs:

      2025-07-25 16:27:35 CEST WARNING:  oldest xmin is far in the past
      

      Actual behavior:

      The open-transactions query will always show:

      70576;16384;candlepin;35454;2025-07-25 11:37:50.06021+02;2025-07-25 11:37:50.228839+02;idle in transaction;\N;320065724;SELECT * FROM public.databasechangelog ORDER BY DATEEXECUTED
      

      where `320065724` is the "blocking" transaction ID. Pending transactions will look like:

      342328;10;manytrans;-1;2025-07-25 18:02:08.210974+02;2025-07-27 12:59:55.39222+02;idle in transaction;2466414943;\N;UPDATE table7 SET count=0\;
      

      where `2466414943` is the currently active transaction ID - in this case higher by 2146349219 than the blocking transaction ID. At this point postgres/Satellite starts to be blocked, and you will start seeing postgres errors:

      2025-07-27 13:00:03 CEST HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
      	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
      2025-07-27 13:00:03 CEST WARNING:  database "postgres" must be vacuumed within 1000003 transactions
      2025-07-27 13:00:03 CEST HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
      	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
      2025-07-27 13:00:03 CEST ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
      2025-07-27 13:00:03 CEST HINT:  Stop the postmaster and vacuum that database in single-user mode.
      	You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
      2025-07-27 13:00:03 CEST STATEMENT:  UPDATE katello_cdn_configurations SET custom_cdn_auth_enabled=FALSE;
      2025-07-27 13:00:03 CEST ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
      2025-07-27 13:00:03 CEST HINT:  Stop the postmaster and vacuum that database in single-user mode.
      

      or foreman errors:

      2025-07-27T13:00:17 [E|dyn|] PersistenceError in executor
      2025-07-27T13:00:17 [E|dyn|] caused by Sequel::DatabaseError: PG::ProgramLimitExceeded: ERROR:  database is not accepting commands to avoid wraparound data loss in database "postgres"
       | HINT:  Stop the postmaster and vacuum that database in single-user mode.
       | You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
       |  (Dynflow::Errors::PersistenceError)
       | /usr/share/gems/gems/sequel-5.83.1/lib/sequel/adapters/postgres.rb:171:in `exec'
      ..
      

      (or candlepin or pulp errors as well)
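
      Recovery on a system that already hit the error is essentially what the HINT lines above say: terminate the hung candlepin backend and run a database-wide vacuum. A rough sketch, taking the pid of the hung `databasechangelog` transaction from the step-1 query (70576 in the example above); if the hard limit was already reached, postgres may insist on single-user mode as per the HINT:

      su - postgres -c "psql -c \"SELECT pg_terminate_backend(70576)\""
      su - postgres -c "vacuumdb --all"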

      Expected behavior:
      No such pending transaction open for days. No such fatal postgres error at the end.

      Business Impact / Additional info:
      Business impact: The issue is a dirty ticking bomb. People don't check the xmin warnings, so they suddenly hit the halted postgres out of nowhere. And the busiest Satellites, where the impact on the customer environment is biggest, are the most likely to be affected.

      Hint to developers: DuckDuckGo suggests the transaction was seen in Liquibase, a Java library used by candlepin (the `databasechangelog` table in the hung query is Liquibase's own changelog table). We noticed the transaction on most 6.15+ Satellites but on no 6.14 or older ones. BUT the Liquibase library has the same version on 6.14 and 6.15 - both Satellites have `/var/lib/tomcat/webapps/candlepin/WEB-INF/lib/liquibase-core-4.23.2.redhat-00001.jar` - so..?
