Uploaded image for project: 'Satellite'
  1. Satellite
  2. SAT-23581 CertificateCleanupJob fails with foreign key constraint violation on table cp_certificate
  3. SAT-23703

CertificateCleanupJob fails with foreign key constraint violation on table cp_certificate

XMLWordPrintable

    • Icon: Sub-task Sub-task
    • Resolution: Done
    • Icon: Undefined Undefined
    • None
    • None
    • None
    • None
    • 0
    • False
    • Hide

      None

      Show
      None
    • False
    • 0
    • Rocket

      +++ This bug was initially created as a clone of Bug #2229095 +++

      Description of problem:

      On a very loaded system "CertificateCleanupJob" job shows following error:

      2023-08-03 17:45:22,829 [thread=Thread-3 (ActiveMQ-client-global-threads)] [job=8a35b3cf89b9b7aa0189bb44de043b1b, job_key=CertificateCleanupJob, org=, csid=] WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: 0, SQLState: 23503
      2023-08-03 17:45:22,829 [thread=Thread-3 (ActiveMQ-client-global-threads)] [job=8a35b3cf89b9b7aa0189bb44de043b1b, job_key=CertificateCleanupJob, org=, csid=] ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - ERROR: update or delete on table "cp_cert_serial" violates foreign key constraint "fk16370c54b9410fc" on table "cp_certificate"
      Detail: Key (id)=(1298353109733755480) is still referenced from table "cp_certificate".
      2023-08-03 17:45:22,840 [thread=Thread-3 (ActiveMQ-client-global-threads)] [job=8a35b3cf89b9b7aa0189bb44de043b1b, job_key=CertificateCleanupJob, org=, csid=] ERROR org.candlepin.async.JobManager - Job "CertificateCleanupJob" failed in 20722815ms

      Version-Release number of selected component (if applicable):
      Satellite 6.11

      — Additional comment from on 2023-08-04T06:22:33Z

      Hello,

      I found a series of changes about a similar problem for different certificates:

      https://github.com/candlepin/candlepin/pull/3336/files
      https://bugzilla.redhat.com/show_bug.cgi?id=2084130

      I'm wondering why a similar change wasn't applied to https://github.com/candlepin/candlepin/blob/e4dbad0fe007af49baca0550bdd6828be1b7f0fd/src/main/java/org/candlepin/model/ContentAccessCertificateCurator.java#L142

      Anyway, the customer has a very big and dynamic environment with 20.000+ hosts. Massive re-registrations are often. Please advise what kind of information you need to proceed and the case owner will get it for you. The case is urgent and the customer is very hot. I would appreciate your assistance.

      Thanks


      ALEXEY MASOLOV (RHCA)
      PRINCIPAL SOFTWARE MAINTENANCE ENGINEER
      MANAGEMENT AND AUTOMATION (SATELLITE, CLOUDFORMS, ANSIBLE)
      RED HAT ASIA PACIFIC, PTY LTD
      Level 33, 1 Denison Street
      North Sydney, NSW 2060 Australia
      amasolov@redhat.com M: +61422561504

      — Additional comment from on 2023-08-04T09:34:51Z

      Hi Niko

      I asked the customer to run the following query to see how many affected subscription certs.
      ~~~
      su - postgres -c "psql candlepin -c \"select cs.*, c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't';\""
      ~~~

      The result is 525 rows:
      ~~~
      id | created | updated | expiration | revoked |
      certificate_id
      -----------------------------------------------------------------------------------------------------+-------
      --------------------------
      41809072875806621 | 2016-09-13 18:44:06.511+00 | 2016-09-13 18:44:06.511+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572466e1015724dd03180059
      3533872583381687016 | 2016-09-13 18:44:06.527+00 | 2016-09-13 18:44:06.527+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572466e1015724dd0389005c
      6506351485057036602 | 2016-09-13 18:44:06.539+00 | 2016-09-13 18:44:06.539+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572466e1015724dd03af005f
      6212415380020719205 | 2016-09-13 18:45:49.674+00 | 2016-09-13 18:45:49.674+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572466e1015724de95fc0103
      7586943406146698078 | 2016-09-13 18:45:49.662+00 | 2016-09-13 18:45:49.662+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572466e1015724de962a0105
      6170024571460468072 | 2016-09-13 18:45:49.645+00 | 2016-09-13 18:45:49.645+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572466e1015724de963b0107
      2202953283407406057 | 2016-09-23 19:18:14.88+00 | 2016-09-23 19:18:14.88+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572e6de60157587bdc9801d7
      386440596928235927 | 2016-09-23 19:18:14.914+00 | 2016-09-23 19:18:14.914+00 | 2017-09-20 03:59:59+00 | t | 8a35b3c
      f572e6de60157587bdcc801d9
      8508824282059251738 | 2016-09-23 19:18:14.905+00 | 2016-09-23 19:18:14.905+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572e6de60157587bdd1f01dc
      4545319024196052552 | 2016-09-23 19:18:14.888+00 | 2016-09-23 19:18:14.888+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572e6de60157587bdd2a01de
      2373124554098943983 | 2016-09-23 19:18:14.865+00 | 2016-09-23 19:18:14.865+00 | 2017-10-01 03:59:59+00 | t | 8a35b3c
      f572e6de60157587bdd4501e1
      1504414217263350500 | 2016-09-23 19:18:14.897+00 | 2016-09-23 19:18:14.897+00 | 2017-03-02 04:59:59+00 | t | 8a35b3c
      f572e6de60157587bdd5a01e4
      ...
      23576426846028686 | 2018-01-23 22:28:59.977+00 | 2018-01-23 22:28:59.977+00 | 2019-10-01 03:59:59+00 | t | 8a35b3c
      f612516560161252384b80680
      977662442025664028 | 2018-01-23 22:29:00.005+00 | 2018-01-23 22:29:00.005+00 | 2019-10-01 03:59:59+00 | t | 8a35b3c
      f612516560161252384bf0682
      1500499259872337028 | 2018-01-23 22:28:59.955+00 | 2018-01-23 22:28:59.955+00 | 2019-10-01 03:59:59+00 | t | 8a35b3c
      f612516560161252384c80684
      (525 rows)
      ~~~

      Do you think we can workaround this by using the following sql statement to delete all the expired/revoked subscription certs?

      su - postgres -c "psql candlepin -c \"delete from cp_certificate where id in (select c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't');\""

      — Additional comment from on 2023-08-04T14:06:27Z

      (In reply to Alexey Masolov from comment #1)
      > Hello,
      >
      > I found a series of changes about a similar problem for different
      > certificates:
      >
      > https://github.com/candlepin/candlepin/pull/3336/files
      > https://bugzilla.redhat.com/show_bug.cgi?id=2084130
      >
      > I'm wondering why a similar change wasn't applied to
      > https://github.com/candlepin/candlepin/blob/
      > e4dbad0fe007af49baca0550bdd6828be1b7f0fd/src/main/java/org/candlepin/model/
      > ContentAccessCertificateCurator.java#L142

      Hi,

      That was probably an oversight. Someone is currently investigating the issue.

      Thanks,
      Nikos

      — Additional comment from on 2023-08-04T14:17:41Z

      Can I get a database dump for this?

      — Additional comment from on 2023-08-04T18:12:31Z

      Hello Team,

      The customer sent the database dump from Candlepin as the file is 2GB I am sending it to:
      https://file.rdu.redhat.com/~alsouza/BZ-2229095/candlepin-20230804.sql.gz

      I restored the database in my Satellite 6.11, follow the information below:
      https://alsouza-sat611b.d.sysmgmt.cee.redhat.com
      credentials: root/redhat, admin/redhat
      candlepin database from visa is 'candlepin_03574899'

      The Satellite uses the default candlepin database (candlepin), not the visa candlepin.

      If you need anything else, please let me know.

      Regards,
      Aldrey Souza

      — Additional comment from on 2023-08-04T20:09:19Z

      Hello Team,

      I believe happens some issues with the candlepin jobs in the past and didn't clean up the references on table 'cp_certificate'.

      In the current candlepin version, it was trying to run the CertificateCleanupJob and getting the error 'update or delete on table "cp_cert_serial" violates foreign key constraint'.

      I was checking on the customer database and I didn't find another reference for the IDs with the certificate expired.

      I restored the customer database in another Satellite to check the issue.

      I performed the hao's query below:
      ~~~
      su - postgres -c "psql candlepin -c \"select cs.*, c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't';\""
      ~~~

      Getting the same result '525 rows'.

      So I performed the hao's suggestion to remove the expired certificates:
      ~~~
      candlepin_03574899=# delete from cp_certificate where id in (select c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't');
      DELETE 525
      ~~~

      Checking again by revoked certificates:
      ~~~
      candlepin_03574899=# select cs.*, c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't';
      id | created | updated | expiration | revoked | certificate_id
      ----------------------------------+---------------
      (0 rows)
      ~~~

      As the CertificateCleanupJob is failing when trying to perform the query below:
      ~~~ postgresql logs
      2023-08-03 17:45:22 GMT ERROR: update or delete on table "cp_cert_serial" violates foreign key constraint "fk16370c54b9410fc" on table "cp_certificate"
      2023-08-03 17:45:22 GMT DETAIL: Key (id)=(1298353109733755480) is still referenced from table "cp_certificate".
      2023-08-03 17:45:22 GMT STATEMENT: delete from cp_cert_serial where revoked=true and expiration<$1
      ~~~

      Checking all certificated revoked, I got this result:
      ~~~
      select count from cp_cert_serial where revoked=true;
      count
      ---------
      6880875
      (1 row)
      ~~~

      So I believe after removing the very old certificates the Job will have a good job to do.

      Regards,
      Aldrey Souza

      — Additional comment from on 2023-08-04T20:23:42Z

      I think that will give you a workaround. As you got no constraint violations from cp_pool on that delete, it is safe to assume that no pools were referencing those rows in cp_certificate.

      — Additional comment from on 2023-08-10T21:39:07Z

      Hello Team,

      I was making some tests in the candlepin restored.

      Checking the table 'cp_cert_serial', I see this information:
      ~~~
      candlepin_03574899=# \d cp_cert_serial
      Table "public.cp_cert_serial"
      Column | Type | Collation | Nullable | Default
      ------------------------------------------------------------
      id | bigint | | not null |
      created | timestamp with time zone | | |
      updated | timestamp with time zone | | |
      expiration | timestamp with time zone | | |
      revoked | boolean | | not null | false
      Indexes:
      "cp_cert_serial_pkey" PRIMARY KEY, btree (id)
      Referenced by:
      TABLE "cp_cont_access_cert" CONSTRAINT "cont_acc_serial_fk" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_certificate" CONSTRAINT "fk16370c54b9410fc" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_id_cert" CONSTRAINT "fk456324f64b9410fc" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_ent_certificate" CONSTRAINT "fke76508714b9410fc" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_ueber_cert" CONSTRAINT "ueber_cert_serial_fk" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      ~~~

      Removing the old certificate on 'cp_certificate' (already done on the customer):
      ~~~
      candlepin_03574899=# delete from cp_certificate where id in (select c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't');
      DELETE 525
      ~~~

      Count the entries using the expiration<='2018-01-01 00:00:00':
      ~~~
      candlepin_03574899=# select count from cp_cert_serial where revoked=true and expiration<='2018-01-01 00:00:00';
      count
      --------
      107325
      (1 row)
      ~~~

      When I run the EXPLAIN ANALYZE to delete:
      ~~~
      candlepin_03574899=# EXPLAIN ANALYZE DELETE from cp_cert_serial where revoked=true and expiration<'2018-01-01 00:00:00';
      LOG: duration: 7245816.979 ms statement: EXPLAIN ANALYZE DELETE from cp_cert_serial where revoked=true and expiration<'2018-01-01 00:00:00';
      QUERY PLAN
      ---------------------------------------------------------------------------------------------------------------------------------------------------
      Delete on cp_cert_serial (cost=1943.76..65803.34 rows=98430 width=6) (actual time=1136.178..1136.180 rows=0 loops=1)
      -> Bitmap Heap Scan on cp_cert_serial (cost=1943.76..65803.34 rows=98430 width=6) (actual time=11.785..154.881 rows=107325 loops=1)
      Recheck Cond: (expiration < '2018-01-01 00:00:00-05'::timestamp with time zone)
      Filter: revoked
      Rows Removed by Filter: 24
      Heap Blocks: exact=8628
      -> Bitmap Index Scan on cp_cert_serial_expkey (cost=0.00..1919.15 rows=103829 width=0) (actual time=10.321..10.321 rows=107349 loops=1)
      Index Cond: (expiration < '2018-01-01 00:00:00-05'::timestamp with time zone)
      Planning Time: 0.172 ms
      Trigger for constraint cont_acc_serial_fk: time=7227062.362 calls=107325
      Trigger for constraint fk16370c54b9410fc: time=8637.458 calls=107325
      Trigger for constraint fk456324f64b9410fc: time=2954.545 calls=107325
      Trigger for constraint fke76508714b9410fc: time=2628.675 calls=107325
      Trigger for constraint ueber_cert_serial_fk: time=3017.378 calls=107325
      Execution Time: 7245655.747 ms
      (15 rows)
      ~~~

      We can see a huge time in constraint 'cont_acc_serial_fk'.

      Reviewing the constraint on the table 'cp_cert_serial':
      ~~~
      Referenced by:
      TABLE "cp_cont_access_cert" CONSTRAINT "cont_acc_serial_fk" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_certificate" CONSTRAINT "fk16370c54b9410fc" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_id_cert" CONSTRAINT "fk456324f64b9410fc" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_ent_certificate" CONSTRAINT "fke76508714b9410fc" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      TABLE "cp_ueber_cert" CONSTRAINT "ueber_cert_serial_fk" FOREIGN KEY (serial_id) REFERENCES cp_cert_serial(id)
      ~~~

      Only the tables below have the index to 'serial_id':
      cp_certificate
      cp_id_cert
      cp_ent_certificate

      So is missing the index to 'serial_id' on tables 'cp_cont_access_cert' and 'cp_ueber_cert'.

      I created the indexes below:
      ~~~
      CREATE INDEX cp_cont_access_cert_serial_id_idx on cp_cont_access_cert(serial_id);
      CREATE INDEX cp_ueber_cert_serial_id_idx on cp_ueber_cert(serial_id);
      ~~~

      Hao's created the index below too:
      ~~~
      CREATE INDEX cp_consumer_cont_acc_cert_id_idx on cp_consumer(cont_acc_cert_id);
      ~~~

      I run analyze on tables to allow the postgres to process better the queries:
      ~~~
      ANALYZE VERBOSE cp_cont_access_cert;
      ANALYZE VERBOSE cp_certificate;
      ANALYZE VERBOSE cp_id_cert;
      ANALYZE VERBOSE cp_ent_certificate;
      ANALYZE VERBOSE cp_ueber_cert;
      ANALYZE VERBOSE cp_consumer;
      ~~~

      After that, we see the CertificateCleanupJob finishing in 5 minutes on Customer-side:
      ~~~
      start_time: 2023-08-10 16:46:39.484+00
      end_time: 2023-08-10 16:51:34.862+00
      ~~~

      In the Candlepin database from the Customer, we see the revoked certificates below:
      ~~~ command: su - postgres -c "psql candlepin -c \"SELECT count from cp_cert_serial where revoked=true;\""
      count
      ---------
      6880875
      (1 row)
      ~~~

      After the CertificateCleanupJob finished, we see this information:
      ~~~
      count
      ---------
      3248452
      (1 row)
      ~~~

      I am repeating all processes to double-check if I miss some additional information.

      Regards,
      Aldrey Souza

      — Additional comment from on 2023-08-13T23:51:25Z

      Hi Team

      Any idea why we only delete the certificates that are both revoked and expired

      candlepin_03574899=# SELECT count from cp_cert_serial where revoked=true;
      count
      ---------
      3237496
      (1 row)

      1. SELECT count from cp_cert_serial where revoked=true and expiration >= NOW();
        count
        ---------
        3236349
        (1 row)

      It is based on the expiry restriction below:
      ~~~
      private Date getExpiryRestriction()

      { Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC")); // Set to midnight first cal.set(Calendar.MINUTE, 0); cal.set(Calendar.SECOND, 0); cal.set(Calendar.HOUR_OF_DAY, 0); cal.set(Calendar.MILLISECOND, 0); // Subtract a day to put us in "yesterday" relative to midnight UTC of whatever "today" is cal.add(Calendar.DAY_OF_MONTH, -1); return cal.getTime(); }

      /**

      • Deletes all revoked cert serials that have expired.
        *
      • @return the total number of serials that were deleted.
        */
        @Transactional
        public int deleteRevokedExpiredSerials() { String hql = "DELETE FROM CertificateSerial c WHERE c.revoked = true" + " AND c.expiration < :cutoff"; Query query = this.getEntityManager().createQuery(hql); query.setParameter("cutoff", getExpiryRestriction()); return query.executeUpdate(); }

        ~~~

      — Additional comment from on 2023-08-14T00:11:04Z

      select count from cp_cert_serial cs left join cp_cont_access_cert ac on ac.serial_id = cs.id left join cp_certificate c on c.serial_id = cs.id left join cp_id_cert ic on ic.serial_id = cs.id left join cp_ent_certificate ec on ec.serial_id = cs.id left join cp_ueber_cert uc on uc.serial_id = cs.id where ac.id is null and c.id is null and ic.id is null and ec.id is null and uc.id is null;
      LOG: duration: 27134.665 ms statement: select count from cp_cert_serial cs left join cp_cont_access_cert ac on ac.serial_id = cs.id left join cp_certificate c on c.serial_id = cs.id left join cp_id_cert ic on ic.serial_id = cs.id left join cp_ent_certificate ec on ec.serial_id = cs.id left join cp_ueber_cert uc on uc.serial_id = cs.id where ac.id is null and c.id is null and ic.id is null and ec.id is null and uc.id is null;
      count
      ---------
      3238311

      It seems like only 157969 entries in cp_cert_serial are still having reference from other table. See below:
      ~~~~~~~~~
      LOG: duration: 1871.406 ms statement: select count from cp_cert_serial cs left join cp_cont_access_cert ac on ac.serial_id = cs.id left join cp_certificate c on c.serial_id = cs.id left join cp_id_cert ic on ic.serial_id = cs.id left join cp_ent_certificate ec on ec.serial_id = cs.id left join cp_ueber_cert uc on uc.serial_id = cs.id where ac.id is not null or c.id is not null or ic.id is not null or ec.id is not null or uc.id is not null;
      count
      --------
      157969
      (1 row)

      candlepin_03574899=# select count from cp_id_cert;
      count
      -------
      56581
      (1 row)

      candlepin_03574899=# select count from cp_cont_access_cert;
      count
      -------
      56001
      (1 row)

      candlepin_03574899=# select count from cp_certificate;
      count
      -------
      870
      (1 row)

      candlepin_03574899=# select count from cp_ent_certificate;
      count
      -------
      44515
      (1 row)

      candlepin_03574899=# select count from cp_ueber_cert;
      count
      -------
      2
      (1 row)
      ~~~~~~~~~

      In this case, is it ok for us to delete those without referring by other tables.

      — Additional comment from on 2023-08-15T00:20:00Z

      Hi William

      Do you mind respond to the questions? Customer is waiting. Thanks.

      — Additional comment from on 2023-08-15T12:24:35Z

      No. You should let the CertificateCleanupJob do its work. If I recall correctly, the serials in that table which are not yet expired act as a revoked list that is in use.

      — Additional comment from on 2023-08-15T12:47:06Z

      (In reply to William Poteat from comment #12)
      > No. You should let the CertificateCleanupJob do its work. If I recall
      > correctly, the serials in that table which are not yet expired act as a
      > revoked list that is in use.

      Do you mind explain in more details about how and where the revoked list is in use? and why we can't delete 3238311 revoked serials not reference by other tables?

      Thanks.

      — Additional comment from on 2023-08-15T12:57:19Z

      Do you know what is the impact if we delete the revocation list? As it is about 3 millions revoked serials in the customer database which might affect the performance somehow.

      — Additional comment from on 2023-08-15T13:53:01Z

      Then certificates that are not yet expired but have been revoked can be used to retrieve content.

      What specific performance is currently affected? Deleting rows with no concrete justification is not advisable.

      If there are rows that the job is not removing, they are there for a reason.

      — Additional comment from on 2023-08-15T16:13:53Z

      (In reply to William Poteat from comment #15)
      > Then certificates that are not yet expired but have been revoked can be used
      > to retrieve content.
      >
      > What specific performance is currently affected? Deleting rows with no
      > concrete justification is not advisable.
      >
      > If there are rows that the job is not removing, they are there for a reason.

      Please check comment 7 and 8.

      Without creating the indexes, the job stuck for days deleting the expired and revoked certificates. It stuck on deleting 3 millions+ revoked and expired entries in the 'cp_cert_serial' table for days and never finish. While the job is running long, it appears to slow down the Candlepin significantly and becomes unresponsive.

      Customer is very worry that this will happen again and ask if the remaining 3 millions+ revoked entries can be deleted.

      In the Satellite perspective, I think the revoked certs will not be reused to retrieve contents.

      I also suspect that scanning the revocation list can be slow too due to the missing serial id index in the table.

      I am under impression that the revocation list used to store in a file and we had the issue of the file growing too large or something?

      If it is not recommended to delete all revocation list, can we at least cleanup some old entries that have been created for years?

      — Additional comment from on 2023-08-16T04:13:01Z

      I think it is a regression of this bug 1927532

      — Additional comment from on 2023-08-16T08:43:42Z

      It appears that the customer was unregistering/re-registering more than 10K of consumers every month (see below). In this case, the revocation list in 'cp_cert_serial' will continue to grow even more? If yes, do we have a way to control this, such as we only keep revocation list not older than 1 year?

      ~~~

      1. select count from cp_deleted_consumers where created > '2023-05-01' and created < '2023-06-01';
        count
        -------
        11192
        (1 row)
      1. select count from cp_deleted_consumers where created > '2023-06-01' and created < '2023-07-01';
        count
        -------
        13580
        (1 row)
      1. select count from cp_deleted_consumers where created > '2023-07-01' and created < '2023-08-01';
        count
        -------
        49717
        (1 row)
        ~~~

      — Additional comment from on 2023-08-16T12:40:21Z

      If they are revoked and expired, then the job should clean them up. The indexes were added and should eliminate the long job runtime. Are you saying now that the revoked and expired serials are not getting cleaned up by the job? Are they now living past their expiration date despite the correction made?

      If they are revoked but not expired they remain to form the CRL. If revoked certs are deleted then they can be used until they expire which violates their purpose and the subscription agreement.

      The re-registering of all these systems seems like a big waste. I can only suspect that it was used to cover an old bug that I would hope is gone.

      — Additional comment from on 2023-08-21T08:26:45Z

      produban uses satellite as heart of their cloud, so registering a deleting systems on scale happens

      — Additional comment from on 2023-12-06T16:43:59Z

      Moving this back to ASSIGNED, because the fix was reverted from both candlepin-4.2.x and candlepin-4.3.x, since there were no resources to verify this in the next 6.14.z stream, and as far as I know there's no more 6.13.z stream planned.

      When a sat-6.y.z+ flag is added on this, we will fix this again in the appropriate version(s).

      — Additional comment from on 2023-12-07T09:07:56Z

      My understanding is there are two issues behind this BZ, foreign key error and performance:

      1) CertificateCleanupJob fails with foreign key constraint error:

      ERROR: update or delete on table "cp_cert_serial" violates foreign key constraint "fk16370c54b9410fc" on table "cp_certificate"
      Detail: Key (id)=(1298353109733755480) is still referenced from table "cp_certificate".

      Do we need a reproducer for this? Is this error acknowledged to be fixed, or is it recommended to just clean some redundant certs?

      For the workaround / cleanup: what is the right command? Is it the

      su - postgres -c "psql candlepin -c \"delete from cp_certificate where id in (select c.id as certificate_id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id where cs.revoked = 't');\""

      one?

      2) Independent problem is the performance: deleting hundreds of thousands (or millions) of such certs takes hours to days what makes candlepin unresponsive during that time.

      Is this planned to be fixed? What indices can be added as a workaround - and do we need to remove them before an upgrade (that would, as a part of fix, try to add them and re-adding an index would fail and halt the upgrade, potentially)? Is it the:

      ~~~
      CREATE INDEX cp_cont_access_cert_serial_id_idx on cp_cont_access_cert(serial_id);
      CREATE INDEX cp_ueber_cert_serial_id_idx on cp_ueber_cert(serial_id);
      ~~~

      Hao's created the index below too:
      ~~~
      CREATE INDEX cp_consumer_cont_acc_cert_id_idx on cp_consumer(cont_acc_cert_id);
      ~~~

      ?

      (we should write KCS for it but I would like to double-check its content before writing)

      (please consider the questions "will it be fixed" as informative, not pushing for specific fix/solution)

      — Additional comment from on 2023-12-07T13:31:21Z

      Cleaning the data on an ad hoc basis is the plan. There is no need to write code to clean these rows. The current code does not leave them and it would not be an effective use of anyone's time.

      The indices are not a workaround, they are the fix. They have been added to the current Candlepin for use in Satellite. No patch will be produced to create these indices. They should be applied to a database as needed.

      — Additional comment from on 2023-12-07T15:20:45Z

      Thanks for confirmation, KCS created: https://access.redhat.com/solutions/7048429

      — Additional comment from on 2023-12-08T07:53:48Z

      I am linking support cases where I found (by some limited search) sosreport shows the same error affects these Satellites. Despite the cases were opened for different issues, these customers are affected by this problem as well.

      — Additional comment from on 2023-12-12T12:16:51Z

      To add even more context to Pavel's comment above (#c22), after further investigation:

      This issue is really multiple issues:
      1. A foreign key violation against fk54b0f288a0b39916 (referencing table cp_upstream_consumer) that was already fixed in https://bugzilla.redhat.com/show_bug.cgi?id=2084130
      2. The job being very slow due to lack of appropriate indices.
      3. A foreign key violation against fk16370c54b9410fc (referencing table cp_certificate) that has not yet to be addressed. This violation should have never happened under
      normal circumstances because certificates in that table should never be revoked. However, an old db migration script (https://github.com/candlepin/candlepin/blob/6f1a83ced858abc560666af90ecc659f138bb338/src/main/resources/db/changelog/20170301083925-update-revoked-cert-serial-data.xml#L24-L32) accidentally set those to revoked, and once they also were expired, this fk violation started happening.

      We will use this BZ id to fix both issues 2 and 3, and also prevent any FK violations from happening in the future due to rogue bad data.

      — Additional comment from on 2023-12-16T10:44:23Z

      (In reply to William Poteat from comment #19)
      > If they are revoked but not expired they remain to form the CRL. If revoked
      > certs are deleted then they can be used until they expire which violates
      > their purpose and the subscription agreement.

      I have a customer having tons of such cert serials and insists on deleting the old ones - is there same safe way of determining really orphaned records? See their DB:

      candlepin=# SELECT COUNT FROM cp_cert_serial WHERE revoked = 't';
      count
      ---------
      2792396
      (1 row)

      candlepin=# SELECT COUNT FROM cp_cert_serial WHERE revoked != 't';
      count
      -------
      49312
      (1 row)

      candlepin=#

      None is expired, but as you see, very vast majority of the 2.8M cert serials are revoked. Created and/or updated timestamps span from 2015 till today, like:

      id | created | updated | expiration | revoked
      ----------------------------------------------------------------------------------------------------------
      6825476862569704537 | 2015-06-30 16:45:40.313+02 | 2015-07-01 12:00:00.07+02 | 2031-06-30 16:45:40.313+02 | t
      ..
      7571731959927718041 | 2023-12-04 11:22:57.323+01 | 2023-12-06 13:09:04.316+01 | 2024-12-04 10:22:57.323+01 | t

      None of the serials is referenced by either certificate (cp_certificate, cp_ent_certificate, cp_id_cert, cp_cont_access_cert and cp_ueber_cert tables checked).

      Quite certainly most of the certs are pure orphans left there due to some old & fixed bug. And vast majority of them will not be ever used.

      The customer insists on deleting evident orphans. Is there some safe way of detecting what records wont be ever used / what records are safe to delete?

      When needed, we have candlepin DB backup from the customer ( 10.37.195.89 , root / pmoravec-sat612-03683245, access db via su - postgres -c "psql candlepin" command; the VM might be down, if so ping me to start it)

      (please reassign the needinfo to anybody more appropriate or let me know were else I can raise the question (some mailing list?))

      Thanks in advance for assisting me.

      — Additional comment from on 2023-12-18T12:19:07Z

      (In reply to Pavel Moravec from comment #27)
      > None is expired, but as you see, very vast majority of the 2.8M cert serials
      > are revoked. Created and/or updated timestamps span from 2015 till today,

      The identity certificate default duration used to be 16 years (recently reduced to 5 years),
      which is why you'll see such old identity certs/serials.

      > None of the serials is referenced by either certificate (cp_certificate,
      > cp_ent_certificate, cp_id_cert, cp_cont_access_cert and cp_ueber_cert tables
      > checked).

      Check cp_cdn_certificate as well, although I highly doubt there will be something (I don't think this has ever been used).

      >
      > The customer insists on deleting evident orphans. Is there some safe way of
      > detecting what records wont be ever used / what records are safe to delete?

      Under normal circumstances, all serials that are revoked should not be referenced by anything.
      Deleting them, SQL-wise, should never throw a constraint violation (which is why the issue in this BZ is a bug).
      Revoked serials are not actually used/checked by Satellite (which means technically, someone with a revoked certificate could still use it
      to either authenticate with candlepin or access content). They are only used in Hosted Candlepin (so that the CDN will not
      let you get access to content with a revoked cert). I saw a jira about a year ago for implementing this in Satellite but
      I doubt it will happen any time soon.

      So the tl;dr: if they are revoked, and are not referenced by any table, even though they're not expired, you are safe to just delete them.

      — Additional comment from on 2024-01-08T13:15:10Z

      Update in case this bug is re-triaged:

      — Additional comment from on 2024-01-29T14:17:49Z

      Verified.

      Version Tested: Satellite 6.15.0 Snap 8

      Verification Steps:
      1. Import a manifest on satellite.
      2. Schedule CertificateCleanupJob to run every minute using command "candlepin.async.jobs.CertificateCleanupJob.schedule=0 * * * * ?" in /etc/candlepin/candlepin.conf and restart tomcat.
      3. Find the id using query "select cs.id from cp_cert_serial cs inner join cp_certificate c on c.serial_id = cs.id;"
      4. Update that id to be revoked using command "UPDATE cp_cert_serial SET revoked=true, expiration='2022-01-15 00:00:00.000000' WHERE id=<one of the ids returned by the previous query>;"
      5. Monitor "/var/log/candlepin/candlepin.log" for Foreign Key violations.

      Observation:
      No Foreign Key violations are observed in candlepin logs.

            rh-ee-amolpati Amol Patil
            rh-ee-amolpati Amol Patil
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: