  1. Satellite
  2. SAT-36317

[Regression] - Seen difference in duration and failure rate of most important tests in "Incremental registrations" and "Remote execution (ReX)"


    • False
    • Important
    • sat-proton
    • None
    • None
    • None
    • None
    • Yes

      Description of problem:

      We have observed changes in the duration of the "Incremental Registrations" and "Remote Execution (ReX)" tests. Additionally, there has been a significant shift in the failure rate.

      How reproducible: 

      Always. 

      Is this issue a regression from an earlier version:

      Yes.

      Steps to Reproduce:

      Run concurrent registrations and remote execution (ReX) jobs.

      Actual behavior:

      During routine checks on CPT, we noticed irregularities in the tests. Yesterday, we began an investigation and identified some issues by comparing two test streams: the current stream (Stream 116, running 2025-07-25) and the previous one (Stream 112, running 2025-07-14).

      We noticed that the job, which normally takes 15 hours, now takes 21 hours. The culprits were the sections where we run concurrent registrations (6 -> 7 hours) and where we measure remote executions (3 -> 9 hours). The failure rate also regressed. Please see the attached graphs or the spreadsheet linked below.
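
To put the reported durations in relative terms, here is a minimal sketch that computes the absolute and percentage slowdown per section. The hour values are the rounded figures quoted above; the `durations` dictionary and the `slowdown` helper are illustrative names, not part of the CPT tooling.

```python
# Durations in hours (before, after), as reported above; these are rounded figures.
durations = {
    "whole job": (15, 21),
    "concurrent registrations": (6, 7),
    "remote execution (ReX)": (3, 9),
}


def slowdown(before, after):
    """Return (absolute delta in hours, relative increase in percent)."""
    delta = after - before
    return delta, 100.0 * delta / before


for name, (before, after) in durations.items():
    delta, pct = slowdown(before, after)
    print(f"{name}: {before} h -> {after} h (+{delta} h, +{pct:.0f}%)")
```

Because the inputs are rounded to whole hours, the percentages should be read as approximate. On these figures the ReX section accounts for most of the added time.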

      The gap between runs was reported in Slack and fixed in this PR.

      Stream 112 run - https://jenkins-csb-perf-master.dno.corp.redhat.com/job/ContPerfStreamEL9/145/console

      Stream 116 run - https://jenkins-csb-perf-master.dno.corp.redhat.com/job/ContPerfStreamEL9/153/console

      We compared all the available log files and created a sheet with a detailed comparison here

      These errors might be related:

      $ grep ' \[[EW]|' production-0300.log | grep -v -e 'You are trying to replace' -e 'ignoring associations organization_ids, location_ids audit definition for' -e 'No SSL cert with CN supplied - request from' -e 'Could not find a provider for' -e 'Received .* event from Candlepin. Handling of this event is no longer supported.' -e 'Polling failed, attempt' -e 'Process exited with an unknown status: pid .* exit 22' -e 'No such file or directory @ rb_file_s_rename'[...]
      2025-07-29T03:22:23 [E|app|1a39a8ab] Fact insights_client::hostname could not be imported because of PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block
      2025-07-29T03:22:23 [E|app|1a39a8ab] Fact insights_client::obfuscate_ipv4_enabled could not be imported because of PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block
      [...]
      2025-07-29T03:32:24 [E|app|01f41379] RestClient::Gone: Katello::Resources::Candlepin::Consumer: 410 Gone
      {"displayMessage":"Unit 98626cc2-a851-4b93-a3ee-2a8e06de1175 has been deleted","requestUuid":"5f2dec16-9588-47d5-b2e1-6e4dc4b7d129","deletedId":"98626cc2-a851-4b93-a3ee-2a8e06de1175"}
      (PUT /candlepin/consumers/98626cc2-a851-4b93-a3ee-2a8e06de1175)
      2025-07-29T03:32:24 [E|app|01f41379] /usr/share/gems/gems/katello-4.18.0.pre.master/app/controllers/katello/api/rhsm/candlepin_proxies_controller.rb:227:in `block in consumer_destroy'
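
To compare error volume between the two streams, the same filtering can be extended to count messages by type. The following is a rough sketch, not the tooling used for this ticket: it assumes the `timestamp [LEVEL|logger|request-id] message` line layout seen in the excerpt above, the `NOISE` list copies only some of the `grep -v` patterns (the original list is truncated), and `normalize` is a hypothetical heuristic for grouping similar messages.

```python
import re
import sys
from collections import Counter

# Assumed line layout, based on the excerpt above:
# "2025-07-29T03:22:23 [E|app|1a39a8ab] message". Only E and W levels are kept.
LINE_RE = re.compile(
    r"^\S+ \[(?P<level>[EW])\|(?P<logger>[^|]*)\|(?P<req>[^\]]*)\] (?P<msg>.*)$"
)

# Subset of the known-noise patterns from the grep -v list above.
NOISE = [re.compile(p) for p in (
    r"You are trying to replace",
    r"No SSL cert with CN supplied - request from",
    r"Could not find a provider for",
    r"Polling failed, attempt",
)]

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def normalize(msg):
    """Collapse variable parts so similar messages group together (heuristic)."""
    msg = UUID_RE.sub("<uuid>", msg)
    return re.sub(r"Fact \S+ could not", "Fact <name> could not", msg)


def summarize(lines):
    """Count non-noise error/warning lines, keyed by (level, normalized message)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines (JSON bodies, request lines) are skipped
        msg = m.group("msg")
        if any(p.search(msg) for p in NOISE):
            continue
        counts[(m.group("level"), normalize(msg))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as fh:
        for (level, msg), n in summarize(fh).most_common(20):
            print(f"{n:6d} [{level}] {msg}")
```

Running the script once per stream's production.log and comparing the top entries would show which error types grew between Stream 112 and Stream 116.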
      

      Expected behavior:

      It is expected to work normally, as it did previously.

      Additional info:

      One hour of production.log where some registrations were happening is attached: 2025-07-29-SAT-36238-production.log

              Assignee: Unassigned
              Reporter: Imaanpreet Kaur (rhn-support-ikaur)