Red Hat Advanced Cluster Security / ROX-32843

Machine-to-machine auth may be a source of Central DB connection hoarding


      USER PROBLEM
      What is the user experiencing as a result of the bug? Include steps to reproduce.

      A user reported that their Central instances suddenly started hoarding database connections (case/04360186). Our investigation was inconclusive until the user reported back that they had found the cause. From the customer case:

      We have identified and resolved the issue. We are using a declarative Machine-to-Machine (M2M) configuration; however, one of our endpoints was unreachable. The Central instance was attempting to add this endpoint every minute alongside the healthy ones, which eventually caused Central to become unresponsive.
      
      To resolve this, I removed the M2M configuration from the Custom Resource (CR), deleted the stale endpoint from the ConfigMap, and reconfigured the CR. The instances have remained stable for the last few hours, and we will continue to monitor the situation.
      
      Please note that stale endpoints added via the UI do not cause any issues. The instability only occurs when they are added declaratively.
      

      In the Central logs, the following line hints at the problem:

      root logger: 2026/01/25 15:35:43.196709 logging.go:280: Error: Unexpected Error: creating token exchanger for config 2dbac0ec-985e-5e6d-a83c-7be8fa09e786: creating OIDC provider for issuer "https://storage.googleapis.com/caas_ocp_gcp_1330_prod_oidc": 403 Forbidden: <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details></Error> 
      

      CONDITIONS
      What conditions need to exist for a user to be affected? Is it everyone? Is it only those with a specific integration? Is it specific to someone with particular database content? etc.

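      A "stale" (unreachable) issuer endpoint configured for machine-to-machine (M2M) auth via declarative configuration. According to the customer, stale endpoints added via the UI do not trigger the problem.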

      ROOT CAUSE
      What is the root cause of the bug?

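      Initial investigation by rh-ee-dashrews pointed at the auth datastore. Apparently, when creating the token exchanger fails, the rollback function that is called does not roll back the database transaction, so the connection it holds is never returned to the pool.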

      FIX
      How was the bug fixed (this is more important if a workaround was implemented rather than an actual fix)?

      As per rh-ee-dashrews, either the call to UpsertTokenExchanger needs to be moved outside the acquisition of the database connection, or, when that call fails, both rollback functions need to be called.

      I wonder whether we can make further defensive changes in the DB code to prevent leaks like this in the future.

              rh-ee-dashrews David Shrewsberry
              aruklets@redhat.com Alexander Rukletsov
              David Shrewsberry, Stephan Hesselmann
              ACS Sensor & Ecosystem