  1. Red Hat Advanced Cluster Management
  2. ACM-9747

MCE discovery filter is using wrong subscription property to determine last-active time


Details

    • Installer Sprint 24-27
    • Moderate

    Description

      Description of problem:

      The MCE discovery controller filters on a cluster's last-active time in order to exclude entries for customer self-managed clusters that still appear in the RHOCM subscription list but were in fact deleted long ago, aka "stale" clusters. (As a "feature" of the RHOCM implementation, it seems that subscription entries in the RHOCM account-mgmt service that exist as a result of telemetry (vs. being created as a result of managed-service provisioning) remain in the database and are returned by queries for a long time, or maybe even forever. This makes the data very "noisy", and hence filtering out stale entries is necessary.)

      Currently, the discovery controller determines last-active time using the updated_at property of the subscription entry. Based on analysis of data from RHOCM, it appears that this property is updated any time the subscription entry is updated (for example, to change its display name or status, or to archive/unarchive the entry) and thus does not necessarily represent cluster activity. It seems the more correct attribute to use is the last_telemetry_date property.
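
      To illustrate the proposed change, here is a minimal sketch of a staleness check keyed on last_telemetry_date instead of updated_at. It is not the discovery controller's actual code: the is_stale and parse_ts helpers, the 30-day default threshold, and the choice to treat an entry with no telemetry date as stale are all assumptions made for illustration. Only the two property names come from the RHOCM subscription data, and the timestamps are assumed to be RFC 3339 strings.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    """Parse an RFC 3339 timestamp such as '2024-01-11T09:00:00Z'."""
    # Older Python versions do not accept a trailing 'Z', so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(subscription, now, max_age=timedelta(days=30)):
    """Return True if the cluster has sent no telemetry within max_age.

    max_age is a hypothetical threshold, not the controller's real setting.
    """
    last_seen = subscription.get("last_telemetry_date")
    if not last_seen:
        # Assumption: an entry without any telemetry date counts as stale.
        return True
    return now - parse_ts(last_seen) > max_age


# Hypothetical entry: archived/unarchived recently, telemetry stopped weeks ago.
entry = {
    "updated_at": "2024-01-31T12:00:00Z",
    "last_telemetry_date": "2024-01-11T09:00:00Z",
}
now = datetime(2024, 2, 1, tzinfo=timezone.utc)
stale = is_stale(entry, now, max_age=timedelta(days=7))
```

      With this entry, a check based on updated_at would see activity one day ago, while the check above sees about three weeks without telemetry and reports the cluster as stale.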

      Notes:

      If the last_telemetry_date property is updated, that constitutes an update to the subscription entry, so the updated_at property is updated as well. Using updated_at has therefore probably worked "well enough", since it takes careful observation to spot the situations where it gives a false indication of cluster activity. But if you look at RHOCM subscription data for, e.g., deleted clusters whose entries have later been archived and then unarchived, you can see the difference: the updated_at timestamp will be recent (making the cluster seem not stale when interpreted as an indication of cluster activity), while the last_telemetry_date timestamp will remain unchanged, in the past, and stale.

      Version-Release number of selected component (if applicable):  MCE 2.4.0

      NB: This problem probably exists in all versions of MCE.

      How reproducible: 

      Easily.

      Steps to Reproduce:

      1. Create a self-managed cluster.
      2. Find the cluster/subscription entry for the cluster in RHOCM, and give it a display name to make it easy to identify later.
      3. Delete your self-managed cluster.
      4. Let some time pass (a few days) so it's easy to notice the timestamp difference and to give RHOCM a chance to declare the cluster/subscription Stale.
      5. Go to RHOCM web console and find the entry for your now-deleted cluster.
      6. Ask RHOCM to archive the entry, and then unarchive it.
      7. Pull subscription data from RHOCM via the ocm utility or via API request.
      8. In the entry for your now-deleted cluster, observe that the updated_at timestamp has been updated by your archive/unarchive actions, while the last_telemetry_date timestamp still reflects a time a few minutes before you deleted the cluster.
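
      The comparison in step 8 can also be done mechanically. The sketch below scans subscription data pulled in step 7 and lists entries whose updated_at is much newer than their last_telemetry_date. It assumes the data is a JSON object with an items list and that each entry carries a display_name field; both are assumptions about the payload shape, and the misleading_entries helper and its one-day gap threshold are hypothetical.

```python
import json
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an RFC 3339 timestamp such as '2024-01-11T09:00:00Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def misleading_entries(payload, min_gap=timedelta(days=1)):
    """Return (name, gap) pairs where updated_at trails telemetry by min_gap or more.

    payload is assumed to look like {"items": [ {...subscription...}, ... ]}.
    """
    results = []
    for sub in payload.get("items", []):
        updated = sub.get("updated_at")
        telemetry = sub.get("last_telemetry_date")
        if not updated or not telemetry:
            continue  # cannot compare without both timestamps
        gap = parse_ts(updated) - parse_ts(telemetry)
        if gap >= min_gap:
            results.append((sub.get("display_name", "<unnamed>"), gap))
    return results


# Hypothetical sample data, not the entry attached to this issue.
sample = json.loads("""
{"items": [
  {"display_name": "deleted-then-unarchived",
   "updated_at": "2024-01-31T12:00:00Z",
   "last_telemetry_date": "2024-01-11T09:00:00Z"},
  {"display_name": "still-running",
   "updated_at": "2024-01-31T12:00:00Z",
   "last_telemetry_date": "2024-01-31T12:00:00Z"}
]}
""")
flagged = misleading_entries(sample)
```

      Any entry that shows up in the result is one where interpreting updated_at as cluster activity would make a long-gone cluster look recently active.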

      I've attached a sample subscription entry (in JSON) for a cluster that I manipulated in the way described above.

      I've also attached a screenshot of the discovered-clusters view showing that the use of the wrong timestamp causes us to report a long-ago-deleted cluster (lithium-2024-01-11-unarchived-2024-01-31) as having been active very recently, when it wasn't. The only thing that happened recently was my unarchiving and renaming of it in RHOCM.

      Attachments

        Activity

          People

            dbennett@redhat.com Disaiah Bennett
            jgdaniec Joe Gdaniec
            Thuy Nguyen Thuy Nguyen
