Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-33715

Reduce the number of calls to the subscriptions fetchOrganization endpoint from console-operator

    • No
    • False
    • Hide

      None

      Show
      None
    • Release Note Not Required
    • In Progress

      Description of problem:

      console-operator is fetching the organization ID from OCM on every sync call, which is too often. We need to reduce the fetch period.

      Version-Release number of selected component (if applicable):

      4.16

      How reproducible:

      Always

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

          

            [OCPBUGS-33715] Reduce the number of calls to the subscriptions fetchOrganization endpoint from console-operator

            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: OpenShift Container Platform 4.17.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:3718

            Errata Tool added a comment - Since the problem described in this issue should be resolved in a recent advisory, it has been closed. For information on the advisory (Moderate: OpenShift Container Platform 4.17.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3718

            Thanks rhn-support-yapei ,

             

            Looking at the logs, I see we have received 50+ requests to fetch the organizationID for this cluster in the last day.

             

            Does the UI tries to fetch the organizationID from OCM every time someone access a specific page?

             

            I can share the logs with you via slack (apparently I don't have permissions to create attachments in this Jira project).

             

            Cristiano Veiga added a comment - Thanks rhn-support-yapei ,   Looking at the logs, I see we have received 50+ requests to fetch the organizationID for this cluster in the last day.   Does the UI tries to fetch the organizationID from OCM every time someone access a specific page?   I can share the logs with you via slack (apparently I don't have permissions to create attachments in this Jira project).  

            YaDan Pei added a comment -

            Today I'm checking against a self managed OCP cluster

            $ oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.17.0-0.nightly-2024-07-20-191204   True        False         9h      Cluster version is 4.17.0-0.nightly-2024-07-20-191204
            
            $ oc get clusterversion version -o json | jq .spec
            {
              "channel": "stable-4.17",
              "clusterID": "039c5e14-1c65-4515-b43a-d4eb3c254094"
            } 

            YaDan Pei added a comment - Today I'm checking against a self managed OCP cluster $ oc get clusterversion NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS version   4.17.0-0.nightly-2024-07-20-191204   True        False         9h      Cluster version is 4.17.0-0.nightly-2024-07-20-191204 $ oc get clusterversion version -o json | jq .spec {   "channel" : "stable-4.17" ,   "clusterID" : "039c5e14-1c65-4515-b43a-d4eb3c254094" }

            Hi rhn-support-yapei , thanks for looking. 

            Can you please provide the clusterID that you used for the verification? I would like to check our logs.

            Also, is this in the stage or production environment?

            Cristiano Veiga added a comment - Hi rhn-support-yapei , thanks for looking.  Can you please provide the clusterID that you used for the verification? I would like to check our logs. Also, is this in the stage or production environment?

            YaDan Pei added a comment -

            On a 4.17.0-0.nightly-2024-07-20-191204 cluster, I increased operatorLogLevel to Trace

            $ oc get clusterversion
            NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
            version   4.17.0-0.nightly-2024-07-20-191204   True        False         9h      Cluster version is 4.17.0-0.nightly-2024-07-20-191204
            
            $ oc get console.operator cluster -o json | jq .spec.operatorLogLevel
            "Trace"
            
            $ oc logs console-operator-5476d4bd79-khj4f -n openshift-console-operator | grep 'telemetry config'
            E0723 23:21:38.641843       1 telemetry.go:101] telemetry config error: empty OCM response
            E0723 23:21:42.199783       1 telemetry.go:101] telemetry config error: empty OCM response
            E0723 23:21:45.488400       1 telemetry.go:101] telemetry config error: empty OCM response
            E0723 23:21:47.574355       1 telemetry.go:101] telemetry config error: empty OCM response
            E0723 23:21:48.615655       1 telemetry.go:101] telemetry config error: empty OCM response
            E0723 23:21:51.517561       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests'
            E0723 23:21:52.810614       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests'
            E0723 23:22:00.512077       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests'
            E0723 23:22:06.756308       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests'
            E0723 23:22:08.560872       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests'
            E0723 23:22:20.785495       1 telemetry.go:101] telemetry config error: empty OCM response
            I0724 01:37:13.297277       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:37:13.333296       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:37:17.618951       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:37:43.853956       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:37:52.502712       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:38:01.806527       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:38:12.953452       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:38:17.620227       1 telemetry.go:95] telemetry config: using cached organization ID
            I0724 01:38:49.324109       1 telemetry.go:95] telemetry config: using cached organization ID 
            ..............all are the same log: telemetry config: using cached organization ID ................
            I0724 08:58:17.892938       1 telemetry.go:95] telemetry config: using cached organization ID

            From operator log, we can see that console only fetches organization ID from OCM at the very beginning then we started to use the cached one https://github.com/openshift/console-operator/blob/master/pkg/console/telemetry/telemetry.go#L87-L118 

            It looks like this is working as expected jhadvig@redhat.com  WDYT?

            YaDan Pei added a comment - On a 4.17.0-0.nightly-2024-07-20-191204 cluster, I increased operatorLogLevel to Trace $ oc get clusterversion NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS version   4.17.0-0.nightly-2024-07-20-191204   True        False         9h      Cluster version is 4.17.0-0.nightly-2024-07-20-191204 $ oc get console. operator cluster -o json | jq .spec.operatorLogLevel "Trace" $ oc logs console- operator -5476d4bd79-khj4f -n openshift-console- operator | grep 'telemetry config' E0723 23:21:38.641843       1 telemetry.go:101] telemetry config error: empty OCM response E0723 23:21:42.199783       1 telemetry.go:101] telemetry config error: empty OCM response E0723 23:21:45.488400       1 telemetry.go:101] telemetry config error: empty OCM response E0723 23:21:47.574355       1 telemetry.go:101] telemetry config error: empty OCM response E0723 23:21:48.615655       1 telemetry.go:101] telemetry config error: empty OCM response E0723 23:21:51.517561       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests' E0723 23:21:52.810614       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests' E0723 23:22:00.512077       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests' E0723 23:22:06.756308       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests' E0723 23:22:08.560872       1 telemetry.go:101] telemetry config error: HTTP request failed with status '429 Too Many Requests' E0723 23:22:20.785495       1 telemetry.go:101] telemetry config error: empty OCM response I0724 01:37:13.297277       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:37:13.333296       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:37:17.618951       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:37:43.853956       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:37:52.502712       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:38:01.806527       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:38:12.953452       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:38:17.620227       1 telemetry.go:95] telemetry config: using cached organization ID I0724 01:38:49.324109       1 telemetry.go:95] telemetry config: using cached organization ID ..............all are the same log: telemetry config: using cached organization ID ................ I0724 08:58:17.892938 1 telemetry.go:95] telemetry config: using cached organization ID From operator log, we can see that console only fetches organization ID from OCM at the very beginning then we started to use the cached one https://github.com/openshift/console-operator/blob/master/pkg/console/telemetry/telemetry.go#L87-L118   It looks like this is working as expected jhadvig@redhat.com   WDYT?

            Cristiano Veiga added a comment - - edited

            Hi rhn-support-yapei jhadvig@redhat.com ,

             

            We are still seeing clusters running openshift versions greater than 4.17.0-0.nightly-2024-07-10-07294 sending hundreds of requests to OCM to fetch the organization. 

             

            Can you please double check if this was fixed? This continues to create a load in AMS as more clusters are installed/upgraded to the faulty versions.

             

             ehimmelr.openshift bvulaj1@redhat.com FYI, we will need to implement stricter rules in the rate limiting mechanism to protect against these requests.

             

            Cristiano Veiga added a comment - - edited Hi rhn-support-yapei jhadvig@redhat.com ,   We are still seeing clusters running openshift versions greater than 4.17.0-0.nightly-2024-07-10-07294 sending hundreds of requests to OCM to fetch the organization.    Can you please double check if this was fixed? This continues to create a load in AMS as more clusters are installed/upgraded to the faulty versions.     ehimmelr.openshift bvulaj1@redhat.com FYI, we will need to implement stricter rules in the rate limiting mechanism to protect against these requests.  

            YaDan Pei added a comment -

            1. Check default ORGANIZATION_ID, this is fetched from OCM

            $ oc get cm console-config -n openshift-console -o yaml
            .......................
                telemetry:
                  CLUSTER_ID: 45dd92f7-21c3-49c6-a991-cb655be29ebf
                  ORGANIZATION_ID: 1V6IJrh1cNmDxgNlAAWZRfupr3B
                  SEGMENT_API_HOST: console.redhat.com/connections/api/v1
                  SEGMENT_JS_HOST: console.redhat.com/connections/cdn
                  SEGMENT_PUBLIC_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU
            .............. 

            2. Add custom ORGANIZATION_ID in cm/ telemetry-config, we can see messages: 

            I0710 13:04:20.263232 1 telemetry.go:90] telemetry config: using custom organization ID

             

            $ oc get cm telemetry-config -n openshift-console-operator -o yaml
            apiVersion: v1
            data:
              ORGANIZATION_ID: 2V8IJrh1cNmDxgNlAAWZRfupr4X
              SEGMENT_API_HOST: console.redhat.com/connections/api/v1
              SEGMENT_JS_HOST: console.redhat.com/connections/cdn
              SEGMENT_PUBLIC_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU
            kind: ConfigMap
            metadata:
              annotations:
                capability.openshift.io/name: Console
                include.release.openshift.io/hypershift: "true"
                include.release.openshift.io/ibm-cloud-managed: "true"
                include.release.openshift.io/self-managed-high-availability: "true"
                include.release.openshift.io/single-node-developer: "true"
              creationTimestamp: "2024-07-10T09:11:18Z"
              name: telemetry-config
              namespace: openshift-console-operator
             

            3. Remove custom ORGANIZATION_ID, then check the operator logs, it shows:

            I0710 13:05:24.159648 1 telemetry.go:95] telemetry config: using cached organization ID

             

            We can see that console didn't send OCM fetch requests, instead we use cached one

            The issue is fixed on 4.17.0-0.nightly-2024-07-10-072949

             

             

            YaDan Pei added a comment - 1. Check default ORGANIZATION_ID, this is fetched from OCM $ oc get cm console-config -n openshift-console -o yaml .......................     telemetry:       CLUSTER_ID: 45dd92f7-21c3-49c6-a991-cb655be29ebf       ORGANIZATION_ID: 1V6IJrh1cNmDxgNlAAWZRfupr3B       SEGMENT_API_HOST: console.redhat.com/connections/api/v1       SEGMENT_JS_HOST: console.redhat.com/connections/cdn       SEGMENT_PUBLIC_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU .............. 2. Add custom ORGANIZATION_ID in cm/ telemetry-config, we can see messages:  I0710 13:04:20.263232 1 telemetry.go:90] telemetry config: using custom organization ID   $ oc get cm telemetry-config -n openshift-console- operator -o yaml apiVersion: v1 data:   ORGANIZATION_ID: 2V8IJrh1cNmDxgNlAAWZRfupr4X   SEGMENT_API_HOST: console.redhat.com/connections/api/v1   SEGMENT_JS_HOST: console.redhat.com/connections/cdn   SEGMENT_PUBLIC_API_KEY: BnuS1RP39EmLQjP21ko67oDjhbl9zpNU kind: ConfigMap metadata:   annotations:     capability.openshift.io/name: Console     include.release.openshift.io/hypershift: " true "     include.release.openshift.io/ibm-cloud-managed: " true "     include.release.openshift.io/self-managed-high-availability: " true "     include.release.openshift.io/single-node-developer: " true "   creationTimestamp: "2024-07-10T09:11:18Z"   name: telemetry-config   namespace: openshift-console- operator 3. Remove custom ORGANIZATION_ID, then check the operator logs, it shows: I0710 13:05:24.159648 1 telemetry.go:95] telemetry config: using cached organization ID   We can see that console didn't send OCM fetch requests, instead we use cached one The issue is fixed on 4.17.0-0.nightly-2024-07-10-072949    

            YaDan Pei added a comment - - edited

            jhadvig@redhat.com  regarding bug verification, do we have some ways to identify whether the ORGANIZATION_ID comes from cached or a OCM fetch?

            Asking since I didn't see useful logs will be printed either in FetchOrganizationID or GetOrganizationID.

            YaDan Pei added a comment - - edited jhadvig@redhat.com   regarding bug verification, do we have some ways to identify whether the ORGANIZATION_ID comes from cached or a OCM fetch? Asking since I didn't see useful logs will be printed either in FetchOrganizationID or GetOrganizationID.

            YaDan Pei added a comment -

            waiting for a new 4.17 nightly which includes the fix

            YaDan Pei added a comment - waiting for a new 4.17 nightly which includes the fix

            YaDan Pei added a comment -

            jhadvig@redhat.com 

            It looks like once the ORGANIZATION_ID is override, it can not be reset to correct organization_id fetched from OCM(by default, ORGANIZATION_ID is correctly set)

            For example, if I test with following steps:

            1. Add custom organization id in `console.operator/cluster` via telemetry annotation

            telemetry.console.openshift.io/ORGANIZATION_ID: 2H2PQMDtwzAUsjPxgoWRjhSpNGD

            Or

            2. Add a new data in cm/telemetry-config
            ORGANIZATION_ID: 3H3PQMDtwzAUsjPxgoWRjhSpNGD

            3. Remove the customized organization ID from telemetry annotation of console.operator/cluster and cm/telemetry-config

            However the ORGANIZATION_ID in cm/console-config will be always kept as 3H3PQMDtwzAUsjPxgoWRjhSpNGD(actually the correct organization ID should be 1H1PQMDtwzAUsjPxgoWRjhSpNGD) We can see a request was sent to OCM and response is successfully returned

             

            I doubt whether this is a valid scenario since reading from slack thread it looks like we should fetch the organization id only once? Could you help double confirm or help me clarify?

             

            YaDan Pei added a comment - jhadvig@redhat.com   It looks like once the ORGANIZATION_ID is override, it can not be reset to correct organization_id fetched from OCM(by default, ORGANIZATION_ID is correctly set) For example, if I test with following steps: 1. Add custom organization id in `console.operator/cluster` via telemetry annotation telemetry.console.openshift.io/ORGANIZATION_ID: 2H2PQMDtwzAUsjPxgoWRjhSpNGD Or 2. Add a new data in cm/telemetry-config ORGANIZATION_ID: 3H3PQMDtwzAUsjPxgoWRjhSpNGD 3. Remove the customized organization ID from telemetry annotation of console.operator/cluster and cm/telemetry-config However the ORGANIZATION_ID in cm/console-config will be always kept as 3H3PQMDtwzAUsjPxgoWRjhSpNGD(actually the correct organization ID should be 1H1PQMDtwzAUsjPxgoWRjhSpNGD) We can see a request was sent to OCM and response is successfully returned   I doubt whether this is a valid scenario since reading from slack thread it looks like we should fetch the organization id only once? Could you help double confirm or help me clarify?  

              jhadvig@redhat.com Jakub Hadvig
              jhadvig@redhat.com Jakub Hadvig
              YaDan Pei YaDan Pei
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

                Created:
                Updated:
                Resolved: