• Epic
    • Resolution: Obsolete
    • Critical
    • None
    • None
    • Installer Core
    • Installer Telemetry
    • 5
    • installer
    • Done
    • OCPSTRAT-97 - Installer Telemetry Collection
    • 23% To Do, 0% In Progress, 77% Done

      Goal:

      As a product owner for the installer, I want insight into how users run the installer and their rate of success.

      Problem:

      Clusters don’t provide any telemetry until they are fully installed. If a user encounters an error during this process, we won’t know about it unless they notify us. This makes it very difficult to understand the global success rate for installation.

      This is specifically not about the resultant cluster, so none of these metrics should be tied to a cluster-id. Instead, we'll have to create an identifier that tracks a specific installation workflow.
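
      As a minimal sketch of such an identifier, assuming a random UUID generated once per installer run is acceptable: the `InstallReport` record and its field names below are hypothetical, not part of the installer.

```python
import uuid
from dataclasses import dataclass, field


def new_invocation_id() -> str:
    """Return a random identifier for one run of the installer."""
    # uuid4 is random, so it carries no information about any cluster.
    return str(uuid.uuid4())


@dataclass(frozen=True)
class InstallReport:
    """Hypothetical record describing one installation workflow."""
    command: str  # e.g. "create cluster"
    result: str   # e.g. "success" or "failure"
    # Generated per report; deliberately not a cluster-id.
    invocation_id: str = field(default_factory=new_invocation_id)


report = InstallReport(command="create cluster", result="failure")
print(report.invocation_id)
```

      Because the identifier is created before any cluster exists, a failed installation can still be reported and counted.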

      Why is this important:

      In order to improve the product and improve the rate of success, we need to be able to measure that rate of success and understand what errors are encountered.

      Dependencies (internal and external):

      • Service Delivery needs to create and run the telemetry-gathering service

      Previous Work:

      • Telemeter

      Customers:

      Specifics unknown

       

      Open Questions:

      • Compile examples of data we'd like to send back
      • Confirm whether Telemetry can receive reports or only time series data

            [CORS-1287] Collect telemetry from installer

            Radek Vokal added a comment - This might be helpful. The CCX team recently worked with the Assisted Installer team to allow them to upload data and process it through the CCX pipeline. Work is tracked here: https://issues.redhat.com/browse/CCX-220

            Marcos Entenza Garcia added a comment - Moving the target release now to 4.12.

            Nicole Wilker added a comment - Changes for 4.11 planning: the OpenShift 4.11 release number has been removed from Fix Version and set as the Target Version. PM will set the Target Version to indicate the request/ask. Engineering will set the Fix Version once they are ready to commit to the epic.

            Nomination Bot added a comment - Jan Holecek mentioned this issue in a merge request of ccx / ccx-ocp-core on branch install-gather: Add support for install gathers for troubleshooting failed installations.

            Yang Yang added a comment -

            If I understand correctly, the installer only pushes the metrics to a Pushgateway and does not care how or what Prometheus scrapes from it. Could anyone help me understand how users get the metrics?

            1. If the metrics are pushed to a user's local Pushgateway, would the in-cluster Prometheus get the metrics? If so, how do users configure the in-cluster Prometheus to scrape from that Pushgateway?
            2. If the metrics are pushed to the default public Pushgateway, would any public Prometheus get the metrics?
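
            For context on the push side of these questions, here is a rough sketch of what a push to a Pushgateway looks like over HTTP. A Pushgateway only holds what is pushed to it; a Prometheus server sees those metrics only if it is configured to scrape that Pushgateway. The gateway URL, job name, label, and metric name below are all assumptions for illustration and are not taken from the installer. The sketch also assumes label values contain no "/".

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway_url, job, grouping, samples):
    """Build (but do not send) a Pushgateway PUT request.

    gateway_url: base URL of a Pushgateway, e.g. "http://localhost:9091"
    job:         value of the "job" grouping label
    grouping:    dict of extra grouping labels placed in the URL path
    samples:     dict mapping a metric name to a numeric value
    """
    # Pushgateway path: /metrics/job/<job>{/<label>/<value>}
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in grouping.items():
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")

    # Prometheus text exposition format: a TYPE line, then "name value".
    lines = []
    for name, value in samples.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    body = ("\n".join(lines) + "\n").encode("utf-8")

    return urllib.request.Request(
        gateway_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


req = build_push_request(
    "http://localhost:9091",                       # assumed local gateway
    job="openshift_installer",                     # hypothetical job name
    grouping={"command": "create_cluster"},        # hypothetical label
    samples={"installer_duration_seconds": 1840},  # hypothetical metric
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

            The request is built but not sent, so the sketch runs without a gateway; whether any given Prometheus picks the data up depends entirely on that server's scrape configuration.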

            Stephanie Stout added a comment - Adding doc-ack label since there is no doc impact for this epic. 

            Rom Freiman added a comment -

            What's the plan, according to the enhancement?

            Brenton Leanhardt added a comment - During the readout call today we discussed that staffing challenges are the primary risk for this epic. From CEE's perspective this is just as important as the bootstrap debugging epic: https://issues.redhat.com/browse/CORS-1510

            Brenton Leanhardt added a comment - rhn-support-edrich, do you have a document that describes the metrics you mentioned on the readout?

            Brenton Leanhardt added a comment - This feels red to me since Patrick will be on leave for most of the 4.7 release. Happy to discuss.

            Eric Rich added a comment -

            We need to make sure that we are capturing when an install is started/completed, and how many times it's attempted.

            This is a key piece of data that we are having to guess at to prioritize things like: https://bugzilla.redhat.com/show_bug.cgi?id=1870728
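
            To illustrate how started/completed events could turn into an attempt count and a success rate, here is a small sketch. The event shape (an invocation identifier paired with "started" or "completed") is an assumption for illustration, not a format defined in this epic.

```python
def summarize(events):
    """Summarize install attempts from (invocation_id, event) pairs.

    event is "started" or "completed". An attempt that started but never
    reported "completed" is counted as not successful.
    """
    started = set()
    completed = set()
    for invocation_id, event in events:
        if event == "started":
            started.add(invocation_id)
        elif event == "completed":
            completed.add(invocation_id)

    attempts = len(started)
    successes = len(started & completed)
    rate = successes / attempts if attempts else 0.0
    return {"attempts": attempts, "successes": successes, "success_rate": rate}


# Hypothetical events: three attempts, one of which never completed.
events = [
    ("a1", "started"), ("a1", "completed"),
    ("b2", "started"),
    ("c3", "started"), ("c3", "completed"),
]
print(summarize(events))
```

            Counting a missing "completed" event as a failure is a design choice of this sketch; it also covers installs that crash or are interrupted before they can report anything.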

            Stephanie Stout added a comment - This epic does not look like it needs user facing documentation so I am adding the no-doc label. Please let me know if there are any product doc updates required for this epic. Thanks!

            Nicholas Stielau added a comment - Do we want to add work for the dashboard itself here? Working backwards from the dashboard would make sure that we are not only collecting data, but also have a way to surface and analyze it.

            Devan Goodwin added a comment - Great idea, there is already an OPENSHIFT_INSTALL_INVOKER env var supported by the installer that hive sets when we run our install pods. I added a question on https://issues.redhat.com/browse/CORS-1262 to see if it's marked for inclusion in the metrics.
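
            A sketch of how the invoker could be attached to a metric as a label, assuming the value is simply read from the OPENSHIFT_INSTALL_INVOKER environment variable. The `install_invoker` helper, the "invoker" label name, and the "user" fallback are assumptions for illustration.

```python
import os


def install_invoker(environ=os.environ):
    """Return who invoked the installer, for use as a metric label."""
    # Falling back to "user" when the variable is unset or empty is an
    # assumption of this sketch.
    return environ.get("OPENSHIFT_INSTALL_INVOKER") or "user"


# Simulate the environment Hive would set on its install pods.
labels = {"invoker": install_invoker({"OPENSHIFT_INSTALL_INVOKER": "hive"})}
print(labels)
```

            With a label like this, reports from Hive-driven installs could be separated from direct invocations without Hive sending any telemetry of its own.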

            Katherine Dubé added a comment - - edited

            rhn-engineering-gshereme mstaeble rhn-engineering-dgoodwin One thing I feel might be worthwhile is a way to log that Hive was the invoker. This way we can track Hive adoption/usage among customers. Basically, any metrics we could provide about how it's being used would probably be valuable for making future decisions about the product.

            Greg Sheremeta (Inactive) added a comment - > That is my read. The installer is the one that would send metrics to the telemetry-gathering service. Hive would not be involved. Thanks, Matthew. cc kdube@redhat.com

            Matthew Staebler (Inactive) added a comment - rhn-engineering-gshereme That is my read. The installer is the one that would send metrics to the telemetry-gathering service. Hive would not be involved.

            Greg Sheremeta (Inactive) added a comment - rhn-engineering-dgoodwin mstaeble if I understand the goal here, Hive would be able to make use of this without us really doing anything. Is that your read too, or would we need some kind of additional telemetry to be sent back by Hive?

              Unassigned Unassigned
              rhn-coreos-acrawfor Alex Crawford (Inactive)
              Jianli Wei Jianli Wei