  1. OpenShift Bugs
  2. OCPBUGS-10728

GCP usage API response includes other projects and can cause a negative quota calculation


Details

    • Critical
    • Sprint 234, Sprint 235
    • 2
    • No
    • Rejected
    • False

    Description

      Solution: 

      PR posted at https://github.com/openshift/installer/pull/7018

      In summary, the time series returned by serviceruntime.googleapis.com/quota/allocation/usage can include project_ids other than the intended one.

      With the filters added in the PR, only usage from the intended project is matched against limits from that same project.

      Previously, it was possible to get, by chance, limits from the intended project and usage from a different project accessible to the account. That led to an incorrect quota calculation (potentially a negative quota), which prevented an otherwise capable account from launching a cluster.
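
A minimal sketch of what a project-scoped usage filter could look like, assuming the Cloud Monitoring filter syntax. The helper name `usageFilter` and the exact `resource.labels.project_id` clause are assumptions for illustration; the actual filters are in the PR linked above.

```go
package main

import "fmt"

// usageMetric is the metric type named in this issue.
const usageMetric = "serviceruntime.googleapis.com/quota/allocation/usage"

// usageFilter is a hypothetical helper that builds a filter string which
// restricts the usage time series to a single project. The
// resource.labels.project_id clause is an assumption, not the PR's code.
func usageFilter(projectID string) string {
	return fmt.Sprintf("metric.type = %q AND resource.labels.project_id = %q",
		usageMetric, projectID)
}

func main() {
	// Print the filter for the project selected for installation.
	fmt.Println(usageFilter("openshift-gce-devel"))
}
```

Filtering on the server side keeps usage from other projects out of the response entirely, so the later limit/usage matching never sees it.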

      ------

      Description of problem:

      error(MissingQuota): compute.googleapis.com/networks is not available in global because the required number of resources (1) is more than remaining quota of -negativenumber
      

      While debugging the installer master branch, I found that the quota calculated for the networks service can be negative.
      Inspecting the response shows that the limits and the usage come from different projects accessible to the account. If both had come from the same project, the result would not have been negative.

      LimitFromAccountWithLowerLimit - UsageFromAccountWithBiggerLimits = negativeQuota
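
The subtraction above can be sketched as follows. All numbers are made up for illustration and are not taken from this issue; `remaining` is a hypothetical helper, not the installer's function.

```go
package main

import "fmt"

// remaining returns limit minus usage, the subtraction described above.
func remaining(limit, usage int64) int64 {
	return limit - usage
}

func main() {
	// Hypothetical values, not taken from the issue.
	limitIntended := int64(5) // networks limit in the intended project
	usageIntended := int64(2) // networks usage in the intended project
	usageOther := int64(20)   // usage reported for a different, larger project

	// Limit and usage from the same project: non-negative remaining quota.
	fmt.Println(remaining(limitIntended, usageIntended))

	// Limit from the intended project, usage from the other: negative.
	fmt.Println(remaining(limitIntended, usageOther))
}
```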

      On inspection, the compute.googleapis.com/networks response used for usage comes from a different Google Cloud project than the one selected and used for limits.
      https://github.com/openshift/installer/blob/9b44e67f7a5afb953712a12bc753cbb8fc12c9a4/pkg/quota/gcp/usage.go#L35

      The time series' project_id label has the value "openshift-gce-devel-ci", which differs from "openshift-gce-devel", the project selected for installation. See debug screenshots.
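
As an illustration of the label check described above, here is a minimal client-side sketch that drops any time series whose project_id label does not match the selected project. The `series` type and `keepProject` function are simplified, hypothetical stand-ins rather than the Monitoring API types or the installer's code, and the values are invented.

```go
package main

import "fmt"

// series is a hypothetical, simplified stand-in for a monitoring time series.
type series struct {
	ResourceLabels map[string]string
	Value          int64
}

// keepProject returns only the series whose project_id label equals projectID.
func keepProject(all []series, projectID string) []series {
	var kept []series
	for _, s := range all {
		if s.ResourceLabels["project_id"] == projectID {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	// Invented values; only the project names come from the issue.
	all := []series{
		{ResourceLabels: map[string]string{"project_id": "openshift-gce-devel-ci"}, Value: 20},
		{ResourceLabels: map[string]string{"project_id": "openshift-gce-devel"}, Value: 2},
	}
	for _, s := range keepProject(all, "openshift-gce-devel") {
		fmt.Println(s.ResourceLabels["project_id"], s.Value)
	}
}
```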

      Version-Release number of selected component (if applicable):

      installer from source (master) 7667b46: https://github.com/openshift/installer/commit/7667b46bd13a6091d8dd7791e77235adfa161935
      openshift-install unreleased-master-7903-g7667b46bd13a6091d8dd7791e77235adfa161935
      built from commit 7667b46bd13a6091d8dd7791e77235adfa161935
      release image registry.ci.openshift.org/origin/release:4.13
      release architecture amd64
      

      The same issue also occurred on a 4.12.z release of the installer, so it is not a new regression.

      How reproducible:

      Once it has happened, it can be reproduced repeatedly by rerunning the same command. To keep working, I tried choosing a different --dir and/or cluster name, which sometimes avoided the problem.
      

      Steps to Reproduce:

      (I can't verify these steps, but they are the last three things I did when this occurred.)
      
      1. openshift-install create cluster --dir clusters/gcp-dev1 --log-level=debug
      
      FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "tkaovila": record api.tkaovila.gcp.devcluster.openshift.com. already exists in DNS Zone (openshift-gce-devel/devcluster) and might be in use by another cluster, please remove it to continue 
      
      2. openshift-install destroy cluster --dir clusters/gcp-dev1 --log-level=debug
      
      INFO Time elapsed: 27s                            
      INFO Uninstallation complete! 
      
      3. openshift-install create cluster --dir clusters/gcp-dev00 --log-level=debug
      
      FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Quota Check": error(MissingQuota): compute.googleapis.com/networks is not available in global because the required number of resources (1) is more than remaining quota of -143 
      
      

      Actual results:

      FATAL failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Quota Check": error(MissingQuota): compute.googleapis.com/networks is not available in global because the required number of resources (1) is more than remaining quota of -143
      

      Expected results:

      cluster created
      

      Additional info:

       

      Slack thread that links other threads and info
      https://redhat-internal.slack.com/archives/C68TNFWA2/p1679452931661929

      Image from debugging.
      https://drive.google.com/file/d/1O-LRZt5cgglkG-AnXWXhKjytyRyLGvgr/view?usp=share_link

      https://drive.google.com/file/d/1_D2NpI9SBGGlonCDFx6rctwX_gvM1l36/view?usp=share_link

       

      The general instructions for happy path installations came from https://docs.google.com/document/d/1qm37EKkjgoPtjW4909UClzvsjQO5VSpPUvFO_hW_PEg/edit

       

      Happened to at least one other person

      https://redhat-internal.slack.com/archives/CBUT43E94/p1678954988339509?thread_ts=1678954988.339509&cid=CBUT43E94

      and one more

      https://redhat-internal.slack.com/archives/CBUT43E94/p1675196061551869

      and one more

      https://redhat-internal.slack.com/archives/CFUGK0K9R/p1665506381763189?thread_ts=1663621656.170959&cid=CFUGK0K9R

       

       

       


            People

              Unassigned
              Tiger Kaovilai (tkaovila@redhat.com)
              Jianli Wei