  1. Project Quay
  2. PROJQUAY-5074

Health checks should check storage engines

Details

    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • quay-v3.8.0
    • quay
    • False
    • None
    • False
    • Quay Enterprise
    • 0

    Description

      The GET /health/instance health check currently does not check whether distributed storage is up and available. This is also the check we use for kube-probe. If an instance is using distributed storage and that storage fails, the Quay health check should fail too, because pushes and pulls no longer work.

      GET /health/endtoend does check distributed storage engines, but it only checks the preferred storage engine. In geo-replicated environments, the endtoend health check should cover all defined storage engines, not just the preferred one. For instance:

      root@cyberdyne:~# curl https://quay.skynet/health/endtoend
      {"data":{"services":{"auth":true,"database":true,"redis":true,"storage":true}},"status_code":200}
      

      I have two defined storage engines, and one of them is currently offline, yet the check still reports "storage":true.

      Suggestion: GET /health/instance should check the local (preferred) distributed storage engine. GET /health/endtoend should check all defined storage engines (regardless of their geolocation).

      Additional clarification on the storage health checks:

      GET /health/instance should check only the local (preferred) storage engine and should fail the pod only if that engine has issues. It should continue to be used as the health check for all Quay pods.

      GET /health/endtoend should fail only if all services fail; for storage, that means all storage engines. If some but not all storage engines fail (up to N-1 of N), the health check should return a 200 with a warning saying the service is degraded, plus a list of the storage engines that are not available. Local instances whose storage engines failed should already have been removed from load balancing because their instance check has already failed.
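The storage rules proposed above could be sketched roughly as follows. This is only an illustration of the requested behaviour, not Quay's actual health check code: the function names, the engine names, the `warning` and `unavailable_storage_engines` fields, and the 503 failure status are all assumptions made for the example.

```python
from typing import Dict, Tuple


def instance_storage_healthy(engine_status: Dict[str, bool], preferred: str) -> bool:
    """Instance check (hypothetical): only the local (preferred) engine matters."""
    # An engine that is not defined at all is treated as unhealthy.
    return engine_status.get(preferred, False)


def endtoend_storage_result(engine_status: Dict[str, bool]) -> Tuple[int, dict]:
    """End-to-end check (hypothetical): fail only when every engine is down."""
    failed = sorted(name for name, ok in engine_status.items() if not ok)
    # Assumption: having no defined engines also counts as a failure.
    all_failed = not any(engine_status.values())

    body = {"storage": not all_failed}
    if failed and not all_failed:
        # Partial failure: still a 200, but flagged as degraded.
        body["warning"] = "service is degraded"
        body["unavailable_storage_engines"] = failed

    # Assumption: 503 is used as the failing status code.
    return (503 if all_failed else 200), body


if __name__ == "__main__":
    # Two hypothetical engines, one of them offline.
    status = {"site_a": True, "site_b": False}
    print(instance_storage_healthy(status, "site_a"))
    print(endtoend_storage_result(status))
```

With one of two engines offline, the pod whose preferred engine is still healthy keeps passing its instance check, while the end-to-end result stays at 200 and names the unavailable engine.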

      Benefits:

      • in geo-replication, if a single site's storage fails, the user will not have to intervene and shut down Quay in that site; the global load balancer will automatically redirect traffic to healthy Quay pods in sites with functioning storage
      • in geo-replication, customers can create monitoring rules around the /health/endtoend health check so that they can detect partial storage engine failures
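A monitoring rule of the kind mentioned in the second benefit might classify the endtoend response along these lines. This is a minimal sketch: the `data.services` and `status_code` fields follow the sample response shown earlier, while the `unavailable_storage_engines` field is a hypothetical name for the proposed list of failed engines, and the three state labels are made up for the example.

```python
import json


def classify_endtoend(raw_body: str) -> str:
    """Map a /health/endtoend response body to 'healthy', 'degraded' or 'down'."""
    payload = json.loads(raw_body)
    data = payload.get("data", {})
    services = data.get("services", {})

    # Any failing service or a non-200 status code is treated as an outage.
    if payload.get("status_code") != 200 or not all(services.values()):
        return "down"

    # Hypothetical field: a non-empty list means some engines are unavailable.
    if data.get("unavailable_storage_engines"):
        return "degraded"

    return "healthy"


if __name__ == "__main__":
    sample = (
        '{"data":{"services":{"auth":true,"database":true,'
        '"redis":true,"storage":true}},"status_code":200}'
    )
    print(classify_endtoend(sample))
```

An alert could then fire on "degraded" without paging anyone for a full outage, which is the partial-failure visibility the suggestion is after.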

          People

            sleesinc Kenny Lee Sin Cheong
            rhn-support-ibazulic Ivan Bazulic
            Votes: 6
            Watchers: 14
