  1. Project Quay
  2. PROJQUAY-1389

Quay TNG Operator should be improved to be fault tolerant


      Description:

      This issue was found while using the Quay TNG Operator to deploy Quay. After the QuayRegistry CR was created to trigger the Quay deployment, one OCP worker node suddenly became unavailable for scheduling, so some Quay pods stayed in Pending status, including the quay-app-upgrade pod. The TNG Operator waits for the quay-app-upgrade deployment to become ready within a specified time. A few minutes later the OCP worker node recovered and could schedule pods again, but by then the TNG Operator had already reported the error shown in the logs below. It did not move on to create the quay-app pod or start the remaining resources needed to complete the Quay deployment. The TNG Operator logs are attached.

      The issue is that OCP is fault tolerant and starts the pending pods once its worker node recovers, but the Quay TNG Operator is not: after the wait times out, it does not resume and complete the Quay deployment.

      Quay TNG Operator Image:

      [root@ip-10-0-1-60 centos]# oc get pod quay-operator-86d66598b8-rbpx4 -o json | jq '.spec.containers[0].image'
      "registry.redhat.io/quay/quay-rhel8-operator@sha256:079cae91a19aa3dde547c399c8f7478aa9cc437aadc9a309c4b313e39f743d24"
      
      [root@ip-10-0-1-60 centos]# oc get pod
      NAME                                               READY   STATUS                  RESTARTS   AGE
      quay-operator-86d66598b8-rbpx4                     1/1     Running                 0          89m
      quayregistry-clair-app-d547b885c-27svr             1/1     Running                 3          83m
      quayregistry-clair-postgres-58f4b94bbc-6ndld       1/1     Running                 0          83m
      quayregistry-quay-app-upgrade-575bc577dd-z6fkl     1/1     Running                 0          84m
      quayregistry-quay-config-editor-64688f455d-sktfv   1/1     Running                 0          83m
      quayregistry-quay-database-b96c99b55-fcc2m         1/1     Running                 0          83m
      quayregistry-quay-mirror-66d47557fc-7hdnc          0/1     Init:CrashLoopBackOff   14         83m
      quayregistry-quay-redis-d98744d58-xqr7m            1/1     Running                 0          83m
      
      Quay Operator Logs:
      2020-12-15T08:31:12.531Z	ERROR	controllers.QuayRegistry	Quay upgrade deployment never reached ready phase	{"quayregistry": "quay-enterprise/quayregistry", "error": "timed out waiting for the condition"}
      github.com/go-logr/zapr.(*zapLogger).Error
      	/workspace/vendor/github.com/go-logr/zapr/zapr.go:128
      github.com/quay/quay-operator/controllers/quay.(*QuayRegistryReconciler).Reconcile.func1
      	/workspace/controllers/quay/quayregistry_controller.go:329
      
      

              rmarasch@redhat.com Ricardo Maraschini (Inactive)
              lzha1981 luffy zhang