OpenShift Migration Toolkit for Containers / MIG-1707

Migration stuck after StageBackup when terminating pods fails in source cluster

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Blocker
    • MTC 1.8.6
    • MTC 1.8.6
    • ToDo
    • Critical

      Description

      When migrating a Django + PostgreSQL application, the migration gets stuck after the StageBackup phase.

      All pods in the source namespace are healthy before the migration begins:

      oc get pods
      NAME                              READY   STATUS      RESTARTS   AGE
      django-psql-persistent-1-build    0/1     Completed   0          94s
      django-psql-persistent-1-deploy   0/1     Completed   0          58s
      django-psql-persistent-1-sthtg    1/1     Running     0          58s
      postgresql-1-deploy               0/1     Completed   0          93s
      postgresql-1-ggrnx                1/1     Running     0          92s 

      After the migration is started with quiesce checked, and while the pods on the source cluster are being terminated, the Django pod goes into CrashLoopBackOff. The cause is a psycopg2.OperationalError raised when it tries to connect to PostgreSQL before it's ready.

      oc get pods
      NAME                              READY   STATUS             RESTARTS      AGE
      django-psql-persistent-1-build    0/1     Completed          0             5m19s
      django-psql-persistent-1-deploy   0/1     Completed          0             4m43s
      django-psql-persistent-1-sthtg    0/1     CrashLoopBackOff   4 (20s ago)   4m43s
      postgresql-1-deploy               0/1     Completed          0             5m18s 
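
      The failing pod can also be picked out of the listing programmatically. The following is a minimal sketch, not part of MTC: it assumes the default column layout of `oc get pods` shown above (NAME, READY, STATUS as the first three columns), and the function name `crashlooping_pods` is hypothetical.

```python
def crashlooping_pods(oc_output: str) -> list[str]:
    """Return names of pods whose STATUS column reads CrashLoopBackOff.

    Assumes plain `oc get pods` text output, where NAME, READY and STATUS
    are the first three whitespace-separated columns.
    """
    names = []
    for line in oc_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and any line too short to hold three columns.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, _ready, status = fields[:3]
        if status == "CrashLoopBackOff":
            names.append(name)
    return names


# Sample input copied from the listing in this report.
listing = """\
NAME                              READY   STATUS             RESTARTS      AGE
django-psql-persistent-1-build    0/1     Completed          0             5m19s
django-psql-persistent-1-deploy   0/1     Completed          0             4m43s
django-psql-persistent-1-sthtg    0/1     CrashLoopBackOff   4 (20s ago)   4m43s
postgresql-1-deploy               0/1     Completed          0             5m18s
"""

print(crashlooping_pods(listing))  # ['django-psql-persistent-1-sthtg']
```

      Only the first three columns are read, so the spaces inside the RESTARTS value ("4 (20s ago)") do not affect the result.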

      The migration only resumes and completes successfully after manually deleting the source namespace.

      As for the pods on the target cluster: the same application pod that failed on the source during termination is sometimes healthy on the target, and at other times it was observed to be crashing there as well:

        Warning  BackOff         33s (x9 over 106s)  kubelet            Back-off restarting failed container django-psql-persistent in pod django-psql-persistent-2-jw68s_ocp-42282-django(d8caad6b-d9d0-4979-9329-deb8bad360b3)

      Steps to reproduce

      1. Deploy a Django + PostgreSQL application in OpenShift (e.g., using the django-psql-persistent template).
      2. Confirm that all pods are healthy in the source cluster.
      3. Create a migration plan and execute it with quiesce checked.
      4. Observe that the Django pod in the source namespace enters CrashLoopBackOff and logs:
        psycopg2.OperationalError: could not connect to server: Connection refused
      5. Observe that the migration is stuck after the StageBackup phase.
      6. Manually delete the source namespace.
      7. Observe that the migration proceeds successfully and (often) the app stabilizes in the target cluster.

              rhn-support-awels Alexander Wels
              midays mohamed idays