OpenShift API for Data Protection · OADP-145

Restic Restore stuck on InProgress status when app is deployed with DeploymentConfig


    • oadp-velero-plugin-container-1.1.0-17, oadp-operator-container-1.1.0-40
    • OADP Sprint 218

      Problem Description: 

      Not 100% sure this is the root cause, but it seems like restic gets stuck on restore when the app is deployed by a DeploymentConfig.

      • Doesn't happen with 2 stateful apps that do not use a DeploymentConfig
      • Tried 3 other apps with a DeploymentConfig; all show the same behavior
      • The issue doesn't occur when using volume snapshots
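
      One way to see whether restic is doing any work while the restore sits in InProgress is to check the PodVolumeRestore objects that Velero creates for each restic-backed volume (a quick sketch, assuming the default openshift-adp namespace):

      # each restic-backed volume gets its own PodVolumeRestore; a stuck one stays in New or InProgress
      oc get podvolumerestores -n openshift-adp
      # compact view of just name and phase
      oc get podvolumerestores -n openshift-adp \
        -o custom-columns=NAME:.metadata.name,PHASE:.status.phase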

      Observed Results:

      (mtc-e2e-venv) [mperetz@mperetz mtc-e2e-qev2]$ oc get restore -n openshift-adp -o yaml
      apiVersion: v1
      items:
      - apiVersion: velero.io/v1
        kind: Restore
        metadata:
          creationTimestamp: "2021-11-30T16:08:02Z"
          generation: 11
          name: mongodb123
          namespace: openshift-adp
          resourceVersion: "915253"
          uid: ddbf6e75-ec7f-442b-93c5-778af79d52f5
        spec:
          backupName: mongodb123
          excludedResources:
          - nodes
          - events
          - events.events.k8s.io
          - backups.velero.io
          - restores.velero.io
          - resticrepositories.velero.io
          restorePVs: true
        status:
          phase: InProgress
          progress:
            itemsRestored: 43
            totalItems: 43
          startTimestamp: "2021-11-30T16:08:02Z"
      kind: List
      metadata:
        resourceVersion: ""
        selfLink: ""
      (mtc-e2e-venv) [mperetz@mperetz mtc-e2e-qev2]$ oc get restore -n openshift-adp 
      NAME         AGE
      mongodb123   7m15s
      (mtc-e2e-venv) [mperetz@mperetz mtc-e2e-qev2]$ 
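
      If the velero CLI is available, it can summarize the same restore with per-volume restic detail (mongodb123 is the restore shown above; pass the namespace Velero runs in):

      velero restore describe mongodb123 --details -n openshift-adp
      velero restore logs mongodb123 -n openshift-adp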
       

       

      Getting these errors from velero:

      time="2021-11-30T16:24:48Z" level=info msg="Backup storage location is invalid, marking as unavailable" backup-storage-location=default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:117"
      time="2021-11-30T16:24:48Z" level=error msg="Current backup storage locations available/unavailable/unknown: 0/1/0, Backup storage location \"default\" is unavailable: rpc error: code = Unknown desc = AccessDenied: Access Denied\n\tstatus code: 403, request id: NZFNS1E4YBSA2C2R, host id: bFDBsSSnwrMsck8QvzR3QJ5enMszisF8RTdVcD+l+ui5FqPrnAyoHKpqqkMMQTTBmDyq1iKntZs=)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:164"
      time="2021-11-30T16:24:48Z" level=error msg="Current backup storage locations available/unavailable/unknown: 0/1/0)" controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:166"
       

       

      Restic logs don't say much:

      (mtc-e2e-venv) [mperetz@mperetz mtc-e2e-qev2]$ oc logs daemonset.apps/restic -n openshift-adp 
      Found 6 pods, using pod/restic-nwgjp
      time="2021-11-30T16:05:14Z" level=info msg="Setting log-level to INFO"
      time="2021-11-30T16:05:14Z" level=info msg="Starting Velero restic server konveyor-dev (-)" logSource="pkg/cmd/cli/restic/server.go:87"
      2021-11-30T16:05:14.496Z    INFO    controller-runtime.metrics    metrics server is starting to listen    {"addr": ":8080"}
      time="2021-11-30T16:05:14Z" level=info msg="Starting controllers" logSource="pkg/cmd/cli/restic/server.go:198"
      time="2021-11-30T16:05:14Z" level=info msg="Starting metric server for restic at address [:8085]" logSource="pkg/cmd/cli/restic/server.go:189"
      time="2021-11-30T16:05:14Z" level=info msg="Controllers starting..." logSource="pkg/cmd/cli/restic/server.go:249"
      2021-11-30T16:05:14.552Z    INFO    controller-runtime.manager    starting metrics server    {"path": "/metrics"}
      time="2021-11-30T16:05:14Z" level=info msg="Starting controller" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:76"
      time="2021-11-30T16:05:14Z" level=info msg="Waiting for caches to sync" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:81"
      time="2021-11-30T16:05:14Z" level=info msg="Starting controller" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:76"
      time="2021-11-30T16:05:14Z" level=info msg="Waiting for caches to sync" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:81"
      time="2021-11-30T16:05:14Z" level=info msg="Caches are synced" controller=pod-volume-restore logSource="pkg/controller/generic_controller.go:85"
      time="2021-11-30T16:05:14Z" level=info msg="Caches are synced" controller=pod-volume-backup logSource="pkg/controller/generic_controller.go:85"
       

      Restic pods are always Ready and Running:

      NAME                                                  READY   STATUS    RESTARTS   AGE
      oadp-example-velero-1-aws-registry-554545f7d6-99spc   1/1     Running   0          69m
      openshift-adp-controller-manager-d79f5fcd6-8lhz9      2/2     Running   0          132m
      restic-27v2p                                          1/1     Running   0          69m
      restic-56nqh                                          1/1     Running   0          69m
      restic-bkx25                                          1/1     Running   0          69m
      restic-h2pfk                                          1/1     Running   0          69m
      restic-vchhp                                          1/1     Running   0          69m
      restic-ws54h                                          1/1     Running   0          69m
      velero-769494ddb9-86lgq                               1/1     Running   0          2m57s
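
      Because restic runs as a per-node DaemonSet, the restore work lands on whichever restic pod shares a node with the restored application pod; -o wide makes that mapping visible:

      oc get pods -n openshift-adp -o wide
      oc get pods -n <app-namespace> -o wide   # the namespace of the restored app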
       

      Version: 0.5.0

      Steps to reproduce:

      1. Clone this repo: https://gitlab.cee.redhat.com/app-mig/cam-e2e-qe
      2. Run a playbook that deploys an app with a DeploymentConfig, for example:
        ansible-playbook cam-e2e-qe/deploy-app.yml -e use_role=roles/ocp-redis/ -e namespace=redis-ns
      3. Create a backup:
        cat <<EOF | oc create -f -
        apiVersion: velero.io/v1
        kind: Backup
        metadata:
          name: redis-ns
          labels:
            velero.io/storage-location: example-velero-1
          namespace: openshift-adp
        spec:
          hooks: {}
          includedNamespaces:
          - redis-ns
          storageLocation: example-velero-1 
          defaultVolumesToRestic: true
          snapshotVolumes: false
          ttl: 720h0m0s
        EOF
      4. Delete the project:
        oc delete project redis-ns
      5. Create a restore:
        cat <<EOF | oc create -f -
        apiVersion: velero.io/v1
        kind: Restore
        metadata:
          name: redis-ns
          namespace: openshift-adp
        spec:
          backupName: redis-ns
          excludedResources:
          - nodes
          - events
          - events.events.k8s.io
          - backups.velero.io
          - restores.velero.io
          - resticrepositories.velero.io
          restorePVs: true
        EOF
      6. Note that the restore is stuck in InProgress even after all items appear to have been restored (see the verification commands below)
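
      To verify each stage of the reproducer (a sketch; the names match the steps above):

        # step 2: the app should expose a DeploymentConfig
        oc get deploymentconfig -n redis-ns
        # step 3: the backup phase should reach Completed before deleting the project
        oc get backup redis-ns -n openshift-adp -o jsonpath='{.status.phase}{"\n"}'
        # step 6: the restore phase stays InProgress and the PodVolumeRestores never complete
        oc get restore redis-ns -n openshift-adp -o jsonpath='{.status.phase}{"\n"}'
        oc get podvolumerestores -n openshift-adp --watch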

              Assignee: Scott Seago (sseago)
              Reporter: Maya Peretz (mperetz@redhat.com)
              QA Contact: Prasad Joshi