OpenShift Bugs / OCPBUGS-7240

TALM backup - recovery script fails to find the etcd container even though it is running


    • Bug
    • Resolution: Done
    • Undefined
    • None
    • 4.11.z
    • TALM Operator
    • Important
    • None
    • CNF RAN Sprint 232
    • 1
    • Proposed
    • False

      This is a clone of issue OCPBUGS-6944. The following is the description of the original issue:

      Description of problem:

      The recovery script failed with the following stderr output, even though the etcd container was already running.

      $ upgrade-recovery.sh --resume

      ##### Tue Jan 31 22:40:55 UTC 2023: Waiting for etcd container to restart
      ##### Tue Jan 31 23:01:03 UTC 2023: etcd container is not Running. Please investigate
      
      
      
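      The error comes from the line linked below, which assumes that the container name and state appear at fixed column positions in the crictl ps output. Position-based parsing of this kind is unreliable in general. Suggest either composing the command so that it requests explicit, specific columns, or locating the right columns by matching against the header line.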
      https://github.com/openshift-kni/cluster-group-upgrades-operator/blob/release-4.12/recovery/bindata/upgrade-recovery.sh#L114
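
The header-matching suggestion above could look roughly like the sketch below. It is not the fix that was merged; container_state is a hypothetical helper name, and the approach assumes that crictl ps prints its columns aligned under the STATE and NAME headers, so that a header's character offset also locates the value in each row.

```shell
# Hypothetical helper (not part of upgrade-recovery.sh): print the STATE of
# the container whose NAME equals $1, reading "crictl ps"-style text on stdin.
container_state() {
    awk -v name="$1" '
        NR == 1 {
            # Header line: remember the character offset of each column title.
            state_pos = index($0, "STATE")
            name_pos  = index($0, "NAME")
            next
        }
        state_pos > 0 && name_pos > 0 {
            # Read the first word found at each remembered offset.
            split(substr($0, name_pos), n, " ")
            split(substr($0, state_pos), s, " ")
            if (n[1] == name) { print s[1]; exit 0 }
        }
    '
}

# Usage on a node (assumes crictl is on PATH):
#   crictl ps 2>/dev/null | container_state etcd
```

Because the lookup is keyed on the header text rather than on a distance from the end of the line, adding or removing trailing columns should not change the result, as long as the alignment assumption holds.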

      Version-Release number of selected component (if applicable):

      4.11.25, 4.12.z

      How reproducible:

      100%

      Steps to Reproduce:

      1. Enable OCP upgrade with backup via TALM
      2. Before the OCP upgrade completes, start recovery via: upgrade-recovery.sh
      3. Start second phase of recovery via: upgrade-recovery.sh --resume 

      Actual results:

      The second phase of recovery fails to find a Running etcd container, due to the script issue at
      https://github.com/openshift-kni/cluster-group-upgrades-operator/blob/release-4.12/recovery/bindata/upgrade-recovery.sh#L114
      

      Expected results:

      Second phase of recovery passes.

      Additional info:

      I manually tried the following. The first command is taken from the product code and prints nothing; the second is the working command, with the field offsets shifted by one.
      [root@cnfde17 core]# name=etcd
      [root@cnfde17 core]# crictl ps 2>/dev/null | awk -v name="${name}" '{if ($(NF-2) == name) {print $(NF-3); exit 0}}'
      [root@cnfde17 core]# crictl ps 2>/dev/null | awk -v name="${name}" '{if ($(NF-3) == name) {print $(NF-4); exit 0}}'
      Running
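
An alternative that avoids column parsing altogether is to let crictl do the filtering. The sketch below is an assumption-laden illustration, not the shipped fix: is_container_running and the CRICTL override are hypothetical names, and it relies on crictl ps supporting the --state, --name and --quiet options, which should be verified against the crictl version on the node.

```shell
# Hypothetical helper: succeed if a container named exactly $1 is running.
# CRICTL can be overridden for testing; it defaults to the real crictl.
is_container_running() {
    # --name is treated as a regular expression, so anchor it to avoid
    # matching names that merely contain $1 (e.g. etcdctl for etcd).
    ids=$("${CRICTL:-crictl}" ps --state running --name "^${1}\$" --quiet 2>/dev/null)
    # With --quiet only container IDs are printed, so any output means a match.
    [ -n "$ids" ]
}

# Usage on a node:
#   if is_container_running etcd; then echo "etcd is Running"; fi
```

With this form the check depends only on whether any container ID is returned, so changes to the table layout of crictl ps would not affect it.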
      
      
      

              jche@redhat.com Jun Chen
              openshift-crt-jira-prow OpenShift Prow Bot
              Yang Liu