OpenShift Bugs / OCPBUGS-17240

restore with snapshot bump returns old data


    • Bug
    • Resolution: Done-Errata
    • Normal
    • None
    • 4.14
    • Etcd
    • None
    • Low
    • No
    • 5
    • ETCD Sprint 240, ETCD Sprint 241, ETCD Sprint 242
    • 3
    • False

      Description of problem:

      With https://github.com/openshift/origin/pull/28073 we have introduced the upstream feature of bumping snapshot revisions. 
      
      This drastically improved our pass rates and our restore procedure:
      
      https://sippy.dptools.openshift.org/sippy-ng/jobs/4.14/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-cluster-etcd-operator-release-4.14-periodics-e2e-aws-etcd-recovery%22%7D%5D%7D
      
      Sometimes, however, assertions fail like this:
      > fail [github.com/openshift/origin/test/extended/dr/resource_assertions.go:98]: Expected an error to have occurred.  Got:
          <nil>: nil
      
      This indicates that a namespace that should not be included in the snapshot was still retrieved from the API after a restore:
      
      https://github.com/openshift/origin/blob/6ee9dc56a612a4c886d094571832ed47efa2e831/test/extended/dr/resource_assertions.go#L97-L99
      
      This should not happen: the namespace was created after the snapshot was taken, so it should not be found after the restore.
      
      run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-cluster-etcd-operator-release-4.14-periodics-e2e-aws-etcd-recovery/1685868016341880832
      
      
      
      

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      Rarely at first, now fairly often

      Steps to Reproduce:

      1. Run the e2e recovery test multiple times
      
      

      Actual results:

      The test finds resources that should not exist after the restore

      Expected results:

      The test does *not* find resources that are not included in the snapshot

      Additional info:

      We have two possible explanations: 
      * etcd does indeed contain that namespace, which should be easy to verify with etcdctl (meaning that either our snapshot is wrong or the restore procedure picks up a leftover WAL that contains those changes)
      * the API server still serves the namespace from its cache
      
      Or, of course, the assertion is wrong :)  
      
      Low priority because this feature is not "officially" shipped in 4.14.

       

            Thomas Jungblut (tjungblu@redhat.com)
            ge liu