  1. OpenShift Bugs
  2. OCPBUGS-26404

PSM should make a best effort to get the infrastructure status


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • 4.14.z, 4.15.0, 4.16
    • OLM
    • Quality / Stability / Reliability
    • False
    • Low
    • No
    • Rejected

      Currently, if PSM (package-server-manager) cannot get the infrastructure status, it falls back to the HA values for leader election. Code: https://github.com/openshift/operator-framework-olm/blob/master/pkg/leaderelection/leaderelection.go#L59-L63

      MacBook-Pro:must-gather-sno2 jianzhang$ omg logs package-server-manager-dc4dd8c64-7jw5z |grep "unable to get cluster infrastructure status" 
      2023-12-30T22:03:30.690296328Z 2023-12-30T22:03:30Z	ERROR	setup	unable to get cluster infrastructure status, using HA cluster values for leader election	{"error": "Get \"https://172.30.0.1:443/apis/config.openshift.io/v1/infrastructures/cluster\": context deadline exceeded"} 

      However, the infrastructure status can be retrieved successfully later, so could we add a retry for it? Thanks!

      The test log: https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/488495/57309118/57309876/log?item1Params=filter.eq.hasStats%3Dtrue%26filter.eq.hasChildren%3Dfalse%26filter.in.type%3DSTEP%26filter.in.status%3DFAILED%252CINTERRUPTED

      Dec 30 22:47:56.116: INFO: Running 'oc --kubeconfig=/tmp/kubeconfig-3268075730 get lease packageserver-controller-lock -n openshift-operator-lifecycle-manager -o=jsonpath={.spec.leaseDurationSeconds}'
      ...
      Dec 30 22:47:56.222: INFO: This is a SNO cluster
      ...
      fail [github.com/openshift/openshift-tests-private/test/extended/operators/olm.go:868]: The lease duration is not as expected: 137 

      The test case: https://github.com/openshift/openshift-tests-private/blob/master/test/extended/operators/olm.go#L803-L822 

      g.It("NonHyperShiftHOST-Author:jiazha-Medium-49352-SNO Leader election conventions for cluster topology", func() {
              exutil.By("1) get the cluster topology")
              infra, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("infrastructures", "cluster", "-o=jsonpath={.status.controlPlaneTopology}").Output()
              if err != nil {
                  e2e.Failf("Fail to get the cluster infra: %s, error:%v", infra, err)
              }
              exutil.By("2) get the leaseDurationSeconds of the packageserver-controller-lock")
              leaseDurationSeconds, err := oc.AsAdmin().WithoutNamespace().Run("get").Args("lease", "packageserver-controller-lock", "-n", "openshift-operator-lifecycle-manager", "-o=jsonpath={.spec.leaseDurationSeconds}").Output()
              if err != nil {
                  e2e.Failf("Fail to get the leaseDurationSeconds: %s, error:%v", leaseDurationSeconds, err)
              }
              if infra == "SingleReplica" {
                  e2e.Logf("This is a SNO cluster")
                  if !strings.Contains(leaseDurationSeconds, "270") {
                      e2e.Failf("The lease duration is not as expected: %s", leaseDurationSeconds)
                  }
              } else {
                  g.Skip("This is a HA cluster, skip.")
              }
          })
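
      The check that fails in this test can be restated as a small standalone sketch. The two durations are taken only from this report: 270 is what the test expects on SNO, and 137 is the value observed after the HA fallback was logged. `expectedLeaseSeconds` and `checkLease` are hypothetical helpers, not part of the test suite, and unlike the test they do not skip HA clusters.

```go
package main

import "fmt"

// Values as they appear in this report; treat them as assumptions about
// the defaults rather than an authoritative reference.
const (
	snoLeaseSeconds = 270 // expected by the test on a SingleReplica cluster
	haLeaseSeconds  = 137 // observed after PSM fell back to the HA values
)

// expectedLeaseSeconds maps a controlPlaneTopology value to the lease
// duration the report associates with it.
func expectedLeaseSeconds(controlPlaneTopology string) int {
	if controlPlaneTopology == "SingleReplica" {
		return snoLeaseSeconds
	}
	return haLeaseSeconds
}

// checkLease returns an error when the observed lease duration does not
// match the one expected for the given topology.
func checkLease(controlPlaneTopology string, observed int) error {
	want := expectedLeaseSeconds(controlPlaneTopology)
	if observed != want {
		return fmt.Errorf("lease duration for %s topology: got %d, want %d", controlPlaneTopology, observed, want)
	}
	return nil
}
```

      With the values from the failing run, `checkLease("SingleReplica", 137)` returns an error, which corresponds to the failure at olm.go:868.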

      The must-gather log: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-baremetal-pxe-sno-agent-ipv4-static-connected-f14/1741208355214462976/artifacts/baremetal-pxe-sno-agent-ipv4-static-connected-f14/gather-must-gather/artifacts/

      MacBook-Pro:~ jianzhang$ omg get infrastructures cluster -o yaml
      ...
      spec:
        cloudConfig:
          name: ''
        platformSpec:
          type: None
      status:
        apiServerInternalURI: https://api-int.ci-op-h2xyljb0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443
        apiServerURL: https://api.ci-op-h2xyljb0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443
        controlPlaneTopology: SingleReplica
        cpuPartitioning: None
        etcdDiscoveryDomain: ''
        infrastructureName: ci-op-h2xyljb0-qshsl
        infrastructureTopology: SingleReplica
        platform: None
        platformStatus:
          type: None
      
      NAME       STATUS  ROLES                              AGE    VERSION
      master-00  Ready   control-plane,master,worker,wscan  4h29m  v1.27.8+4fab27b 

              rh-ee-cchantse Catherine Chan-Tse
              rhn-support-jiazha Jian Zhang
              Jian Zhang Jian Zhang