  1. OpenShift SDN
  2. SDN-3060

etcdHighNumberOfLeaderChanges alert firing for extended period


      job link

      must-gather

      snippet from test output:

      {  4 events happened too frequently
      
      event happened 22 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods" to "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods\nCSISnapshotWebhookControllerProgressing: desired generation 2, current generation 1"
      event happened 24 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods\nCSISnapshotWebhookControllerProgressing: desired generation 2, current generation 1" to "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods"
      event happened 23 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods\nCSISnapshotWebhookControllerProgressing: 1 out of 2 pods running" to "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods\nCSISnapshotWebhookControllerProgressing: desired generation 2, current generation 1"
      event happened 23 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged Status for clusteroperator/csi-snapshot-controller changed: Progressing message changed from "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods\nCSISnapshotWebhookControllerProgressing: desired generation 2, current generation 1" to "CSISnapshotControllerProgressing: Waiting for Deployment to deploy csi-snapshot-controller pods\nCSISnapshotWebhookControllerProgressing: 1 out of 2 pods running"}
      

      This failure comes from the check "[sig-arch] events should not repeat pathologically", which essentially looks for troubling
      events that occur more than X times (I think X is 20). This particular issue around CSISnapshotWebhookControllerProgressing
      seems to happen periodically in our ovn-upgrade job. I only saw one case of it in an openshift-sdn job, and that was in a
      slightly more complicated upgrade-rollback job. I think this one is worth chasing down since it seems to be somehow affected
      by OVN.
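
For anyone triaging similar output, here is a rough Python sketch of the kind of filtering the check appears to do: pull the repeat count out of each "event happened N times" line and keep the ones above a threshold. This is not the actual test code. The threshold of 20 is only the guess stated above, the helper name `pathological_events` is made up for illustration, and the second sample line is invented so there is something below the threshold.

```python
import re

# Assumption: the real check's threshold is not confirmed; 20 is the guess from this report.
ASSUMED_THRESHOLD = 20

# Matches lines in the shape shown in the test output snippet above.
EVENT_LINE = re.compile(r"event happened (\d+) times, something is wrong: (.*)")


def pathological_events(output, threshold=ASSUMED_THRESHOLD):
    """Return (count, message) pairs for events repeated more than `threshold` times."""
    hits = []
    for line in output.splitlines():
        match = EVENT_LINE.search(line)
        if match and int(match.group(1)) > threshold:
            hits.append((int(match.group(1)), match.group(2).strip()))
    return hits


# First line is abbreviated from the snippet above; second line is hypothetical.
sample = """\
event happened 22 times, something is wrong: ns/openshift-cluster-storage-operator deployment/csi-snapshot-controller-operator - reason/OperatorStatusChanged
event happened 5 times, something is wrong: ns/example deployment/example - reason/Example
"""

for count, message in pathological_events(sample):
    print(count, message)
```

Running this over saved test output from several jobs would make it easier to compare which events cross the threshold in ovn jobs versus openshift-sdn jobs.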

      There was a bug matching this kind of problem filed back in March, but it has since been
      marked RESOLVED, so I don't think anyone is actively looking at this anymore.

      Here's a search.ci link that shows all of our aws-ovn-upgrade jobs that hit this problem over the last 7 days.

      Link to this job's testgrid for reference.

       

              jluhrsen Jamo Luhrsen
