  1. Red Hat Workload Availability
  2. RHWA-751

nodehealthcheck_ongoing_remediation metric stuck at 1 after remediation completes due to mismatched label sets


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • Node Healthcheck

      Steps to reproduce:
      {
        "metric": {
          "__name__": "nodehealthcheck_ongoing_remediation",
          "container": "kube-rbac-proxy",
          "endpoint": "https",
          "exported_instance": "x.x.x.x:yy",
          "exported_job": "node-healthcheck-controller-manager-metrics-service",
          "exported_namespace": "openshift-workload-availability",
          "instance": "my-redacted-instance:my-port",
          "job": "prometheus-federate-job",
          "name": "<node-name>",    // This is the node name
          "namespace": "openshift-workload-availability",
          "pod": "node-healthcheck-controller-manager-7f6475d7d4-zvrxw",
          "prometheus": "openshift-monitoring/k8s",
          "prometheus_replica": "prometheus-k8s-0",
          "remediation": "SelfNodeRemediation",
          "service": "node-healthcheck-controller-manager-metrics-service"
        },
        "value": [
          1771598863.253,
          "1"    // Value is "1" for the node name
        ]
      },
      {
        "metric": {
          "__name__": "nodehealthcheck_ongoing_remediation",
          "container": "kube-rbac-proxy",
          "endpoint": "https",
          "exported_instance": "x.x.x.x:yy",
          "exported_job": "node-healthcheck-controller-manager-metrics-service",
          "exported_namespace": "openshift-workload-availability",
          "instance": "my-redacted-instance:my-port",
          "job": "prometheus-federate",
          "name": "<node-name>-mh7vw",    // This is the remediation CR
          "namespace": "openshift-workload-availability",
          "pod": "node-healthcheck-controller-manager-7f6475d7d4-zvrxw",
          "prometheus": "openshift-monitoring/k8s",
          "prometheus_replica": "prometheus-k8s-0",
          "remediation": "SelfNodeRemediation",
          "service": "node-healthcheck-controller-manager-metrics-service"
        },
        "value": [
          1771598863.253,
          "0"    // Value is "0" for the remediation CR
        ]
      }

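      The issue appears to have been introduced in PR #231, when the call
      metrics.ObserveNodeHealthCheckRemediationDeleted(node.GetName(), remediationCR.GetNamespace(), remediationCR.GetKind())

      was moved into the deleteRemediationCR method and changed to:
      metrics.ObserveNodeHealthCheckRemediationDeleted(remediationCR.GetName(), remediationCR.GetNamespace(), remediationCR.GetKind())

      The "name" label passed on deletion is now the remediation CR name rather than the node name. As a result, the series labelled with the node name is never reset and stays at 1, while a separate series labelled with the CR name is reported as 0.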
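
      The label mismatch can be illustrated with a small standalone model. This is only a sketch, not the operator's code: gaugeVec, observeCreated, observeDeleted and simulate are hypothetical names, the node name "node-1" is made up, and the sketch assumes that the metric is set to 1 under the node name when remediation starts and set to 0 (not removed) on deletion, which is what the query result above suggests.

```go
package main

import "fmt"

// seriesKey is a hypothetical stand-in for the labels that differ between
// series in the query result: name, namespace and remediation kind.
type seriesKey struct {
	Name        string
	Namespace   string
	Remediation string
}

// gaugeVec is a toy model of a labelled gauge: one value per label set.
type gaugeVec map[seriesKey]float64

// observeCreated marks a remediation as ongoing for the given labels.
func (g gaugeVec) observeCreated(name, namespace, kind string) {
	g[seriesKey{Name: name, Namespace: namespace, Remediation: kind}] = 1
}

// observeDeleted marks a remediation as finished for the given labels.
func (g gaugeVec) observeDeleted(name, namespace, kind string) {
	g[seriesKey{Name: name, Namespace: namespace, Remediation: kind}] = 0
}

// simulate records a remediation start under the node name (assumed) and a
// deletion under deleteName, then returns the resulting series.
func simulate(deleteName string) gaugeVec {
	const ns, kind = "openshift-workload-availability", "SelfNodeRemediation"
	g := gaugeVec{}
	g.observeCreated("node-1", ns, kind)
	g.observeDeleted(deleteName, ns, kind)
	return g
}

func main() {
	const ns, kind = "openshift-workload-availability", "SelfNodeRemediation"
	node := seriesKey{Name: "node-1", Namespace: ns, Remediation: kind}

	// Deletion keyed by the CR name: the node-name series is left at 1.
	mismatched := simulate("node-1-mh7vw")
	fmt.Println("mismatched:", len(mismatched), "series, node value", mismatched[node])

	// Deletion keyed by the node name: the same series is reset to 0.
	matched := simulate("node-1")
	fmt.Println("matched:", len(matched), "series, node value", matched[node])
}
```

      In this model, deleting under the CR name leaves two series (the node-name series at 1 and a new CR-name series at 0), which mirrors the pair of results shown in the query output. Deleting under the node name leaves a single series at 0.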

              Assignee: Unassigned
              Reporter: Michael Habash (rh-ee-mhabash)