Steps to reproduce: query the nodehealthcheck_ongoing_remediation metric while a remediation is ongoing. Two time series are returned: one labeled with the node name (value "1") and a second labeled with the remediation CR name (value "0"):
{
"metric":
{
"__name__": "nodehealthcheck_ongoing_remediation",
"container": "kube-rbac-proxy",
"endpoint": "https",
"exported_instance": "x.x.x.x:yy",
"exported_job": "node-healthcheck-controller-manager-metrics-service",
"exported_namespace": "openshift-workload-availability",
"instance": "my-redacted-instance:my-port",
"job": "prometheus-federate-job",
"name": "<node-name>", // This is the node name
"namespace": "openshift-workload-availability",
"pod": "node-healthcheck-controller-manager-7f6475d7d4-zvrxw",
"prometheus": "openshift-monitoring/k8s",
"prometheus_replica": "prometheus-k8s-0",
"remediation": "SelfNodeRemediation",
"service": "node-healthcheck-controller-manager-metrics-service"
},
"value": [
1771598863.253,
"1" // Value is "1" for Node name
]
},
{
"metric":
{
"__name__": "nodehealthcheck_ongoing_remediation",
"container": "kube-rbac-proxy",
"endpoint": "https",
"exported_instance": "x.x.x.x:yy",
"exported_job": "node-healthcheck-controller-manager-metrics-service",
"exported_namespace": "openshift-workload-availability",
"instance": "my-redacted-instance:my-port",
"job": "prometheus-federate",
"name": "<node-name>-mh7vw", // This is the remediation CR
"namespace": "openshift-workload-availability",
"pod": "node-healthcheck-controller-manager-7f6475d7d4-zvrxw",
"prometheus": "openshift-monitoring/k8s",
"prometheus_replica": "prometheus-k8s-0",
"remediation": "SelfNodeRemediation",
"service": "node-healthcheck-controller-manager-metrics-service"
},
"value": [
1771598863.253,
"0" // Value is "0" for remediation CR
]
}
The issue appears to have been introduced in PR #231. Previously the metric was observed with the node's name:
metrics.ObserveNodeHealthCheckRemediationDeleted(node.GetName(), remediationCR.GetNamespace(), remediationCR.GetKind())
When the call was moved into the deleteRemediationCR method, the first argument changed to the remediation CR's name:
metrics.ObserveNodeHealthCheckRemediationDeleted(remediationCR.GetName(), remediationCR.GetNamespace(), remediationCR.GetKind())
As a result, the deletion is recorded against the CR name, so the series created under the node name is never cleared.