- Bug
- Resolution: Unresolved
- Normal
- None
- 4.21.0
- None
Description of problem:
In a 4.21.0-0.nightly-2025-11-22-193140 cluster, all machine-config-daemon targets are down.
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=ALERTS{alertname="TargetDown",namespace="openshift-machine-config-operator"}' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "ALERTS",
          "alertname": "TargetDown",
          "alertstate": "firing",
          "job": "machine-config-daemon",
          "namespace": "openshift-machine-config-operator",
          "prometheus": "openshift-monitoring/k8s",
          "service": "machine-config-daemon",
          "severity": "warning"
        },
        "value": [
          1764061506.224,
          "1"
        ]
      }
    ],
    "analysis": {}
  }
}
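The query above can also be turned into a scripted check that treats a non-empty result vector as "alert firing". This is a minimal sketch assuming the response shape shown above; the sample response is inlined here for illustration, and in a real check the curl output would be piped in instead:

```shell
# Sample Thanos query response, abbreviated from the output above.
response='{"status":"success","data":{"resultType":"vector","result":[{"metric":{"alertname":"TargetDown","alertstate":"firing","job":"machine-config-daemon"},"value":[1764061506.224,"1"]}]}}'

# A non-empty result vector means TargetDown is firing for the namespace.
firing=$(echo "$response" | jq '.data.result | length')
if [ "$firing" -gt 0 ]; then
  echo "TargetDown is firing for $firing series"
fi
```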
As an admin user, go to the web console, "Observe -> Targets", and search for openshift-machine-config-operator targets: all machine-config-daemon targets are down. Clicking one down target shows the error:
Get "https://10.0.50.248:9001/metrics": tls: failed to verify certificate: x509: certificate is valid for kube-rbac-proxy.openshift-machine-config-operator.svc, kube-rbac-proxy.openshift-machine-config-operator.svc.cluster.local, not machine-config-daemon.openshift-machine-config-operator.svc
See screenshot: https://drive.google.com/file/d/1UVE9a-oX3pnQKkoMnFoLCKZSdWcfRZGJ/view?usp=drive_link
Checked the ServiceMonitor; serverName is machine-config-daemon.openshift-machine-config-operator.svc:
$ oc -n openshift-machine-config-operator get servicemonitor machine-config-daemon -oyaml
...
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: metrics
    relabelings:
    - action: replace
      regex: ;(.*)
      replacement: $1
      separator: ;
      sourceLabels:
      - node
      - __meta_kubernetes_pod_node_name
      targetLabel: node
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: machine-config-daemon.openshift-machine-config-operator.svc
...
But the serving certificate is issued for DNS:kube-rbac-proxy.openshift-machine-config-operator.svc and DNS:kube-rbac-proxy.openshift-machine-config-operator.svc.cluster.local, not for machine-config-daemon.openshift-machine-config-operator.svc, which matches the error shown on the Targets page:
$ oc -n openshift-machine-config-operator get pod -o wide | grep machine-config-daemon | grep 10.0.50.248
machine-config-daemon-ktlft   2/2   Running   0   8h   10.0.50.248   ip-10-0-50-248.us-east-2.compute.internal   <none>   <none>

$ oc -n openshift-machine-config-operator exec -c machine-config-daemon machine-config-daemon-ktlft -- openssl s_client -connect 10.0.50.248:9001 -servername 10.0.50.248 | openssl x509 -noout -text
depth=1 CN = openshift-service-serving-signer@1764031164
verify error:num=19:self-signed certificate in certificate chain
verify return:1
depth=1 CN = openshift-service-serving-signer@1764031164
verify return:1
depth=0 CN = kube-rbac-proxy.openshift-machine-config-operator.svc
verify return:1
DONE
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2439036565063936562 (0x21d932d556d32a32)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=openshift-service-serving-signer@1764031164
        Validity
            Not Before: Nov 25 00:39:42 2025 GMT
            Not After : Nov 25 00:39:43 2027 GMT
        Subject: CN=kube-rbac-proxy.openshift-machine-config-operator.svc
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                ...
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier:
                C1:B5:E1:AF:F3:2F:43:6F:75:4F:3C:48:B0:44:55:FF:9C:64:B0:0C
            X509v3 Authority Key Identifier:
                3B:D0:50:C1:5C:76:C1:FE:F5:1C:F5:53:E3:14:2F:65:68:B7:44:B2
            X509v3 Subject Alternative Name:
                DNS:kube-rbac-proxy.openshift-machine-config-operator.svc, DNS:kube-rbac-proxy.openshift-machine-config-operator.svc.cluster.local
            1.3.6.1.4.1.2312.17.100.2.1:
                .$597ff7b2-a9cb-4af2-ad34-2b7da12df615
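The hostname mismatch can be reproduced locally without cluster access. The sketch below generates a throwaway self-signed certificate carrying the same SANs as the serving cert above, then checks it against the serverName the ServiceMonitor configures; the file paths and key parameters are illustrative, not taken from the cluster:

```shell
# Throwaway self-signed cert with the SANs the serving cert actually carries
# (requires OpenSSL >= 1.1.1 for -addext).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/mcd-test-key.pem -out /tmp/mcd-test-cert.pem \
  -subj "/CN=kube-rbac-proxy.openshift-machine-config-operator.svc" \
  -addext "subjectAltName=DNS:kube-rbac-proxy.openshift-machine-config-operator.svc,DNS:kube-rbac-proxy.openshift-machine-config-operator.svc.cluster.local"

# A name present in the SAN list verifies successfully...
openssl x509 -in /tmp/mcd-test-cert.pem -noout \
  -checkhost kube-rbac-proxy.openshift-machine-config-operator.svc

# ...while the serverName from the ServiceMonitor does not, which is the
# same x509 hostname failure Prometheus reports on the Targets page.
openssl x509 -in /tmp/mcd-test-cert.pem -noout \
  -checkhost machine-config-daemon.openshift-machine-config-operator.svc
```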
This is a 4.21 regression; there is no such issue on 4.20.
Version-Release number of selected component (if applicable):
4.21
How reproducible:
always
Steps to Reproduce:
1. Check the machine-config-daemon targets status under "Observe -> Targets".
Actual results:
All machine-config-daemon targets are down on 4.21.
Expected results:
All machine-config-daemon targets should be up.
Additional info: