[OCPBUGS-31250] Alert for metrics endpoint at port 9537 shows connection refused for Windows nodes

    • Bug
    • Resolution: Done-Errata
    • Normal
    • None
    • 4.13
    • Monitoring
    • Moderate
    • No
    • MON Sprint 255, MON Sprint 256
    • 2
    • False
    • Release Note Text:
      * Previously, if a connection on port 9637 for Windows nodes was refused, the Kubelet Service Monitor threw a `target down` alert because CRI-O does not run on Windows nodes. With this release, Windows nodes are excluded from the Kubelet Service Monitor. (link:https://issues.redhat.com/browse/OCPBUGS-31250[*OCPBUGS-31250*])
    • Release Note Not Required
    • In Progress

      Description of problem:

      1. On Linux nodes, the container runtime is CRI-O, and a crio process listens on port 9537. Windows nodes do not run the CRI-O container runtime.
      2. Prometheus tries to connect to the /metrics endpoint on port 9537 of the Windows nodes, where no process is listening.
      3. The TargetDown alert fires for the crio job because Prometheus cannot reach the endpoint http://windows-node-ip:9537/metrics.
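The "connection refused" condition in the description can be checked from any host that has network access to the node. The following is a minimal Python sketch, not part of the original report: the helper name `port_is_open` is hypothetical, and `windows-node-ip` is the same placeholder used in the description above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # create_connection raises OSError (for example ConnectionRefusedError)
        # when nothing is listening on the port or the host is unreachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "windows-node-ip" is a placeholder; replace it with a real node address.
    print(port_is_open("windows-node-ip", 9537))
```

On a Windows node affected by this bug, the call is expected to return False for port 9537 because no crio process is listening there.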

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          

      Steps to Reproduce:

          1. Install a 4.13 cluster with the Windows operator.
          2. In the Prometheus UI, go to Status > Targets to see which targets are down.
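The same information shown on the Targets page is available from the Prometheus `/api/v1/targets` API. The following is a minimal Python sketch, assuming the JSON response has already been fetched; the helper name `down_targets` and the sample payload (including the `192.0.2.10` documentation address and the scrape pool name) are illustrative assumptions, not data from this bug.

```python
def down_targets(targets_response):
    """Return scrape pool, URL and last error for every target whose health is "down"."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "scrapePool": t.get("scrapePool"),
            "scrapeUrl": t.get("scrapeUrl"),
            "lastError": t.get("lastError", ""),
        }
        for t in active
        if t.get("health") == "down"
    ]


# Illustrative payload only: addresses and errors are made up for the example.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapePool": "example-pool",
                "scrapeUrl": "http://192.0.2.10:9537/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "scrapePool": "example-pool",
                "scrapeUrl": "http://192.0.2.11:9537/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

if __name__ == "__main__":
    for target in down_targets(SAMPLE_RESPONSE):
        print(target["scrapeUrl"], target["lastError"])
```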
          

      Actual results:

          The TargetDown alert fires.

      Expected results:

          The TargetDown alert should not fire.

      Additional info:

          


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Moderate: OpenShift Container Platform 4.17.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:3718


            Jayapriya Pai added a comment -

            rhn-support-sar I have cherry-picked the changes to 4.15: https://github.com/openshift/cluster-monitoring-operator/pull/2487

            Santhiya R added a comment - - edited

            Hello janantha@redhat.com 

            Could you please let me know whether there is a plan to backport this fix to OCP 4.15 and earlier versions?


            Junqi Zhao added a comment - - edited

            Tested with 4.17.0-0.nightly-2024-07-07-131215 on a cluster with 2 Windows workers; the Windows workers are excluded from the kubelet ServiceMonitor.

            $ oc get node -o wide
            NAME                                        STATUS   ROLES                  AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
            ip-10-0-0-187.us-east-2.compute.internal    Ready    worker                 3h21m   v1.30.2+421e90e   10.0.0.187    <none>        Red Hat Enterprise Linux CoreOS 417.94.202407050206-0   5.14.0-427.24.1.el9_4.x86_64   cri-o://1.30.3-2.rhaos4.17.git8750e76.el9
            ip-10-0-13-231.us-east-2.compute.internal   Ready    worker                 167m    v1.29.4+6c10b2d   10.0.13.231   <none>        Windows Server 2022 Datacenter                          10.0.20348.2527                containerd://1.7.16
            ip-10-0-24-0.us-east-2.compute.internal     Ready    control-plane,master   3h27m   v1.30.2+421e90e   10.0.24.0     <none>        Red Hat Enterprise Linux CoreOS 417.94.202407050206-0   5.14.0-427.24.1.el9_4.x86_64   cri-o://1.30.3-2.rhaos4.17.git8750e76.el9
            ip-10-0-25-179.us-east-2.compute.internal   Ready    worker                 171m    v1.29.4+6c10b2d   10.0.25.179   <none>        Windows Server 2022 Datacenter                          10.0.20348.2527                containerd://1.7.16
            ip-10-0-35-107.us-east-2.compute.internal   Ready    control-plane,master   3h27m   v1.30.2+421e90e   10.0.35.107   <none>        Red Hat Enterprise Linux CoreOS 417.94.202407050206-0   5.14.0-427.24.1.el9_4.x86_64   cri-o://1.30.3-2.rhaos4.17.git8750e76.el9
            ip-10-0-45-116.us-east-2.compute.internal   Ready    worker                 3h21m   v1.30.2+421e90e   10.0.45.116   <none>        Red Hat Enterprise Linux CoreOS 417.94.202407050206-0   5.14.0-427.24.1.el9_4.x86_64   cri-o://1.30.3-2.rhaos4.17.git8750e76.el9
            ip-10-0-68-104.us-east-2.compute.internal   Ready    control-plane,master   3h27m   v1.30.2+421e90e   10.0.68.104   <none>        Red Hat Enterprise Linux CoreOS 417.94.202407050206-0   5.14.0-427.24.1.el9_4.x86_64   cri-o://1.30.3-2.rhaos4.17.git8750e76.el9
            ip-10-0-70-205.us-east-2.compute.internal   Ready    worker                 3h22m   v1.30.2+421e90e   10.0.70.205   <none>        Red Hat Enterprise Linux CoreOS 417.94.202407050206-0   5.14.0-427.24.1.el9_4.x86_64   cri-o://1.30.3-2.rhaos4.17.git8750e76.el9
            
            $ token=$(oc create token prometheus-k8s -n openshift-monitoring)
            $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/targets' | jq '.data.activeTargets[] | select(.scrapePool=="serviceMonitor/openshift-monitoring/kubelet/3")' | jq '{scrapePool: .scrapePool, scrapeUrl: .scrapeUrl, health: .health}'
            {
              "scrapePool": "serviceMonitor/openshift-monitoring/kubelet/3",
              "scrapeUrl": "https://10.0.0.187:9637/metrics",
              "health": "up"
            }
            {
              "scrapePool": "serviceMonitor/openshift-monitoring/kubelet/3",
              "scrapeUrl": "https://10.0.24.0:9637/metrics",
              "health": "up"
            }
            {
              "scrapePool": "serviceMonitor/openshift-monitoring/kubelet/3",
              "scrapeUrl": "https://10.0.35.107:9637/metrics",
              "health": "up"
            }
            {
              "scrapePool": "serviceMonitor/openshift-monitoring/kubelet/3",
              "scrapeUrl": "https://10.0.45.116:9637/metrics",
              "health": "up"
            }
            {
              "scrapePool": "serviceMonitor/openshift-monitoring/kubelet/3",
              "scrapeUrl": "https://10.0.68.104:9637/metrics",
              "health": "up"
            }
            {
              "scrapePool": "serviceMonitor/openshift-monitoring/kubelet/3",
              "scrapeUrl": "https://10.0.70.205:9637/metrics",
              "health": "up"
            }
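The output above can be cross-checked against the node list mechanically. The following is a minimal Python sketch, using the Windows node IPs and the scrape URLs copied from this comment; the helper name `targets_on_nodes` is hypothetical.

```python
from urllib.parse import urlparse

# INTERNAL-IP values of the two Windows Server 2022 workers from `oc get node -o wide`.
WINDOWS_NODE_IPS = {"10.0.13.231", "10.0.25.179"}

# scrapeUrl values reported for serviceMonitor/openshift-monitoring/kubelet/3.
SCRAPE_URLS = [
    "https://10.0.0.187:9637/metrics",
    "https://10.0.24.0:9637/metrics",
    "https://10.0.35.107:9637/metrics",
    "https://10.0.45.116:9637/metrics",
    "https://10.0.68.104:9637/metrics",
    "https://10.0.70.205:9637/metrics",
]


def targets_on_nodes(scrape_urls, node_ips):
    """Return the scrape URLs whose host is one of the given node IPs."""
    return [url for url in scrape_urls if urlparse(url).hostname in node_ips]


if __name__ == "__main__":
    # An empty list means no Windows node is scraped by this pool.
    print(targets_on_nodes(SCRAPE_URLS, WINDOWS_NODE_IPS))  # prints []
```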
            


            OpenShift Jira Bot added a comment -

            Hi janantha@redhat.com,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.

            Ayoub Mrini added a comment - - edited

            I believe that in a perfect world, this should be handled on the Endpoints/Service side. It’s confusing to me that a kubelet is part of an Endpoints/Service but cannot serve all or any of its ports. I think the windows kubelet should be removed from that Endpoints/Service (or put in different Services/Endpoints or Endpoints subsets...), as this will continue to confuse other users.


            Junqi Zhao added a comment -

            rh-ee-adpawar thanks, will wait for the developer to fix this issue


            Aditya Pawar (Inactive) added a comment -

            Hello juzhao@redhat.com, thanks for sharing the thread and the bug. I think no justification has been provided on the bug. Are we good to consider this bug and work on it? I just want to make sure nothing blocks us from working on it.
            Feel free to update if any logs are needed. Thanks.

            Ranjith Rajaram added a comment - - edited

            Yes, it is not just port 9537: the kubelet service monitor port 10250 is also listed. Sharing a screenshot from the test cluster.

            Will this fix be included in an upcoming release?

             


            Simon Pasquier added a comment -

            For the record, we could keep only the Linux nodes (and nodes without the OS label) when scraping kubelet/crio endpoints by using this snippet for the kubelet service monitor:

            spec:
              attachMetadata:
                node: true
              endpoints:
              - ...
                relabelings:
                - action: keep
                  regex: (linux|)
                  sourceLabels:
                  - __meta_kubernetes_node_label_kubernetes_io_os

            To make it work, the prometheus-k8s service account needs to be granted the list, get, and watch permissions on the Node resource.
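To illustrate why the `(linux|)` regex keeps Linux nodes and nodes without the OS label while dropping Windows nodes, here is a minimal Python sketch of the `keep` semantics. It is an approximation for illustration only: Prometheus uses RE2 and fully anchors relabeling regexes, which `re.fullmatch` imitates here, and a missing source label is treated as an empty string. The helper name `keep_target` is hypothetical.

```python
import re

OS_LABEL = "__meta_kubernetes_node_label_kubernetes_io_os"

# Same pattern as in the snippet above; matches "linux" or the empty string.
KEEP_REGEX = re.compile(r"(linux|)")


def keep_target(labels):
    """Imitate the `keep` action: retain the target only if the label value fully matches."""
    value = labels.get(OS_LABEL, "")  # an absent label behaves like an empty value
    return KEEP_REGEX.fullmatch(value) is not None


if __name__ == "__main__":
    print(keep_target({OS_LABEL: "linux"}))    # Linux node: kept
    print(keep_target({}))                     # no OS label: kept
    print(keep_target({OS_LABEL: "windows"}))  # Windows node: dropped
```

The empty alternative in the regex is what lets nodes without the `kubernetes.io/os` label through; without it, only nodes explicitly labeled `linux` would be scraped.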

              janantha@redhat.com Jayapriya Pai
              rh-ee-adpawar Aditya Pawar (Inactive)
              Junqi Zhao
              Eliska Romanova
              Simon Pasquier
              Votes: 0
              Watchers: 12