Red Hat OpenStack Services on OpenShift / OSPRH-25715

Tune worker processes number inside the pod service depending on the scale


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: cinder-operator
    • Severity: Important

      Steps to reproduce the behavior:

      This use case was shared by the customer:
      We have encountered multiple times a situation, where compute expensive api calls are taking service down. This happened for cinder-api, nova-api and nova-metadata.
       
      What we observed is that in this case pod scaling does not help: we were still facing service deterioration with 21 replicas of nova-api, which is probably not the best approach anyway.
       
      Most, if not all, of the services run with 2 WSGI processes and 1 API worker (snippets from the nova operator below):
       
        ## In general we want nova-api to scale via k8s replicas but we need
  ## two processes per replica to always have room for a healthcheck query
        WSGIDaemonProcess {{ $endpt }} display-name={{ $endpt }} processes=2 threads=1 user=nova group=nova
       
      {{ if eq .service_name "nova-api" }}
      # scaling should be done by running more pods
      osapi_compute_workers=1
       
      When the single worker is busy handling a query it can fail the health check, which we have also observed. So not only is the service responding slowly because of the load, but k8s is also killing the pods, making the situation even worse.
       
      Based on these observations we believe these settings should be adjustable for bigger environments such as ours. I don't think that running more than 10 replicas is a valid pattern, especially when we end up consuming more resources instead of doing more work in each pod.
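       
      For illustration only, the kind of tuning being requested might look like the snippet below. The values are hypothetical and would have to be sized to the pod's CPU allocation; in particular, the WSGIDaemonProcess processes value is currently templated by the operator rather than user-tunable.
       
        ## Illustrative sketch, not the operator's current template:
        ## size the WSGI processes to the pod's CPU request instead of the hard-coded 2
        WSGIDaemonProcess {{ $endpt }} display-name={{ $endpt }} processes=4 threads=1 user=nova group=nova
       
        # illustrative: raise the worker count in step with the WSGI processes
        osapi_compute_workers=4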
      Expected behavior

      • Appropriate configuration of worker processes to absorb the load in large-scale deployments.
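       
      As a possible interim workaround (a sketch only, not a verified fix), worker options that are read from the service configuration file, such as cinder's osapi_volume_workers, could in principle be raised through the service's customServiceConfig in the OpenStackControlPlane CR. Whether this is sufficient while the WSGI processes value stays templated at 2 would need to be confirmed.
       
        # hypothetical override passed via customServiceConfig for cinder-api;
        # the value should be sized to the pod's CPU allocation
        [DEFAULT]
        osapi_volume_workers=4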

      Device Info:

      • RHOSO 18.0.7
      • Distributed Zones architecture, with 3 AZs, 3 Cells
      • 900+ Compute nodes
      • Stretched control plane

      Bug impact

      • With a large number of requests, the service slows down due to the load and k8s kills the pods.

              Assignee: Francesco Pantano (fpantano@redhat.com)
              Reporter: Soumaya Msallem (smsallem@redhat.com)
              Team: rhos-storage-integration