We have had problems with the POST /metrics/metrics/stats/query endpoint for as long as it has been used in OpenShift. Yesterday I was investigating an OCP 3.6 cluster, and I saw this in the logs:
For reference, here are a couple of examples of the tags parameter in these requests:
I think that the cluster may have been a decent size with more than 5k pods, but this particular project only had 17 pods. I dug a bit deeper and found that the tag queries that were executed returned result sets with more than 300k rows. With the default page size of 1000 for the Cassandra driver, we are looking at 300+ round trips to and from Cassandra for each tag query.
We have virtually no visibility into the result set sizes we are dealing with, short of manual inspection like I did.
I want to log a DEBUG message for each tag query that includes the tag's key and value(s) and the result set size. We can establish a threshold, expressed as a multiple of the configured page size for the Cassandra driver, above which a similar message is logged at INFO or WARN instead. For example, if the page size is 1000 and the threshold is 10, then any result set with more than 10,000 rows should trigger the INFO or WARN message; a result set with 300,000 rows is well past that.
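To make the proposal concrete, here is a rough sketch of how the log level could be chosen. It is not taken from the existing code: the class name `TagQueryLogging`, its methods, and the message format are all hypothetical, and it assumes the threshold is counted in driver pages and that WARN (rather than INFO) is used above it.

```java
// Hypothetical helper, not part of the existing code base.
final class TagQueryLogging {

    // Assumption: only two levels are needed; WARN stands in for "INFO or WARN".
    enum Level { DEBUG, WARN }

    // Number of driver pages (round trips) needed to fetch the given row count.
    static long pages(long rows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (rows + pageSize - 1) / pageSize; // ceiling division
    }

    // DEBUG by default; WARN once the result set spans more pages than the threshold.
    static Level levelFor(long rows, int pageSize, int thresholdPages) {
        return pages(rows, pageSize) > thresholdPages ? Level.WARN : Level.DEBUG;
    }

    // Message carrying the tag's key, its value(s), and the result set size.
    static String message(String tagKey, String tagValues, long rows, int pageSize) {
        return String.format("Tag query %s=%s returned %d rows (%d pages of %d)",
                tagKey, tagValues, rows, pages(rows, pageSize), pageSize);
    }
}
```

With a page size of 1000 and a threshold of 10, `levelFor(300_000, 1000, 10)` returns `WARN` because the result set spans 300 pages, while a 17-row result set stays at `DEBUG`.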