I have seen in a number of clusters that the DeleteExpiredMetrics job is causing OutOfMemoryErrors. I first observed this in larger clusters with more than 10k pods, and later in moderately sized clusters with around 3k pods as well. There does not appear to be a memory leak. Rather, the queries that the job executes to fetch metric definitions pull in too much data at once, which leads to heavy GC pressure, large CPU spikes, and eventually an OOME.
The solution I am proposing for this ticket is to remove the job altogether. The job was introduced in HWKMETRICS-613. We do not have APIs for deleting metrics, which is fine for the data table because everything that goes into the data table has a TTL. Rows inserted into the index tables, however, do not have TTLs. This has led to problems in OpenShift with very large partitions.
The most important objective of DeleteExpiredMetrics was to prevent unbounded growth of our index tables. Even aside from the OOMEs, the job has never accomplished that objective because there is a bug in some of the date arithmetic that calculates the timestamp when inserting rows into the metrics_expiration_idx table. Rows are inserted with timestamps that are 1,655 years in the future, which is how long it would be before the job considers a metric eligible for deletion.
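As a rough illustration of how date arithmetic can go wrong by this much, here is a minimal sketch. It rests on an assumption, not a confirmed diagnosis: an offset of about 1,655 years is what you get if a 7-day retention expressed in seconds (604,800) is added to the current time as a number of days. The function names, the 7-day TTL, and the reference date are all hypothetical and are not taken from the job's code.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def expiration_buggy(now, ttl_seconds):
    # Hypothetical unit mix-up: the TTL is in seconds but is added as days.
    return now + timedelta(days=ttl_seconds)


def expiration_fixed(now, ttl_seconds):
    # Same TTL, added in the unit it is actually expressed in.
    return now + timedelta(seconds=ttl_seconds)


# Arbitrary reference time and an assumed 7-day retention.
now = datetime(2017, 1, 1, tzinfo=timezone.utc)
ttl_seconds = 7 * SECONDS_PER_DAY  # 604800

buggy = expiration_buggy(now, ttl_seconds)
fixed = expiration_fixed(now, ttl_seconds)

# How far in the future the mistaken timestamp lands, in years.
years_ahead = (buggy - now).days / 365.25

print(f"fixed expiration: {fixed.isoformat()}")
print(f"buggy expiration: {buggy.isoformat()}")
print(f"buggy offset in whole years: {int(years_ahead)}")
```

Whatever the actual cause, the effect is the same: no row in metrics_expiration_idx becomes eligible for deletion within any realistic lifetime of the cluster.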
I would rather do away with the job and implement a solution that uses the Kubernetes watch APIs. That will be done later in a separate ticket.