  1. OpenShift Bugs
  2. OCPBUGS-8900

OCP upgrade from 4.6 to 4.7.9 breaks metering operator


    • Quality / Stability / Reliability
    • Important
    • All
    • Rejected

      Description of problem:

      Upgrading OCP from 4.6 to 4.7.9 breaks the metering operator, and the following errors appear in the logs:

      ~~~
      time="2021-06-18T04:00:04Z" level=error msg="error syncing ReportDataSource \"openshift-metering/pod-limit-cpu-cores\", adding back to queue" ReportDataSource=openshift-metering/pod-limit-cpu-cores app=metering component=reportDataSourceWorker error="ImportFromLastTimestamp errored: failed to store Prometheus metrics into table hive.metering.datasource_openshift_metering_pod_limit_cpu_cores for the range 2021-06-14 06:06:00 +0000 UTC to 2021-06-14 06:11:00 +0000 UTC: failed to store metrics into presto: presto SQL error: presto: query failed (200 OK): \"io.prestosql.spi.PrestoException: Error moving data files from file:/tmp/presto-reporting-operator/6b9c7de4-9173-484e-a773-5b06ac984b6e/dt=2021-06-14/20210618_040000_00106_59kig_0fad00d3-9a2f-4168-a308-5e9c0890cf2b to final location file:/user/hive/warehouse/metering.db/datasource_openshift_metering_pod_limit_cpu_cores/dt=2021-06-14/20210618_040000_00106_59kig_0fad00d3-9a2f-4168-a308-5e9c0890cf2b\"" logID=ArDObTmQql

      time="2021-06-18T04:00:04Z" level=info msg="syncing ReportDataSource openshift-metering/pod-persistentvolumeclaim-request-info" app=metering component=reportDataSourceWorker logID=7XZEGZsiv6
      time="2021-06-18T04:00:04Z" level=info msg="existing Prometheus ReportDataSource discovered, tableName: hive.metering.datasource_openshift_metering_pod_persistentvolumeclaim_request_info" app=metering component=reportDataSourceWorker logID=7XZEGZsiv6 namespace=openshift-metering reportDataSource=pod-persistentvolumeclaim-request-info
      time="2021-06-18T04:00:04Z" level=warning msg="time range 2021-06-17 06:27:00 +0000 UTC to 2021-06-18 04:00:04.821440675 +0000 UTC exceeds PrometheusImporter MaxQueryRangeDuration 10m0s, newEndTime: 2021-06-17 06:37:00 +0000 UTC" app=metering chunkSize=5m0s component=PrometheusImporter logID=riVZjeU3EG namespace=openshift-metering reportDataSource=pod-persistentvolumeclaim-request-info stepSize=1m0s tableName=hive.metering.datasource_openshift_metering_pod_persistentvolumeclaim_request_info

      time="2021-06-18T04:00:04Z" level=info msg="Event(v1.ObjectReference{Kind:\"ReportDataSource\", Namespace:\"openshift-metering\", Name:\"pod-limit-cpu-cores\", UID:\"480328ef-71c2-493d-8544-48d414a6a04b\", APIVersion:\"metering.openshift.io/v1\", ResourceVersion:\"642114040\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedPrometheusQuery' Unable to import metrics after Prometheus query failure. Check the reporting-operator container logs for more information." app=metering
      time="2021-06-18T04:00:06Z" level=info msg="stored a total of 138 metrics for data between 2021-06-17 06:27:00 +0000 UTC and 2021-06-17 06:32:00 +0000 UTC into hive.metering.datasource_openshift_metering_pod_persistentvolumeclaim_request_info" app=metering chunkSize=5m0s component=PrometheusImporter endTime="2021-06-17 06:32:00 +0000 UTC" logID=riVZjeU3EG namespace=openshift-metering reportDataSource=pod-persistentvolumeclaim-request-info startTime="2021-06-17 06:27:00 +0000 UTC" stepSize=1m0s tableName=hive.metering.datasource_openshift_metering_pod_persistentvolumeclaim_request_info
      ~~~
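
For triage, it may help to see which ReportDataSources are hitting errors and how often. The following is a minimal sketch, assuming the reporting-operator log lines are in the key=value format shown above. The helper names `parse_logfmt` and `failing_datasources` are illustrative only and are not part of the metering operator.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is either a bare token or a
# double-quoted string that may contain backslash-escaped quotes.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            # Drop the surrounding quotes and unescape embedded ones.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def failing_datasources(lines):
    """Count level=error lines per ReportDataSource field."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "error" and "ReportDataSource" in fields:
            counts[fields["ReportDataSource"]] += 1
    return counts


# Shortened copies of two log lines from this report (the long error=
# field of the first line is omitted here for brevity).
SAMPLE_ERROR = (
    r'time="2021-06-18T04:00:04Z" level=error '
    r'msg="error syncing ReportDataSource \"openshift-metering/pod-limit-cpu-cores\", adding back to queue" '
    r'ReportDataSource=openshift-metering/pod-limit-cpu-cores app=metering'
)
SAMPLE_INFO = (
    r'time="2021-06-18T04:00:04Z" level=info '
    r'msg="syncing ReportDataSource openshift-metering/pod-persistentvolumeclaim-request-info" '
    r'app=metering component=reportDataSourceWorker logID=7XZEGZsiv6'
)

if __name__ == "__main__":
    for name, count in failing_datasources([SAMPLE_ERROR, SAMPLE_INFO]).items():
        print(name, count)
```

Feeding the full log through `failing_datasources` would show whether the "Error moving data files" failure is limited to a few datasources or affects all of them.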

      ~~~
      reporting-operator-6f88d997c8-q5n54.log:time="2021-06-18T03:58:14Z" level=warning msg="Prometheus metrics import backlog detected: imported data for Prometheus ReportDataSource pod-persistentvolumeclaim-request-info newest imported metric timestamp 2021-06-17 06:26:00 +0000 UTC is 21h32m14.244430636s away, queuing to reprocess in 5.761902623s" app=metering component=reportDataSourceWorker logID=lVEHrglafs namespace=openshift-metering reportDataSource=pod-persistentvolumeclaim-request-info
      reporting-operator-6f88d997c8-q5n54.log:time="2021-06-18T03:58:20Z" level=warning msg="Prometheus metrics import backlog detected: imported data for Prometheus ReportDataSource persistentvolumeclaim-capacity-bytes newest imported metric timestamp 2021-06-17 05:10:00 +0000 UTC is 22h48m20.269181077s away, queuing to reprocess in 8.152080853s" app=metering component=reportDataSourceWorker logID=b9a1ELsB3f namespace=openshift-metering reportDataSource=persistentvolumeclaim-capacity-bytes
      reporting-operator-6f88d997c8-q5n54.log:time="2021-06-18T04:00:06Z" level=warning msg="Prometheus metrics import backlog detected: imported data for Prometheus ReportDataSource pod-persistentvolumeclaim-request-info newest imported metric timestamp 2021-06-17 06:32:00 +0000 UTC is 21h28m6.105762965s away, queuing to reprocess in 12.163683749s" app=metering component=reportDataSourceWorker logID=7XZEGZsiv6 namespace=openshift-metering reportDataSource=pod-persistentvolumeclaim-request-info
      reporting-operator-6f88d997c8-q5n54.log:time="2021-06-18T04:00:07Z" level=warning msg="Prometheus metrics import backlog detected: imported data for Prometheus ReportDataSource persistentvolumeclaim-phase newest imported metric timestamp 2021-06-17 08:57:00 +0000 UTC is 19h3m7.570262894s away, queuing to reprocess in 5.128259091s" app=metering component=reportDataSourceWorker logID=Z2rNCzQNrX namespace=openshift-metering reportDataSource=persistentvolumeclaim-phase
      ~~~
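
The backlog warnings above carry the import lag as a Go-style duration (for example `21h32m14.244430636s`). As a rough sketch, the lag per datasource could be extracted as follows. The function names are hypothetical, and the duration parser handles only the hours/minutes/seconds form that appears in this excerpt.

```python
import re
from datetime import timedelta

# Pulls the datasource name and the lag out of a
# "Prometheus metrics import backlog detected" warning message.
BACKLOG = re.compile(
    r"ReportDataSource (?P<name>\S+) newest imported metric timestamp .*? "
    r"is (?P<lag>\S+) away"
)

# Only the h/m/s form seen in the excerpt (e.g. 19h3m7.570262894s);
# other Go duration units such as ms are rejected on purpose.
GO_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(\d+(?:\.\d+)?)s")


def parse_go_duration(text):
    """Convert a duration such as '21h32m14.2s' into a timedelta."""
    match = GO_DURATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return timedelta(
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds),
    )


def backlog_by_datasource(lines):
    """Return the most recently logged lag for each datasource."""
    latest = {}
    for line in lines:
        match = BACKLOG.search(line)
        if match:
            latest[match["name"]] = parse_go_duration(match["lag"])
    return latest


# Message text copied from two of the warnings in this report.
SAMPLE_LINES = [
    'msg="Prometheus metrics import backlog detected: imported data for '
    'Prometheus ReportDataSource persistentvolumeclaim-capacity-bytes newest '
    'imported metric timestamp 2021-06-17 05:10:00 +0000 UTC is '
    '22h48m20.269181077s away, queuing to reprocess in 8.152080853s"',
    'msg="Prometheus metrics import backlog detected: imported data for '
    'Prometheus ReportDataSource persistentvolumeclaim-phase newest '
    'imported metric timestamp 2021-06-17 08:57:00 +0000 UTC is '
    '19h3m7.570262894s away, queuing to reprocess in 5.128259091s"',
]

if __name__ == "__main__":
    for name, lag in backlog_by_datasource(SAMPLE_LINES).items():
        print(f"{name}: {lag}")
```

Tracking these values over time would indicate whether the importer is catching up or whether the lag keeps growing after the upgrade.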

      Since the OCP upgrade started, reports that rely on the metering operator no longer work and state "data is unavailable for the specific period". A one-off report might work when the date is changed to the current date, but not for all the datasources.

      The same issue has been observed in 3 different clusters immediately after the upgrade.

      The only way to fix the issue is to reinstall the operator with a clean PV/PVC, because reinstalling with the existing PV/PVC fails with a complaint that the metering database already exists.

      All the reports fail with "ReportingPeriodUnmetDependencies". This is seen only after the OCP upgrade is initiated, and the data availability window also lines up with the OCP upgrade start time.

      Version-Release number of selected component (if applicable):

      metering-operator.4.7.0-202104250659.p0
      OCP 4.7.9

              openshift_jira_bot OpenShift Jira Bot
              openshift_jira_bot OpenShift Jira Bot
              Peter Ruan Peter Ruan
              Red Hat Employee