  1. OpenShift Bugs
  2. OCPBUGS-13020

Node reboots in 4.14 take twice as long to complete because promtail needs to be SIGKILL'ed

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • 4.14
    • Test Framework
    • Quality / Stability / Reliability
    • False

      Description of problem:

      In MCO CI, our e2e-gcp-op job has been blocked for about a week: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
      
      Looking into the individual tests, each one now takes 2x to 2.5x as long to run.
      Successful runs from before:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3441/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1649034037110509568/artifacts/e2e-gcp-op/test/build-log.txt
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3598/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1648916567469068288/artifacts/e2e-gcp-op/test/build-log.txt
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3682/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1648771501777752064/artifacts/e2e-gcp-op/test/build-log.txt
      
      Failing runs:
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3663/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1652001767627427840/artifacts/e2e-gcp-op/test/build-log.txt
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3505/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1651718542346686464/artifacts/e2e-gcp-op/test/build-log.txt
      https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3676/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/1652000289886048256/artifacts/e2e-gcp-op/test/build-log.txt
      
      Note that each test still passes but now takes 2x to 2.5x as long, so the job as a whole hits its global timeout of 2h30m.
      
      I brought up a cluster on the latest successful nightly, 4.14.0-0.nightly-2023-05-01-124309, and observed a test run.

      During node shutdown, the promtail container in the pod openshift-e2e-loki/loki-promtail-xxx does not stop before systemd's stop timeout and has to be SIGKILL'ed, which takes the reboot from ~2 minutes to ~4 minutes. I hope Test Framework is the right component for this pod?
      
      Log snippet from journalctl:
      May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Stopping timed out. Killing.
      May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Killing process 6196 (promtail) with signal SIGKILL.
      May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb conmon[6109]: conmon add09cca3149cd8d19fe <ninfo>: container 6196 exited with status 137
      May 02 16:30:42 ci-ln-9pxrvct-72292-wmpkf-worker-b-cw2lb systemd[1]: crio-add09cca3149cd8d19feb07b09f4d9a7b6c19bda976d500b707052d935eb215b.scope: Failed with result 'timeout'.
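
To check other nodes or other runs for the same symptom, the journal can be scanned for the systemd "Killing process ... with signal SIGKILL" message shown above. The following is a minimal sketch, not part of the original report: the regular expression is derived only from the log lines in this snippet, and the names `KILL_RE`, `ForcedKill`, and `find_forced_kills` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumption: the message format matches the journal snippet above, e.g.
# "crio-<id>.scope: Killing process 6196 (promtail) with signal SIGKILL."
KILL_RE = re.compile(
    r"(?P<unit>crio-[0-9a-f]+\.scope): "
    r"Killing process (?P<pid>\d+) \((?P<comm>[^)]+)\) with signal SIGKILL\."
)


class ForcedKill(NamedTuple):
    unit: str  # systemd scope of the container
    pid: int   # process that had to be killed
    comm: str  # process name as reported by systemd


def find_forced_kills(lines: Iterable[str]) -> List[ForcedKill]:
    """Return one record per journal line where systemd escalated to SIGKILL."""
    kills = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills.append(
                ForcedKill(match["unit"], int(match["pid"]), match["comm"])
            )
    return kills
```

One way to use it would be to save the previous boot's journal (for example with `journalctl -b -1 --no-pager`) to a file and pass the open file object to `find_forced_kills`; any record whose `comm` is `promtail` indicates the node hit this issue.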
      
      I captured a sosreport on a node after the reboot:
      https://drive.google.com/file/d/1cu7s87EniEDbGosKfp1U5k2ztf2T_S4Y/view?usp=share_link

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      Always

      Steps to Reproduce:

      1. Spin up a cluster on a 4.14 nightly from the most recent week or so
      2. Run a MachineConfig update
      3. Watch the journal logs on a node while it reboots
      

      Actual results:

      promtail is SIGKILL'ed

      Expected results:

      promtail does not require SIGKILL'ing to terminate

      Additional info:

      Marked critical since this is blocking our CI. Please move to the correct component if this isn't the right place.

              rhn-engineering-dgoodwin Devan Goodwin
              jerzhang@redhat.com Yu Qi Zhang