Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.15, 4.16
Component/s: Machine Config Operator
Labels:
- mco-triaged

Test Coverage:
Severity: Important
Regression: No
Sprint: MCO Sprint 254
sprint_count: 1
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
RH Private Keywords:
Escape Reason:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:


When we configure a cloudCA certificate, it takes 10 to 15 minutes for the file to be written to the nodes.

We have seen this behaviour in clusters with no enabled capabilities (for example: periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-aws-upi-baselinecaps-none-amd-f28-destructive)

$ oc get clusterversion -o yaml 
....
    capabilities:
      enabledCapabilities:
      - CloudCredential
      knownCapabilities:
      - Build
      - CSISnapshot
      - CloudCredential
      - Console
      - DeploymentConfig
      - ImageRegistry
      - Insights
      - MachineAPI
      - NodeTuning
      - OperatorLifecycleManager
      - Storage
      - baremetal
      - marketplace
      - openshift-samples

Version-Release number of selected component (if applicable):

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-02-09-073541   True        False         23m     Cluster version is 4.16.0-0.nightly-2024-02-09-073541

How reproducible:

Very often. Occasionally the cloudCA certificate is written promptly, but this is rare. When that happens, manually removing the file from the nodes reproduces the issue.

Steps to Reproduce:

    1. Install a cluster with no capabilities

We have seen this behaviour in prow jobs:
periodic-ci-openshift-openshift-tests-private-release-4.16-multi-nightly-aws-upi-baselinecaps-none-amd-f28-destructive

We have also seen it in flexy-install clusters installed with these options:

TEMPLATE: private-templates/functionality-testing/aos-4_16/upi-on-gcp/versioned-installer

LAUNCHER_VARS:
installer_payload_image:  registry.ci.openshift.org/ocp/release:4.16.0-0.nightly-2024-02-09-073541
baselineCapabilitySet: None
additionalEnabledCapabilities: ["CloudCredential"]
disable_worker_machineset: "yes"
launch_extra_worker_num: 3


    2. Add a cloudCA certificate to the cluster

$ openssl genrsa -out privateKey.pem 4096
$ openssl req -new -x509 -nodes -days 3600 -key privateKey.pem -out ca-bundle.crt -subj "/OU=MCO qe/CN=example.com"
$ oc set data -n openshift-config ConfigMap cloud-provider-config  --from-file=ca-bundle.pem=ca-bundle.crt


    3. Wait for the certificate to be written to the nodes

$  oc debug -q  node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem"
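To put a number on the delay, the check in step 3 can be wrapped in a polling loop. The following is a minimal sketch, not part of the original reproduction: `wait_for` and `ca_bundle_present` are hypothetical helper names, and the timeout and interval values are arbitrary assumptions. It reuses the same `oc debug` command and path as above and treats non-empty output from `cat` as "file present".

```shell
# Poll a check command until it succeeds, then print the elapsed seconds.
# Usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# (hypothetical helper, not part of oc or the MCO)
wait_for() {
  wf_timeout=$1
  wf_interval=$2
  shift 2
  wf_start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    wf_elapsed=$(( $(date +%s) - wf_start ))
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      echo "timed out after ${wf_elapsed}s" >&2
      return 1
    fi
    sleep "$wf_interval"
  done
  # Report how long it took for the check to pass.
  echo $(( $(date +%s) - wf_start ))
}

# Hypothetical check: succeeds once the CA bundle on the first worker node
# is readable and non-empty (same command and path as in step 3).
ca_bundle_present() {
  cb_node=$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") || return 1
  [ -n "$(oc debug -q "node/${cb_node}" -- chroot /host cat "/etc/kubernetes/static-pod-resources/configmaps/cloud-config/ca-bundle.pem" 2>/dev/null)" ]
}
```

Running `wait_for 1200 30 ca_bundle_present` right after step 2 would print the number of seconds until the file appears, or fail after 20 minutes. Each `oc debug` call adds some overhead of its own, so the result is only an approximation.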

Actual results:

It takes 10 to 15 minutes for the file to be written to the nodes.

Expected results:

10 to 15 minutes is too long to sync the ControllerConfig and write the files to the nodes. The file should be created sooner.

Additional info:


If we increase the verbosity of the logs, we can see this message in the MCDs:

I0208 16:47:50.760738   61728 certificate_writer.go:73] Error syncing ControllerConfig machine-config-controller (retries 0): open /etc/docker/certs.d: no such file or directory
I0208 16:47:50.760752   61728 daemon.go:2186] Updating Node ip-10-0-51-14.ec2.internal
I0208 16:47:50.765933   61728 certificate_writer.go:79] Started syncing ControllerConfig "machine-config-controller" (2024-02-08 16:47:50.765924397 +0000 UTC m=+60.414594060)
I0208 16:47:50.768956   61728 certificate_writer.go:81] Finished syncing ControllerConfig "machine-config-controller" (3.021865ms)
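To spot these failures across a longer log, the relevant lines can be filtered out. This is a rough sketch that assumes the log lines keep the exact shape shown above; `controllerconfig_sync_errors` is a hypothetical helper name, not an existing tool.

```shell
# Read MCD log lines on stdin and print "<timestamp> <error text>" for every
# failed ControllerConfig sync reported by certificate_writer.go.
# Lines that do not match (for example "Finished syncing") are dropped.
controllerconfig_sync_errors() {
  sed -n 's/^I[0-9]* \([0-9:.]*\) .*certificate_writer\.go:[0-9]*] Error syncing ControllerConfig [^:]*: \(.*\)$/\1 \2/p'
}
```

For the first log line above, this prints `16:47:50.760738 open /etc/docker/certs.d: no such file or directory`. It could be fed from something like `oc logs -n openshift-machine-config-operator <mcd-pod> -c machine-config-daemon`, where the namespace and container name are assumed to be the usual ones for the machine-config-daemon and `<mcd-pod>` is a placeholder.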


It is likely related to https://issues.redhat.com/browse/OCPBUGS-20152 and will likely be fixed when OCPBUGS-20152 is fixed.

Nevertheless, we need to verify that this is the case before closing this issue.

is related to
OCPBUGS-20152 Nodes being marked degraded due to /etc/docker/certs.d not being found (Closed)
OCPBUGS-33418 Investigate timing issues in machine-config-controller (Closed)

relates to
OCPBUGS-33412 Nodes being marked degraded due to /etc/docker/certs.d not being found (Closed)

Assignee: Urvashi Mohnani
Reporter: Sergio Regidor de la Rosa
QA Contact: Sergio Regidor de la Rosa
Votes: 0
Watchers: 6
Created: 2024/02/09 10:33 AM
Updated: 2024/12/13 4:43 AM
Resolved: 2024/05/22 5:25 PM
