OpenShift Hosted Control Plane / HOSTEDCP-947

Hosted etcd running out of space on PVC after scale test

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Normal
    • Sprint: Hypershift Sprint 236

    Description

      After running several scale tests on a large cluster (252 workers), etcd ran out of space and became unavailable.

       

      These tests consisted of repeatedly running our node-density workload (which creates more than 50k pause pods) and cluster-density 4k (which creates 4k namespaces populated as described in https://github.com/cloud-bulldozer/e2e-benchmarking/tree/master/workloads/kube-burner#cluster-density-variables).

       

      These runs led the etcd peers to run out of free space in their 4GiB PVCs, producing the following error trace:

      {"level":"warn","ts":"2023-03-31T09:50:57.532Z","caller":"rafthttp/http.go:271","msg":"failed to save incoming database snapshot","local-member-id":"b14198cd7f0eebf1","remote-snapshot-sender-id":"a4e894c3f4af1379","incoming-snapshot-index ":19490191,"error":"write /var/lib/data/member/snap/tmp774311312: no space left on device"} 
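      The trace shows a follower failing while writing an incoming snapshot to a temporary file under member/snap. A rough capacity check helps explain why a 4GiB volume can fill up at that moment. This is a minimal sketch, assuming the follower's existing database file and the incoming snapshot (roughly the size of the leader's database) must coexist on the volume alongside other data such as the WAL; the helper name snapshotFits and all sizes are hypothetical, not measurements from this issue.

```go
package main

import "fmt"

const GiB int64 = 1 << 30

// snapshotFits is a hypothetical helper. It assumes the follower's current
// database file and the incoming snapshot (written to a temp file under
// member/snap) must coexist on the volume, together with other data such
// as the WAL.
func snapshotFits(pvcBytes, dbFileBytes, snapshotBytes, otherBytes int64) bool {
	return dbFileBytes+snapshotBytes+otherBytes <= pvcBytes
}

func main() {
	// Illustrative sizes only; they are not measurements from this issue.
	pvc := 4 * GiB
	dbFile := 2*GiB + GiB/2 // on-disk DB file, including fragmented space
	snapshot := 2 * GiB     // incoming snapshot from the leader
	other := int64(256 << 20)

	fmt.Printf("needed: %.2f GiB of %.2f GiB, fits: %v\n",
		float64(dbFile+snapshot+other)/float64(GiB),
		float64(pvc)/float64(GiB),
		snapshotFits(pvc, dbFile, snapshot, other))
}
```

      Under these assumptions, a database that would fit comfortably on its own can still exhaust the PVC once a snapshot transfer temporarily doubles the space it needs.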

       

      Etcd stores its data on 4GiB PVCs, which appears to be insufficient for this scenario. In addition, unlike non-HyperShift clusters, where cluster-etcd-operator periodically defragments the database, we are not applying any periodic defragmentation to hosted etcd, which can lead to a larger on-disk database size.

       

      The graph attached to this issue plots the metrics etcd_mvcc_db_total_size_in_bytes and etcd_mvcc_db_total_size_in_use_in_bytes.
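      The gap between these two metrics is the fragmented space that a defragmentation could reclaim. The sketch below shows one way a periodic defragmentation check could use them, in the spirit of what cluster-etcd-operator does for non-HyperShift clusters. The helper names and both thresholds (a 100MiB minimum database size and 45% fragmentation) are assumptions for illustration; they are not taken from this issue or confirmed against the operator's source.

```go
package main

import "fmt"

// Hypothetical thresholds, for illustration only.
const (
	minDefragBytes          int64   = 100 << 20 // skip small databases
	maxFragmentedPercentage float64 = 45.0
)

// fragmentedPercentage returns the share of the DB file that is not in use,
// given etcd_mvcc_db_total_size_in_bytes (total) and
// etcd_mvcc_db_total_size_in_use_in_bytes (inUse).
func fragmentedPercentage(total, inUse int64) float64 {
	if total <= 0 {
		return 0
	}
	return float64(total-inUse) / float64(total) * 100
}

// shouldDefrag is a hypothetical decision helper: defragment only databases
// that are both large enough and fragmented enough.
func shouldDefrag(total, inUse int64) bool {
	return total >= minDefragBytes &&
		fragmentedPercentage(total, inUse) >= maxFragmentedPercentage
}

func main() {
	const GiB int64 = 1 << 30
	// Illustrative values only, not readings from the attached graph.
	total, inUse := 3*GiB, 1*GiB
	fmt.Printf("fragmented: %.1f%%, defrag: %v\n",
		fragmentedPercentage(total, inUse), shouldDefrag(total, inUse))
}
```

      In a real controller the two values would come from each member's metrics endpoint (or the etcd Maintenance Status API), and members would be defragmented one at a time so the cluster keeps quorum.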

       

       

          People

            jparrill@redhat.com Juan Manuel Parrilla Madrid
            rsevilla@redhat.com Raul Sevilla Canavate
            He Liu He Liu