Uploaded image for project: 'RHEL'
  1. RHEL
  2. RHEL-1386

cleanup/prune stale data in /var/lib/NetworkManager (e.g. DHCP lease files)

    • sst_network_management
    • ssg_networking
    • False
    • Hide

      None

      Show
      None
    • If docs needed, set a value

      /var/lib/NetworkManager contains data, that does not get garbage collected. That is a problem, in particular when lots of new profiles/devices get created, and we leak DHCP lease files. etc.

      O) in the past, /var/lib/NetworkManager/seen-bssids and /var/lib/NetworkManager/bssids had similar problems. This was solved by [1]. Maybe get inspired. But note that there the leaking happened inside one file. This bug is mostly about leaking while files files.

      [1] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/26f38b9ffafeab145150332f8a11eb94559c31bf

      I) the lease files "/var/lib/NetworkManager/*.lease". We write these files per-connection and device, so they can pile up. We need to regularly clean files that:

      1) refer to a connection UUID which no longer exists

      2) it's harder to clean them up for unused devices. But maybe not too hard, and we can differentiate between the cases based on the `connection.type` of the profile:

      2a) if the profile specifies "connection.interface-name", then we can delete all leases that are for different interface-names. That's easy and correct.

      2b) otherwise, if the lease is for a hardware device (ethernet), then we can delete it if the interface currently does not exist. That is problematic, because if the user unplugs the USB device, we would delete the lease. 4) will mitigate that problem.

      2c) otherwise, see the timebase cleanup below (3).

      3) leases have a timestamp when we received them. Well, actually we no longer store the timestamp in the lease itself, but we could interpret the file timestamp for that. We also know the lease duration – although, again, we don't store that in the "intern*.lease" files. Anyway, it means, maybe we could also prune files that are older than a certain time. This would require the system clock accurate, which may not be the case. So only do this step, if we have too many files (2000?) that are still in use.

      4) together with the previous rules, we can also keep an excess of e.g. 100 extra files. The goal is to not have an unlimited amount of stale files, but we are fine to keep a few (100). So, from all the candidates above that we would want to delete, sort them by the file timestamp, and keep the 100 least recently used. This should cover the case where you just "temporarily" deleted a profile or unplugged a device, but when you replug it, the file is still there... unless, you have too many of such files.

      This means the timestamp of the lease files is important. Maybe make sure we touch them, when we activate/deactivate a profile.

      Some of the suggestions approach seem unsafe and we might delete still useful leases. Note that with the internal DHCP client, we almost don't use the lease file for anything. Except for the "requested-address" (see [2]). For dhclient plugin that might be different, but in general, it seems the lease file is not very important to us. And we might be able to delete them more aggressively.

      [2] https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/ae0cc9618c49bb74bbe54a073dc337e9a3b0005b/src/core/dhcp/nm-dhcp-nettools.c#L1102

      II) we usually write files with `g_file_set_contents()` or `nm_utils_file_set_contents()`. This creates a temporary file (like `mktemp ".XXXXXX"`) and does an atomit replace. If we crash in the middle, we might leak such temporary files. We should regularly clean such files, based on the filename. For example, check the following patterns whether they are created using the mktemp approach and have this problem:

      • internal-*.lease.XXXXXX
      • dhclient-*.lease.XXXXXX
      • dnsmasq-*.leases.XXXXXX
      • NetworkManager.state.XXXXXX
      • timestamps.XXXXXX
      • seen-bssids.XXXXXX
      • NetworkManager-intern.conf.XXXXXX
      • no-auto-default.state.XXXXXX
      • ** maybe others? **

      III) do something about /var/lib/NetworkManager/dnsmasq-*.leases files. Probably works similar as lease files.

      IV) check whether /var/lib/NetworkManager/no-auto-default.state has a problem. I think it doesn't.

      What should happen, that we have a function prune_var_lib(), which iterates over all files in the directory, classifies them based on the well-known filenames, and checks whether they should be cleaned.

      then, ensure that we call this function at least once (e.g. during startup). And after every X writes of stale files (so we need to count the writes, and trigger a prune periodically).

            nm-team Network Management Team
            thaller@redhat.com Thomas Haller
            Network Management Team Network Management Team
            Desktop QE Desktop QE
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated:
              Resolved: