<!-- Mail as an attachment to: monthly@freebsd.org -->
<project cat='kern'>
  <title>Kernel Vnode Cache Tuning</title>

  <contact>
    <person>
      <name>
        <given>Kirk</given>
        <common>McKusick</common>
      </name>
      <email>mckusick@mckusick.com</email>
    </person>
    <person>
      <name>
        <given>Bruce</given>
        <common>Evans</common>
      </name>
      <email>bde@freebsd.org</email>
    </person>
    <person>
      <name>
        <given>Konstantin</given>
        <common>Belousov</common>
      </name>
      <email>kib@freebsd.org</email>
    </person>
    <person>
      <name>
        <given>Peter</given>
        <common>Holm</common>
      </name>
      <email>pho@freebsd.org</email>
    </person>
    <person>
      <name>
        <given>Mateusz</given>
        <common>Guzik</common>
      </name>
      <email>mjg@freebsd.org</email>
    </person>
  </contact>

  <links>
    <url href="https://reviews.freebsd.org/rS292895">MFC to 10</url>
  </links>

  <body>
    <p>This project has been completed.</p>
    <p>This project includes changes to better manage the vnode freelist and to streamline the allocation and freeing of vnodes.</p>
    <p>The vnode cache recycling was reworked to meet targets for free and unused vnodes.  Free vnodes are rarely completely free; they are just the ones that are cheap to recycle.  Usually they are for files which have been stat'd but not read, and they usually still have inode and namecache data attached to them.  The free target is the preferred minimum size of a sub-cache consisting mostly of such files.  The system balances the size of this sub-cache with its complement to try to prevent either from thrashing while the other is relatively inactive.  The targets express a preference for the best balance.</p>
    <p>"Above" this target there are two further targets (watermarks) related to the recycling of free vnodes.  In the best-operating case, the cache is exactly full, the free list has a size between vlowat and vhiwat above the free target, and recycling from it together with normal use maintains this state.  Sometimes the free list is below vlowat or even empty, but this state is even better for immediate use, provided the cache is not full.  Otherwise, vnlru_proc() runs to reclaim enough vnodes (usually non-free ones) to reach one of these states.  The watermarks are currently hard-coded at 4% and 9% of the available space above the free target.  These values, and the default of 25% for wantfreevnodes, are too large if the memory size is large.  For example, 9% of 75% of MAXVNODES is more than 566000 vnodes to reclaim whenever vnlru_proc() becomes active.</p>
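    <p>The arithmetic behind these figures can be sketched as follows.  This is an illustrative Python calculation, not the kernel code: the helper name watermarks() is hypothetical, the percentages are simply the ones quoted above, and the value 8388608 used for MAXVNODES is an assumption, chosen because it reproduces the quoted figure.</p>

```python
def watermarks(max_vnodes, want_free_pct=25, lowat_pct=4, hiwat_pct=9):
    """Return (wantfreevnodes, gap, vlowat, vhiwat) using integer arithmetic.

    Hypothetical helper: it mirrors the percentages quoted in the report,
    not the exact expressions used in sys/kern/vfs_subr.c.
    """
    want_free = max_vnodes * want_free_pct // 100  # free target (25% by default)
    gap = max_vnodes - want_free                   # available space above the target
    vlowat = gap * lowat_pct // 100                # low watermark above the target
    vhiwat = gap * hiwat_pct // 100                # high watermark above the target
    return want_free, gap, vlowat, vhiwat


# Assumed value for MAXVNODES (8 * 1024 * 1024); the report does not state it.
ASSUMED_MAXVNODES = 8 * 1024 * 1024

want_free, gap, vlowat, vhiwat = watermarks(ASSUMED_MAXVNODES)
print("wantfreevnodes:", want_free)
print("available space:", gap)
print("vlowat:", vlowat)
print("vhiwat:", vhiwat)
```

    <p>Under that assumption the high watermark works out to 566231 vnodes, which is consistent with the "more than 566000" figure and shows how fixed percentages scale into very large reclaim batches on large-memory machines.</p>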
    <p>The vfs.vlru_alloc_cache_src sysctl has been removed.  New code frees namecache sources as the last chance to satisfy the highest watermark, instead of selecting the source vnodes randomly.  This provides good enough behaviour to keep vn_fullpath() working in most situations.  Filesystem layouts with deep trees, for which the removed knob was required, are thus handled automatically.</p>
    <p>Previously, the kernel fully initialized a vnode on every allocation and fully released it on every free.  These costs are not trivial: allocation starts by zeroing a large structure, then initializes a mutex, a lock manager lock, an rw lock, four lists, and six pointers.  The vfs.vnodes_created counter shows that these operations are done millions of times an hour on a busy machine.</p>
    <p>As a performance optimization, this update uses the uma_init and uma_fini routines to do these initializations and cleanups only as vnodes enter and leave the vnode_zone.  With this change, the initializations are done only kern.maxvnodes times at system startup and then only rarely again.  The frees are done only if the vnode_zone shrinks, which never happens in practice.  For those curious about the avoided work, the vnode_init() and vnode_fini() functions in sys/kern/vfs_subr.c contain the code that has been removed from the main vnode allocation/free path.</p>
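    <p>The idea of paying the setup cost only when an object enters or leaves the zone can be illustrated with a toy object cache.  This is a minimal Python sketch, not the UMA interface: the Zone class and its methods are hypothetical, the vnode_init() and vnode_fini() stand-ins only count calls, and the caller is assumed to hand objects back in their initialized state.</p>

```python
class Zone:
    """Toy object cache with init/fini hooks, loosely modelled on a UMA zone.

    Hypothetical class: the real uma(9) interface is a C API and differs in detail.
    """

    def __init__(self, factory, init, fini):
        self._factory = factory  # creates raw, uninitialized storage
        self._init = init        # expensive one-time setup hook
        self._fini = fini        # expensive teardown hook
        self._cache = []         # initialized objects ready for reuse

    def alloc(self):
        if self._cache:
            return self._cache.pop()  # reuse: no setup cost is paid
        obj = self._factory()
        self._init(obj)               # the object enters the zone
        return obj

    def free(self, obj):
        self._cache.append(obj)       # the object stays initialized

    def drain(self):
        while self._cache:
            self._fini(self._cache.pop())  # the object leaves the zone


counts = {"init": 0, "fini": 0}


def vnode_init(vp):
    """Stand-in for the one-time lock and list setup."""
    counts["init"] += 1
    vp["lock"] = object()
    vp["lists"] = [[] for _ in range(4)]


def vnode_fini(vp):
    """Stand-in for the matching teardown."""
    counts["fini"] += 1
    vp.clear()


zone = Zone(dict, vnode_init, vnode_fini)
for _ in range(1000):
    vp = zone.alloc()
    zone.free(vp)
print(counts)
```

    <p>After a thousand allocate and free cycles the counters show a single init call and no fini calls; the teardown hook runs only when drain() shrinks the cache, which mirrors the claim that the frees happen only if the zone shrinks.</p>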
  </body>

  <help></help>
</project>
