Date:      Thu, 5 Feb 2015 16:16:55 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        bugzilla-noreply@freebsd.org
Cc:        freebsd-bugs@freebsd.org
Subject:   Re: [Bug 197336] find command cannot see more than 32765 subdirectories when using ZFS
Message-ID:  <20150205150044.M1011@besplex.bde.org>
In-Reply-To: <bug-197336-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-197336-8@https.bugs.freebsd.org/bugzilla/>

On Wed, 4 Feb 2015 bugzilla-noreply@freebsd.org wrote:

> Created attachment 152566
>  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=152566&action=edit
> python script to generate a bunch of subdirectories with files in them

This may be considered a feature -- it detected a bad script that created
too many files.

> When a directory has more than 32765 subdirectories in it, the find command
> fails to find all of the contents if the find command is executed in a ZFS
> filesystem.

FreeBSD only supports file systems that support at most 32767 links.  This
is mainly a problem for subdirectories, since each subdirectory has a ".."
link to the same parent.  FreeBSD could support at most 65535 links; that
would break API compatibility, but the API is already broken.  More than
that would break binary compatibility.  The limit of 65535 comes from
nlink_t being 16 bits unsigned, and the limit of 32767 is a bug that has
survived for more than 20 years to keep the API bug-for-bug compatible.
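
Roughly, the types and constants involved look like this (a sketch, not
verbatim from any header, and the details vary between versions):

    typedef uint16_t nlink_t;      /* st_nlink: can represent at most 65535 */
    #define LINK_MAX        32767  /* advertised limit; the 20-year-old API bug */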

There are many bugs in this support.  Most are in individual file systems.
At the top level, the only known bug is that LINK_MAX is defined at all
(as 32767).  Defining it means that the limit {LINK_MAX}, i.e.,
pathconf(path, _PC_LINK_MAX), is the same for all files on all file
systems, but FreeBSD supports many file systems with widely varying
{LINK_MAX}.
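
A small test program (mine, not part of the bug report) shows the mismatch
by comparing the compile-time LINK_MAX with the per-file limit reported by
pathconf():

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
            const char *path = argc > 1 ? argv[1] : ".";
            long lim = pathconf(path, _PC_LINK_MAX);

            /* The compile-time constant that naive software uses. */
            printf("LINK_MAX           = %d\n", LINK_MAX);
            /* The per-file limit that should really be used. */
            if (lim == -1)
                    printf("{LINK_MAX} for %s: indeterminate or error\n", path);
            else
                    printf("{LINK_MAX} for %s = %ld\n", path, lim);
            return (0);
    }

On zfs this should print the bogus INT_MAX value discussed below, while
stat() on the same file system never reports more than 32767 links.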

Some file systems actually implement {LINK_MAX} correctly as the limit
that applies to them:
- this is easy if it is <= LINK_MAX.  If it is < LINK_MAX, this is
   incompatible with the definition of LINK_MAX, but any software that is
   naive or broken enough to use LINK_MAX probably won't notice any problem.
- if it is > LINK_MAX but <= 65535, then returning the correct limit in
   pathconf() is again incompatible with LINK_MAX being smaller, and this
   now breaks the naive/broken software (e.g., arrays sized with LINK_MAX
   may be overrun).
- if it is > 65535, then FreeBSD cannot support the file system properly.
   However, if there are no files with more than 65535 links at mount time,
   then it is good enough to maintain this invariant.  The python script
   should break trying to create the 65536th file in this case (even earlier
   on file systems with a smaller limit).  If there is just one file with
   more than 65535 links, then the file is not supported and perhaps all
   operations on it should fail, starting with stat() to see what it is,
   since stat() cannot return its correct link count and has no way of
   reporting this error except by failing the whole syscall (see the
   sketch below).
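
A hypothetical sketch (not existing FreeBSD code) of what "failing the
whole syscall" could look like in a file system's getattr, using EOVERFLOW
instead of silently clamping:

    /*
     * Hypothetical getattr fragment: 'real_nlink' stands for whatever
     * link count the on-disk format provides.
     */
    if (real_nlink > 65535)
            return (EOVERFLOW);     /* refuse to misreport; stat() fails */
    vap->va_nlink = real_nlink;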

zfs apparently supports a large number of links, but has many bugs:
- in pathconf() it says that {LINK_MAX} is INT_MAX for all files
- in stat() (VOP_GETATTR()) it breaks this even for files with a link
   count between 32768 and 65535 inclusive, by clamping to LINK_MAX
   = 32767.
This inconsistency explains the behaviour seen.  The python script
might be sophisticated to a fault and believe the broken {LINK_MAX}:
it might do fancy splitting of subdirectories to avoid hitting the
limit, but do no splitting here since the advertised limit is large.  Then
find might be confused by stat() returning the clamped number of
links.  I suspect that the actual reasons are more complicated.  find
doesn't use link counts much directly, but it uses fts which probably
makes critical use of them.
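
One way to see the clamping directly (a quick demonstration program of
mine, not from the PR) is to compare the parent's st_nlink with the number
of directory entries actually returned by readdir():

    #include <sys/stat.h>

    #include <dirent.h>
    #include <stdint.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
            const char *path = argc > 1 ? argv[1] : ".";
            struct dirent *dp;
            struct stat sb;
            DIR *dirp;
            long ndirs = 0;

            if (stat(path, &sb) == -1 || (dirp = opendir(path)) == NULL)
                    return (1);
            while ((dp = readdir(dirp)) != NULL)
                    if (dp->d_type == DT_DIR)   /* d_type is filled in by zfs/ufs */
                            ndirs++;
            closedir(dirp);
            /*
             * With the traditional ".." accounting, st_nlink should equal
             * the DT_DIR entries counted here ("." + ".." + subdirectories);
             * a clamped st_nlink falls short once there are many subdirs.
             */
            printf("st_nlink = %ju, DT_DIR entries = %ld\n",
                (uintmax_t)sb.st_nlink, ndirs);
            return (0);
    }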

nfs is much more broken than zfs here.  The server file system may
support anything for {LINK_MAX} and st_nlink.  nfs seems to blindly
assign the server values (except for {LINK_MAX} in the v2 case, it
invents a value).  So if {LINK_MAX} > 65535 on the server, the large
server value is normally returned (not truncated since rlim_t is large
enough for anything).  This matches the zfs behaviour of returning a
larger-than-possible value.  But if st_nlink > 65535 on the server,
it is blindly truncated to a value <= 65535 (possibly 0, and not
negative since nlink_t is unsigned.  Oops, va_nlink is still a signed
short, so negative values occur too).  This is more dangerous than the
clamping in zfs.

nfs mostly uses va_nlink internally, and uses it in a critical way
for at least the test (va_nlink > 1).  Truncation to a signed value
breaks this for all values that were between 32768 and 65535 before
truncation.  Truncation to an unsigned value would have only broken
it for 65536.  Similarly for all larger values congruent mod 65536
(or 32768).
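
The arithmetic, just to illustrate the difference between the two kinds of
truncation (plain C, nothing nfs-specific):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uint32_t wire[] = { 40000, 65535, 65536, 65537 };

            for (int i = 0; i < 4; i++) {
                    /* Like storing into the signed short va_nlink ... */
                    int16_t s = (int16_t)wire[i];   /* impl.-defined; wraps on common ABIs */
                    /* ... versus a 16-bit unsigned nlink_t. */
                    uint16_t u = (uint16_t)wire[i];

                    printf("%u -> signed %d (>1: %d), unsigned %u (>1: %d)\n",
                        (unsigned)wire[i], s, s > 1, (unsigned)u, u > 1);
            }
            return (0);
    }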

> If the same command is executed in another filesystem that FreeBSD supports
> that also supports large counts of subdirectories, the find command sees
> everything.  I've confirmed the correct behavior with both Reiserfs and
> unionfs.  So it appears to be something about the interaction between find and
> ZFS that triggers the bug.

It is impossible for the other file systems to work much better.  Perhaps
they work up to 65535, or have the correct {LINK_MAX} and the python script
is smart enough to avoid exceeding it.  I doubt that python messes with
{LINK_MAX}, but creation of subdirectories should stop when the advertised
limit is hit, and python or the script should handle that, possibly just
by stopping.
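
For what it's worth, a script (or a C rewrite of one) can detect the limit
being hit cleanly, since mkdir(2) is supposed to fail with EMLINK once the
parent's link count would exceed {LINK_MAX}.  A minimal sketch, with a
hypothetical target count:

    #include <sys/stat.h>

    #include <errno.h>
    #include <stdio.h>

    int
    main(void)
    {
            char name[32];

            for (long i = 0; i < 100000; i++) {     /* hypothetical target */
                    snprintf(name, sizeof(name), "d%06ld", i);
                    if (mkdir(name, 0755) == -1) {
                            if (errno == EMLINK)
                                    printf("parent link limit hit after %ld dirs\n", i);
                            else
                                    perror("mkdir");
                            break;  /* stop rather than silently losing directories */
                    }
            }
            return (0);
    }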

Bruce


