Date:      Sat, 31 Dec 2016 21:41:31 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Josh Paetzel <jpaetzel@FreeBSD.org>
Cc:        freebsd-fs@freebsd.org, Rick Macklem <rmacklem@uoguelph.ca>, ash@ixsystems.com
Subject:   Re: NFS readdirplus on ZFS with > 1 billion files
Message-ID:  <20161231194131.GC1923@kib.kiev.ua>
In-Reply-To: <1483207716.3465220.833841385.061386FF@webmail.messagingengine.com>
References:  <1483179971.3381747.833629401.5EF242B8@webmail.messagingengine.com> <20161231133350.GU1923@kib.kiev.ua> <1483207716.3465220.833841385.061386FF@webmail.messagingengine.com>

On Sat, Dec 31, 2016 at 12:08:36PM -0600, Josh Paetzel wrote:
> 
> 
> On Sat, Dec 31, 2016, at 07:33 AM, Konstantin Belousov wrote:
> > On Sat, Dec 31, 2016 at 04:26:11AM -0600, Josh Paetzel wrote:
> > > We've been chasing this bug for a very long time and finally managed to
> > > pin it down.  When a ZFS dataset has more than 1 billion files on it and
> > > an NFS client does a readdirplus the file handles for files with high
> > > znode/inode numbers get truncated due to a 64 -> 32 bit conversion.
> > > 
> > > https://reviews.freebsd.org/D9009
> > > 
> > > This isn't a fix so much as a workaround.  From a performance standpoint
> > > it's the same as if the client mounts with noreaddirplus; sometimes it's
> > > a win, sometimes it's a loss.  CPU usage does go up on the server a bit.
> > > 
> > 
> > Can you point to the places in the ZFS code where the truncation occurs?
> > I have no idea about the ZFS code, and my question is mainly whether the
> > truncation just occurs due to the different types of ino_t and the ZFS
> > node id, or whether some code actively does the range reduction.
> > 
> > My question is in the context of the long-dragging ino64 work, which
> > might be finished in the foreseeable future.  In particular, I am curious
> > whether just using the patched kernel fixes your issue.  See
> > https://github.com/FreeBSDFoundation/freebsd/tree/ino64
> > although I do not make any claim about the state of the code yet.
> > 
> > Your patch, after a review, might still be useful for stable/10 and 11,
> > since I do not think that ino64 has any bits which could be merged.
> 
> That's a great question and I will attempt to answer the best I can,
> however I am cc'ing Ash Gokhale and Rick Macklem here because they
> understand the issue better and might be able to provide a better
> answer.
> 
> My understanding is the issue occurs here:
> 
> http://fxr.watson.org/fxr/source/fs/nfsserver/nfs_nfsdport.c?v=FREEBSD10#L2090
> 
> This codepath casts the dirent d->fileno from 32 to 64 bits to fill in the
> NFS fileno, but the legacy struct dirent->d_fileno is still 32 bits wide.
> 
> I'm not entirely sure this is a ZFS-specific issue at all; I've never
> tried to put 1 billion files on a UFS filesystem to see what would
> happen. (I suspect this issue with the NFS server would be the least of
> your issues.)
UFS2 inode numbers are 32 bits wide.  If by billion you mean 10^12, you
cannot put that many files on a UFS volume.
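
To make the 32-bit limit concrete, here is a minimal sketch in C of what
happens when a 64-bit object id is stored in a 32-bit field.  The names
legacy_ino_t, narrow_ino and ino_fits are hypothetical and made up for this
illustration; this is not the NFS server or ZFS code, only the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a legacy 32-bit inode number field. */
typedef uint32_t legacy_ino_t;

/*
 * Narrow a 64-bit id the way an assignment to a 32-bit field does:
 * only the low 32 bits are kept, the high bits are silently dropped.
 */
static legacy_ino_t
narrow_ino(uint64_t id)
{
	return ((legacy_ino_t)id);
}

/* True if the id survives the trip through the 32-bit field unchanged. */
static bool
ino_fits(uint64_t id)
{
	return ((uint64_t)narrow_ino(id) == id);
}
```

With these helpers, ino_fits(1000000000ULL) is true, since 10^9 is below
2^32 (about 4.29 * 10^9), while ino_fits(1000000000000ULL) is false.  An id
such as 0x100000001 narrows to 1, so it becomes indistinguishable from the
object whose id really is 1.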

> 
> I agree the correct solution is the ino64 work.  I'm fine with this hack
> going directly in to 11-STABLE and 10-STABLE. (In fact I think that's
> the best solution)
All commits should go into HEAD first.  I doubt that ino64 could land in
HEAD in less than a month (and 2-3 months or more is a safer estimate,
IMO).

> 
> Another thing we kicked around was making this hack a sysctl, such that
> you could manually activate it if a filesystem went over the threshold
> for the bug to occur.  No one is completely convinced we fully
> understand the performance implications of this patch.


