Date:      Tue, 4 Aug 2015 11:56:00 +0800
From:      Julian Elischer <julian@freebsd.org>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: changes to export whole FS hierarchy to mount it with one command on client? [changed subject]
Message-ID:  <55C037D0.1000606@freebsd.org>
In-Reply-To: <987522757.8576059.1438636467059.JavaMail.zimbra@uoguelph.ca>
References:  <795246861.20150801140429@serebryakov.spb.ru> <1363497421.7238055.1438428070047.JavaMail.zimbra@uoguelph.ca> <1593307781.20150801143052@serebryakov.spb.ru> <55BEE668.3080303@freebsd.org> <67101638.8226696.1438604713620.JavaMail.zimbra@uoguelph.ca> <55BFC58C.6030802@freebsd.org> <987522757.8576059.1438636467059.JavaMail.zimbra@uoguelph.ca>

On 8/4/15 5:14 AM, Rick Macklem wrote:
> Julian Elischer wrote:
>> On 8/3/15 8:25 PM, Rick Macklem wrote:
>>> Julian Elischer wrote:
>>>> On 8/1/15 7:30 PM, Lev Serebryakov wrote:
>>>>> Hello Rick,
>>>>>
>>>>> Saturday, August 1, 2015, 2:21:10 PM, you wrote:
>>>>>
>>>>>> To mount multiple file systems as one mount, you'll need to use NFSv4. I
>>>>>> believe
>>>>>> you will have to have a separate export entry in the server for each of
>>>>>> the file
>>>>>> systems.
>>>>>     So, /etc/exports needs to have BOTH v3-style exports & V4: root of
>>>>>     tree
>>>>>     line?
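(For reference, a minimal exports(5) sketch of the layout Lev is asking
about looks roughly like this; the paths, network and options below are
placeholders, not taken from this thread:

    # one export line per served file system (the "v3-style" entries)
    /data           -maproot=root -network 192.168.1.0 -mask 255.255.255.0
    /data/projects  -maproot=root -network 192.168.1.0 -mask 255.255.255.0
    /data/users     -maproot=root -network 192.168.1.0 -mask 255.255.255.0
    # plus a single V4: line naming the root of the NFSv4 tree
    V4: /data       -network 192.168.1.0 -mask 255.255.255.0

An NFSv4 client can then do one mount of the root, e.g.
"mount -t nfs -o nfsv4 server:/ /mnt", and cross into the
sub-filesystems, while NFSv3 clients still mount each file system
separately.)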
>>>> OR you can have a non-standard patch that pjd wrote to do recursive
>>>> mounts of sub-filesystems.
>>>> It is not supposed to happen according to the standard, but we have
>>>> found it useful.
>>>> Unfortunately it is written against the old NFS Server.
>>>>
>>>> Rick, if I gave you the original pjd patch for the old server, could
>>>> you integrate it into the new server as an option?
>>>>
>>> A patch like this basically inserts the file system volume identifier
>>> in the high order bits of the fileid# (inode# if you prefer), so that
>>> duplicate fileid#s don't show up in a "consolidated file system" (for
>>> want of a better term). It also replies with the same "fake" fsid for
>>> all volumes involved.
>>>
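As a concrete illustration, a minimal sketch in C of the packing Rick
describes might look like the following; the 24/40-bit split, the names
and the fake fsid value are assumptions for illustration, not pjd's
actual patch:

    #include <stdint.h>

    #define FILEID_BITS  40    /* low-order bits kept from the underlying fs */
    #define FILEID_MASK  ((UINT64_C(1) << FILEID_BITS) - 1)
    #define FAKE_FSID    UINT64_C(0x5a5a)  /* one fsid reported for every volume */

    /*
     * Pack a small per-volume index into the high-order bits of the
     * 64-bit NFSv3 fileid so that two volumes in the consolidated tree
     * can never hand out the same fileid.
     */
    static inline uint64_t
    consolidated_fileid(uint32_t volume_index, uint64_t fs_fileid)
    {
            return ((uint64_t)volume_index << FILEID_BITS) |
                (fs_fileid & FILEID_MASK);
    }

    /*
     * The scheme only stays unique while the underlying file system
     * never sets the high-order bits itself (see issue 2 below).
     */
    static inline int
    fileid_fits(uint64_t fs_fileid)
    {
            return ((fs_fileid & ~FILEID_MASK) == 0);
    }
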
>>> I see certain issues w.r.t. this:
>>> 1 - What happens when the exported volumes are disjoint and don't form
>>>       one tree? (I think any such option should be restricted to volumes
>>>       that form a tree, but I don't know an easy way to enforce that
>>>       restriction?)
>>> 2 - It would be fine at this point to use the high order bits of the
>>>       fileid#, since NFSv3 defines it as 64bits and FreeBSD's ino_t is
>>>       32bits. However, I believe FreeBSD is going to have to increase
>>>       ino_t to 64bits soon. (I hope such a patch will be in FreeBSD11.)
>>>       Once ino_t is 64bits, this option would have to assume that some
>>>       number of the high order bits of the fileid# are always 0.
>>>       Something like "the high order 24bits are always 0" would work ok
>>>       for a while, but eventually someone would build a file system large
>>>       enough to overflow the remaining 40bit field (I know that's a lot,
>>>       but some file systems already exceed 32bits for the # of fileids)
>>>       and cause trouble.
>>> 3 - You could get weird behaviour when the tree includes exports with
>>>       different export options. This discussion includes just that, and
>>>       NFSv3 clients don't expect things to change within a mount. (An
>>>       example would be having part of this consolidated tree require
>>>       Kerberos authentication. Another might be having parts of the
>>>       consolidated tree use different uid mapping for AUTH_SYS.)
>>> 4 - Some file systems (msdosfs, i.e. FAT) have limited capabilities
>>>       w.r.t. what the NFS server can do to the file system. If one of
>>>       these was embedded in the consolidated tree, then it could cause
>>>       confusion similar to #3.
>>>
>>> All in all, the "hack" is relatively easy to do, if:
>>> You use one kind of file system (for example ZFS), everything you are
>>> exporting forms one directory tree which is all exported in a compatible
>>> way, and you also "know" that all the fileid#s in the underlying file
>>> systems will fit in the low order K bits of the 64bit fileid#.
>>>
>>> My biggest concern is #2, once ino_t becomes 64bits.
>>>
>>> If the collective thinks this is a good idea despite the issues above and
>>> can propose a good way to do it, then it could be added. (Maybe an export
>>> flag for all the volumes that will participate in the "consolidated file
>>> system"? The exports(5) man page could then try to clearly explain the
>>> limitations of its use, etc. Even with that, I suspect some would misuse
>>> the option and cause themselves grief.)
>>>
>>> Personally, since NFSv4 does this correctly, I don't see a need to "hack
>>> it" for NFSv3, but I'll leave it up to the collective.
>>>
>>> rick
>>> ps: Julian, you might want to repost this under a separate subject line, so
>>>       people not interested in how ZFS can export multiple volumes without
>>>       separate entries will read it.
>>>
>> In our environment we need to export V3 (and maybe even V2) in a
>> single hierarchy, even though it's multiple ZFS filesystems.
>> It's not dissimilar to having a separate ZFS for each user, except in
>> this case it's a separate ZFS for each site.
>> The "modified ZFS" filesystems have very special characteristics. We
>> are only taking our very first nibbles (questions) at NFSv4; until
>> now it's all been NFSv3. Possibly we'd only have to support this for
>> NFSv3, if V4 can use its native mechanisms.
>>
>>
> Sure. You have a particular environment where it is useful and you understand
> how to use it in that situation. I could do it here in about 10 minutes and would
> do so if I needed it myself. The trick is that I understand what is going on and the
> limitations w.r.t. doing it.
>
> If you know your file systems are all in one directory hierarchy (tree), all are ZFS
> and none of them even generate fileid#s that don't fit in 32bits and you are exporting
> them all in the same way, it's pretty easy to do.
> Unfortunately, that isn't what generic NFS server support for FreeBSD does.
> (If this is done, I think it has to be somehow restricted to the above or at least
>   documented that it only works for the above cases.)
>
> Since an NFSv2 fileid# is 32bits, I don't believe this is practical for NFSv2
> and I don't think anyone would care. Since NFSv4 does this "out of the box",
> I think the question is whether or not it should be done for NFSv3?
>
> The challenge would be to put it in FreeBSD in a way that people who don't
> necessarily understand what is "behind the curtain" can use it effectively
> and not run into problems. (An example being the previous thread where the
> different file systems are being created with different characteristics for
> different users. That could be confusing if the sysadmin thought it was
> "one volume".)
> I'll leave whether or not to do this up to the collective. (This is yet another one
> of these "easy to code" but hard to tell if it is the correct thing to do situations.)
> If done I'd suggest:
> - Restricted to one file system type (ZFS or UFS or ...). The code would probably
>    have file system specifics in it. The correct way to do this will be different for
>    ZFS than UFS, I think?
> - A check for any fileid# that has high order bits set that would syslog an error.
> - Enabled by an export option, so it doesn't automatically apply to all file systems
>    on the server. This also provides a place for it to be documented, including limitations.
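To make the second bullet concrete, a userland-style sketch of that check
might look like the following; the function name, the warning latch and
the message text are made up for illustration (an actual version would
live inside the NFS server and would presumably use the kernel's logging
rather than syslog(3)):

    #include <stdint.h>
    #include <syslog.h>

    #define FILEID_BITS  40    /* assumed low-order bits reserved for the fs */
    #define FILEID_MASK  ((UINT64_C(1) << FILEID_BITS) - 1)

    static int fileid_overflow_warned;

    /*
     * If a file system ever hands back a fileid with high-order bits set,
     * log it once instead of silently risking duplicate fileids on the wire.
     */
    static void
    check_consolidated_fileid(uint64_t fs_fileid)
    {
            if ((fs_fileid & ~FILEID_MASK) != 0 && !fileid_overflow_warned) {
                    fileid_overflow_warned = 1;
                    syslog(LOG_ERR,
                        "fileid 0x%jx overflows the consolidated-export field; "
                        "duplicate fileids are now possible",
                        (uintmax_t)fs_fileid);
            }
    }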

Well obviously I would like it, because we need it and I don't want to 
have to maintain patches going forward.
If it is in the tree YOU work on then it would automatically get 
updated as needed. The mount option is nice, but at the moment we just 
have it wired on, and only export a single (PZFS) hierarchy. (PZFS is 
our own heavily modified version of ZFS that uses Amazon storage(*) as 
a block backend in parallel with the local drives, which are more of a 
cache.. the cloud is authoritative.)

(*) a gross simplification.

Different parts of the hierarchy are actually different cloud 
'buckets' (e.g. theoretically some could be Amazon and some could be 
Google cloud storage).  These sub-filesystems are unified as a 
hierarchy of ZFS filesystems into a single storage hierarchy via PZFS 
and exported to the user via NFS and CIFS/Samba.

If I need to maintain a separate set of changes for the option then 
that's life, but it's of course preferable to me to have it upstreamed.

p.s. to any Filesystem types.. yes we are hiring FreeBSD filesystem 
people..
http://panzura.com/company/careers-panzura/senior-software-engineer/.. 
resumes via me for fast track  :-)   ..





>
> Anyhow, if anyone has an opinion on whether or not this should be in FreeBSD, please post, rick
>



