Date:      Sun, 19 Jun 2016 19:29:12 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Jordan Hubbard <jkh@ixsystems.com>
Cc:        Chris Watson <bsdunix44@gmail.com>, freebsd-fs <freebsd-fs@freebsd.org>,  Alexander Motin <mav@freebsd.org>
Subject:   Re: pNFS server Plan B
Message-ID:  <1845469514.159182764.1466378952929.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <B2907C1F-D32A-48FB-8E58-209E6AF1E86D@ixsystems.com>
References:  <1524639039.147096032.1465856925174.JavaMail.zimbra@uoguelph.ca> <D20C793E-A2FD-49F3-AD88-7C2FED5E7715@ixsystems.com> <7E27FA25-E18F-41D3-8974-EAE1EACABF38@gmail.com> <B2907C1F-D32A-48FB-8E58-209E6AF1E86D@ixsystems.com>

Jordan Hubbard wrote:
>
> > On Jun 18, 2016, at 6:14 PM, Chris Watson <bsdunix44@gmail.com> wrote:
> >
> > Since Jordan brought up clustering, I would be interested to hear Justin
> > Gibbs' thoughts here. I know about a year ago he was asked on an "after
> > hours" video chat hosted by Matt Ahrens about a feature he would really
> > like to see, and he mentioned he would really like, in a universe filled
> > with time and money I'm sure, to work on a native clustering solution for
> > FreeBSD. I don't know if he is subscribed to the list, and I'm certainly
> > not throwing him under the bus by bringing his name up, but I know he has
> > at least been thinking about this for some time and probably has some
> > value to add here.
>
> I think we should also be careful to define our terms in such a discussion.
> Specifically:
>
> 1. Are we talking about block-level clustering underneath ZFS (e.g. HAST or
> ${somethingElse}) or otherwise incorporated into ZFS itself at some low
> level?  If you Google for “High-availability ZFS” you will encounter things
> like RSF-1 or the somewhat more mysterious Zetavault
> (http://www.zeta.systems/zetavault/high-availability/) but it’s not entirely
> clear how these technologies work, they simply claim to “scale-out ZFS” or
> “cluster ZFS” (which can be done within ZFS or one level above and still
> probably pass the Marketing Test for what people are willing to put on a web
> page).
>
> 2. Are we talking about clustering at a slightly higher level, in a
> filesystem-agnostic fashion which still preserves filesystem semantics?
>
> 3. Are we talking about clustering for data objects, in a fashion which does
> not necessarily provide filesystem semantics (a sharding database which can
> store arbitrary BLOBs would qualify)?
>
For the pNFS use case I am looking at, I would say #2.

I suspect #1 sits at a low enough level that redirecting I/O via the pNFS layouts
isn't practical, since ZFS is taking care of block allocations, etc.

I see #3 as a separate problem space, since NFS deals with files and not objects.
However, GlusterFS maps file objects on top of the POSIX-like FS, so I suppose that
could be done at the client end. (What glusterfs.org calls SwiftOnFile, I think?)
It is also possible to map POSIX files onto file objects, but that sounds like more
work, which would need to be done under the NFS service.
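
To make the first direction concrete, here is roughly the kind of mapping I
mean: each object simply lands as a regular file under a brick's export
directory. This is only a sketch; the account/container/object naming and the
paths are just for illustration, not SwiftOnFile's actual code.

/*
 * Sketch only: map an object name onto a path in the underlying POSIX FS,
 * the way a SwiftOnFile-style layer does.  Names and paths are illustrative.
 */
#include <limits.h>
#include <stdio.h>

static int
object_to_path(const char *brick, const char *account, const char *container,
    const char *object, char *path, size_t pathlen)
{
	int n;

	/* Each object is just a regular file under the brick directory. */
	n = snprintf(path, pathlen, "%s/%s/%s/%s",
	    brick, account, container, object);
	return (n < 0 || (size_t)n >= pathlen) ? -1 : 0;
}

int
main(void)
{
	char path[PATH_MAX];

	if (object_to_path("/export/brick0", "acct", "photos", "cat.jpg",
	    path, sizeof(path)) == 0)
		printf("%s\n", path);	/* /export/brick0/acct/photos/cat.jpg */
	return 0;
}

Going the other way (POSIX files onto file objects) is where the extra work
comes in, since directory semantics, rename and the like have to be
synthesized on top of the object store.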

> For all of the above:  Are we seeking to be compatible with any other
> mechanisms, or are we talking about a FreeBSD-only solution?
>
> This is why I brought up glusterfs / ceph / RiakCS in my previous comments -
> when talking to the $users that Rick wants to involve in the discussion,
> they rarely come to the table asking for “some or any sort of clustering,
> don’t care which or how it works” - they ask if I can offer an S3 compatible
> object store with horizontal scaling, or if they can use NFS in some
> clustered fashion where there’s a single namespace offering petabytes of
> storage with configurable redundancy such that no portion of that namespace
> is ever unavailable.
>
I tend to think of this last case as the target for any pNFS server. The basic
idea is to redirect the I/O operations to wherever the data is actually stored,
so that I/O performance doesn't degrade with scale.
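
For anyone who hasn't looked at the NFSv4.1 layouts, this is roughly where the
redirection happens. The struct below is only a simplified paraphrase, not the
XDR wire format from RFC 5661 or anything in the kernel sources.

/*
 * Simplified paraphrase of an NFSv4.1 file layout (RFC 5661).  The names
 * are made up for illustration; this is not the actual wire format.
 */
#include <stdint.h>

struct example_fh {			/* stand-in for an NFS file handle */
	uint32_t	len;
	unsigned char	data[128];
};

struct example_file_layout {
	uint64_t	offset;		/* byte range this layout covers */
	uint64_t	length;
	uint32_t	iomode;		/* read-only or read/write */
	unsigned char	deviceid[16];	/* names the list of data servers */
	uint32_t	stripe_unit;	/* bytes sent to each DS in turn */
	uint32_t	fh_count;	/* one file handle per data server */
	struct example_fh *fh_list;	/* handles to use on the DSes */
};

The client OPENs at the metadata server, does a LAYOUTGET to fetch something
like the above plus a GETDEVICEINFO to turn the device ID into data server
addresses, and then issues its READs/WRITEs directly against the data servers.
Everything is keyed on file handles, well above the level where ZFS allocates
blocks, which is why I don't see #1 as a fit.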

If redundancy is a necessary feature, then maybe Plan A is preferable to Plan B,
since GlusterFS does provide for redundancy and resilvering of lost copies, at
least from my understanding of the docs on gluster.org.

I'd also like to see how GlusterFS performs on a typical Linux setup.
Even without having the nfsd use FUSE, access of GlusterFS via FUSE results in crossing
user (syscall on mount) --> kernel --> user (glusterfs daemon) within the client machine,
if I understand how GlusterFS works. Then the gluster brick server's glusterfsd daemon does
file system syscall(s) to get at the actual file on the underlying FS (xfs or ZFS or ...).
As such, there are already a lot of user<->kernel boundary crossings.
I wonder how much delay is added by the extra nfsd step for metadata?
- I can't say much about performance of Plan A yet, but metadata operations are slow
  and latency seems to be the issue. (I actually seem to get better performance by
  disabling SMP, for example.)
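
To put rough numbers on the metadata latency, even something as crude as timing
a stat() loop against the mount under test would do. Just a sketch; the path
and iteration count are arbitrary, and client-side attribute caching will hide
most of the round trips unless it is disabled or a different file is used on
each pass.

/*
 * Crude metadata-latency probe: time a loop of stat() calls against a path
 * on the mount under test.  Nothing here is specific to pNFS or GlusterFS.
 */
#include <sys/stat.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	struct stat sb;
	struct timeval start, end;
	int i, iters = 10000;
	double usec;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <path> [iterations]\n", argv[0]);
		return 1;
	}
	if (argc > 2)
		iters = atoi(argv[2]);

	gettimeofday(&start, NULL);
	for (i = 0; i < iters; i++) {
		if (stat(argv[1], &sb) == -1) {
			perror("stat");
			return 1;
		}
	}
	gettimeofday(&end, NULL);

	usec = (end.tv_sec - start.tv_sec) * 1e6 +
	    (end.tv_usec - start.tv_usec);
	printf("%d stat() calls, %.1f us average\n", iters, usec / iters);
	return 0;
}

Running that against a local FS, a plain NFS mount and the GlusterFS paths
should at least show how much each extra hop costs per operation.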

> I’d be interested in what Justin had in mind when he asked Matt about this.
> Being able to “attach ZFS pools to one another” in such a fashion that all
> clients just see One Big Pool and ZFS’s own redundancy / snapshotting
> characteristics magically apply to the überpool would be Pretty Cool,
> obviously, and would allow one to do round-robin DNS for NFS such that any
> node could serve the same contents, but that also sounds pretty ambitious,
> depending on how it’s implemented.
>
This would probably work with the extant nfsd and wouldn't have a use for pNFS.
I also agree that this sounds pretty ambitious.

rick

> - Jordan
>
>


