Date:      Sun, 23 Dec 2018 02:51:02 +0100
From:      Peter Eriksson <peter@ifm.liu.se>
To:        freebsd-fs@freebsd.org
Subject:   Re: Suggestion for hardware for ZFS fileserver
Message-ID:  <3F3EC02F-B969-43E3-B9B5-342504ED0962@ifm.liu.se>
In-Reply-To: <CAEW+ogaKTLsmXaUGk7rZWb7u2Xqja+pPBK5rduX0zXCjk=2zew@mail.gmail.com>
References:  <CAEW+ogZnWC07OCSuzO7E4TeYGr1E9BARKSKEh9ELCL9Zc4YY3w@mail.gmail.com> <C839431D-628C-4C73-8285-2360FE6FFE88@gmail.com> <CAEW+ogYWKPL5jLW2H_UWEsCOiz=8fzFcSJ9S5k8k7FXMQjywsw@mail.gmail.com> <4f816be7-79e0-cacb-9502-5fbbe343cfc9@denninger.net> <3160F105-85C1-4CB4-AAD5-D16CF5D6143D@ifm.liu.se> <YQBPR01MB038805DBCCE94383219306E1DDB80@YQBPR01MB0388.CANPRD01.PROD.OUTLOOK.COM> <D0E7579B-2768-46DB-94CF-DBD23259E74B@ifm.liu.se> <CAEW+ogaKTLsmXaUGk7rZWb7u2Xqja+pPBK5rduX0zXCjk=2zew@mail.gmail.com>

Can't really give you any generic recommendations, but on our Dell R730xd and
R740xd servers we use the Dell HBA330 SAS HBA card, also known as the "Dell
Storage Controller 12GB-SASHBA", which uses the "mpr" device driver. This is
an LSI3008-based controller and works really well. Only internal drives on
the Dell servers though (730xd and 740xd servers). Beware that this is not
the same as the "H330" RAID controller that Dell normally sells you. We had
to do a "special" in order to get the 10TB drives with 4K sectors with the
HBA330 controller, since at the time we bought them Dell would only sell us
the 10TB drives together with the H330 controller. So we bought the HBA330s
separately and swapped them in ourselves... And then we had to do a low-level
reformat of all the disks, since Dell by default delivers them formatted with
a nonstandard sector size (4160 bytes I think, or perhaps 4112) and
"Protection Information" enabled (used and understood by the H330 controller,
but not by FreeBSD when using HBAs). But that's easy to fix (it just takes an
hour or so per drive):

	# sg_format --size=4096 --fmtpinfo=0 /dev/da0
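
(If you want to check whether a drive still has the odd sector size or
Protection Information enabled before reformatting, something like this
should show it - the device name is just an example:)

	# sg_readcap --long /dev/da0      # reports the logical block length and the prot_en flag
	# camcontrol readcap da0 -h       # sector size and capacity as seen by CAM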

On our HP servers we use the HP Smart HBA H241 controller in HBA mode (set
via the BIOS configuration page) connected to external HP D6030 SAS shelves
(70 disks per shelf). This is an HP-special one that uses the "ciss" driver.
It also works fine.
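
(To sanity-check that the H241 really attached in HBA mode via the "ciss"
driver and that all the shelf disks are visible, something along these lines
works - exact output will differ:)

	# pciconf -lv | grep -A3 ciss     # confirm the controller attached to the ciss driver
	# camcontrol devlist | wc -l      # rough count of devices CAM sees (disks plus enclosures)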

- Peter

> On 22 Dec 2018, at 15:49, Sami Halabi <sodynet1@gmail.com> wrote:
>
> Hi,
>
> What SAS HBA card do you recommend for 16/24 internal ports and 2 external
> ports that are recognized and work well with FreeBSD ZFS?
> Sami
>
> On Saturday, 22 Dec 2018 at 2:48, Peter Eriksson <peter@ifm.liu.se> wrote:
>
>
> > On 22 Dec 2018, at 00:49, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >
> > Peter Eriksson wrote:
> > [good stuff snipped]
> >> This has caused some interesting problems...
> >>
> >> First thing we noticed was that booting would take forever... Mounting the 20-100k
> >> filesystems _and_ enabling them to be shared via NFS is not done efficiently at all
> >> (for each filesystem it re-reads /etc/zfs/exports (a couple of times) before appending
> >> one line to the end). Repeat 20-100,000 times... Not to mention the big kernel lock
> >> for NFS ("hold all NFS activity while we flush and reinstall all sharing information
> >> per filesystem") being done by mountd...
> > Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> > file systems would have been a large server. Scaling to 10,000 or more file
> > systems wasn't even conceivable back then.
>
> Yeah, for a normal user with a non-silly number of filesystems this is a
> non-issue. Anyway, it's the kind of issue that I like to think about how to
> solve. It's fun :-)
>
>
> >> Wish list item #1: A BerkeleyDB-based 'sharetab' that replaces the horribly
> >> slow /etc/zfs/exports text file.
> >> Wish list item #2: A reimplementation of mountd and the kernel interface to
> >> allow a "diff" of the contents of the DB-based sharetab above to be input
> >> into the kernel instead of the brute-force way it's done now.
> > The parser in mountd for /etc/exports is already an ugly beast and I think
> > implementing a "diff" version will be difficult, especially figuring out
> > what needs to be deleted.
>
> Yeah, I tried to decode it (this summer) and I think I sort of got the hang
> of it eventually.
>
>
> > I do have a couple of questions related to this:
> > 1 - Would your case work if there was an "add these lines to /etc/exports"?
> >     (Basically adding entries for file systems, but not trying to delete anything
> >      previously exported. I am not a ZFS guy, but I think ZFS just generates another
> >      exports file and then gets mountd to export everything again.)
>
> Yeah, the ZFS library that the zfs commands use just reads and updates the
> separate /etc/zfs/exports text file (and has mountd read both /etc/exports
> and /etc/zfs/exports). The problem is that basically what it does when you
> tell it to "zfs mount -a" (mount all filesystems in all zpools) is a big
> (pseudocode):
>
> for P in ZPOOLS; do
>   for Z in ZFILESYSTEMS-AND-SNAPSHOTS in $P; do
>     mount $Z
>     if $Z has the "sharenfs" option; then
>        open /etc/zfs/exports
>        read until a matching line is found and replace it with the options,
>          else if not found, append the options
>        close /etc/zfs/exports
>        signal mountd
>          (which then opens /etc/exports and /etc/zfs/exports and does its magic)
>     fi
>   done
> done
>
> All wrapped up in a Solaris compatibility layer in libzfs. Actually I think
> it even reads the /etc/zfs/exports file twice for each loop iteration due to
> some abstractions. Btw, things got really "fun" when the hourly snapshots we
> were taking (adding 10-20k new snapshots every hour, and we didn't expire
> them fast enough in the beginning) triggered the code above and that code
> took longer than 1 hour to execute (mountd was 100% busy getting signalled
> and rereading, flushing and reinstalling exports into the kernel all the
> time) and basically never finished. Luckily we didn't have any NFS clients
> accessing the servers at that time :-)
>
> This summer I wrote some code to use a Btree BerkeleyDB file instead, and
> modified the libzfs code and the mountd daemon to use that database for much
> faster lookups (no need to read the whole /etc/zfs/exports file all the
> time) and additions. It worked pretty well actually and wasn't that hard to
> add. I also wanted to add the possibility of specifying "exports" arguments
> Solaris-style, so one could say things like:
>
>         /export/staff   vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3
>
> But I never finished that (Solaris-style exports options) part...
>
> We've lately been toying with putting the NFS sharing stuff into a separate
> "private" ZFS attribute (separate from the official "sharenfs" one) and
> having another tool read those instead and generate another "exports" file,
> so that the file can be generated in "one go" and mountd signalled just once
> after all filesystems have been mounted. Unfortunately that would mean they
> won't be shared until after all of them have been mounted, but we think it
> would take less time all-in-all.
>
> We also modified the FreeBSD boot scripts so that we make sure to first
> mount the most important ZFS filesystems that are needed on the boot disks
> (not just /) and then mount (and share via NFS) the rest in the background,
> so we can log in to the machine as root early (no need for everything to
> have been mounted before giving us a login prompt).
>
> (Right now a reboot of the bigger servers takes an hour or two before all
> filesystems are mounted and exported.)
>
>
> > 2 - Are all (or maybe most) of these ZFS file systems exported with the same
> >      arguments?
> >      - Here I am thinking that a "default-for-all-ZFS-filesystems" line could be
> >         put in /etc/exports that would apply to all ZFS file systems not exported
> >         by explicit lines in the exports file(s).
> >      This would be fairly easy to implement and would avoid trying to handle
> >      1000s of entries.
>
> For us most have exactly the same exports arguments. (We set options on the
> top-level filesystems (/export/staff, /export/students etc.) and then all
> the home dirs inherit those.)
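>
> (For example - the pool/dataset names and options below are only
> illustrative; on FreeBSD the sharenfs value is a string of exports(5)
> options that the child filesystems inherit:)
>
>       # zfs set sharenfs="-sec=krb5:krb5i:krb5p -network 130.236.0.0 -mask 255.255.0.0" tank/export/staff
>       # zfs get -r sharenfs tank/export/staff | head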
>
> > In particular, #2 above could be easily implemented on top of what is already
> > there, using a new type of line in /etc/exports and handling that as a special
> > case by the NFS server code, when no specific export for the file system to the
> > client is found.
> >
> >> (I've written some code that implements item #1 above and it helps quite a bit.
> >> Nothing near production quality yet though. I have looked at item #2 a bit too
> >> but not done anything about it.)
> > [more good stuff snipped]
> > Btw, although I put the questions here, I think a separate thread discussing
> > how to scale to 10000+ file systems might be useful. (On freebsd-fs@ or
> > freebsd-current@. The latter sometimes gets the attention of more developers.)
>
> Yeah, probably a good idea!
>
> - Peter
>
> > rick
> >
> >
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"



