Date:      Sun, 15 Jun 2014 11:00:24 -0500
From:      Kevin Day <toasty@dragondata.com>
To:        Dennis Glatting <dg@pki2.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: [Fwd: Re: Large ZFS arrays?]
Message-ID:  <F071839B-ED6C-4515-B7C1-D7327CEF12B7@dragondata.com>
In-Reply-To: <1402846984.4722.363.camel@btw.pki2.com>
References:  <1402846984.4722.363.camel@btw.pki2.com>


On Jun 15, 2014, at 10:43 AM, Dennis Glatting <dg@pki2.com> wrote:
>
> Total. I am looking at three pieces in total:
>
> * Two 1PB storage "blocks" providing load sharing and
>   mirroring for failover.
>
> * One 5PB storage block for on-line archives (3-5 years).
>
> The 1PB nodes will be divided into something that makes sense, such as
> multiple SuperMicro 847 chassis with 3TB disks providing some number of
> volumes. Division is a function of application, such as 100TB RAIDz2
> volumes for bulk storage, whereas smaller 8TB volumes are for active
> data such as iSCSI, databases, and home directories.
>
> Thanks.


We're currently using multiples of the SuperMicro 847 chassis with 3TB
and 4TB drives, and LSI 9207 controllers. Each 45-drive array is
configured as four 11-drive raidz2 groups, plus one hot spare.
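
Roughly, that layout is one "zpool create" per JBOD; just a sketch, with
"tank1" and da0..da44 as placeholder pool/device names (adjust to however
your drives actually enumerate):

  # one pool per 45-drive chassis: four 11-disk raidz2 vdevs + one hot spare
  zpool create tank1 \
    raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  da10 \
    raidz2 da11 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 \
    raidz2 da22 da23 da24 da25 da26 da27 da28 da29 da30 da31 da32 \
    raidz2 da33 da34 da35 da36 da37 da38 da39 da40 da41 da42 da43 \
    spare  da44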

A few notes:

1) I'd highly recommend against grouping them together into one giant
zpool unless you really, really have to. We just spent a lot of time
redoing everything so that each 45-drive array is its own
zpool/filesystem. You're otherwise putting all your eggs into one very
big basket, and if something went wrong you'd lose everything rather
than just a subset of your data. If you don't do this, you'll almost
certainly have to run with sync=disabled, or the number of sync
requests hitting every drive will kill write performance.
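
For what it's worth, sync is a per-dataset property; a sketch of what
setting it looks like ("tank1" and "tank1/scratch" are placeholder names):

  # trades durability of the most recent writes for write throughput
  zfs set sync=disabled tank1
  # or limit it to datasets that can tolerate losing in-flight writes
  zfs set sync=disabled tank1/scratch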

2) You definitely want a JBOD controller instead of a smart RAID
controller. The LSI 9207 works pretty well, but when you exceed 192
drives it complains on boot-up about running out of heap space and makes
you press a key to continue, after which it works fine. There is a very
recently released firmware update for the card that seems to fix this,
but we haven't completed testing yet. You'll also want to increase
hw.mps.max_chains. The driver warns you when you need to, but you need
to reboot to change this, and you're probably only going to discover
that under heavy load.
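
For reference, that tunable is set from /boot/loader.conf and only takes
effect at boot; the value below is just an example, size it to whatever
the driver's warning asks for:

  # /boot/loader.conf
  # raise from the default (2048 in the mps(4) versions we've used) when
  # the driver warns about running out of chain frames
  hw.mps.max_chains="4096"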

3) We've played with L2ARC SSD devices and aren't seeing much gain. It
appears that our active data set is so large that it'd need a huge SSD
to even hit a small percentage of our frequently used files. Setting
"secondarycache=metadata" does seem to help a bit, but it's probably
not worth the hassle for us. This will probably depend entirely on your
workload, though.
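
In case anyone wants to try the same thing, the cache device and the
metadata-only setting look roughly like this ("tank1" and ada0 are
placeholders):

  # add an SSD as L2ARC to an existing pool
  zpool add tank1 cache ada0
  # keep only metadata, not file data, in L2ARC for this dataset
  zfs set secondarycache=metadata tank1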

4) "zfs destroy" can be excruciatingly expensive on large datasets:
http://blog.delphix.com/matt/2012/07/11/performance-of-zfs-destroy/
It's a bit better now, but don't assume you can "zfs destroy" without
killing performance for everything else.
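
If the pool has the async_destroy feature enabled, the destroy runs in
the background and you can at least watch how much space is still being
reclaimed ("tank1" is a placeholder pool name):

  # bytes still to be freed by a backgrounded destroy
  zpool get freeing tank1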


If you have specific questions, I'm happy to help, but I think most of
the advice I can offer is going to be workload specific. If I had to do
it all over again, I'd probably break things down into many smaller
servers rather than trying to put as much as possible onto one.




