From: Michael DeMan <freebsd@deman.com>
To: FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject: Re: RFC: Suggesting ZFS "best practices" in FreeBSD
Date: Tue, 22 Jan 2013 17:27:13 -0800
In-Reply-To: <314B600D-E8E6-4300-B60F-33D5FA5A39CF@sarenet.es>

I think this would be awesome. Googling around, it is extremely difficult to know what to do and which practices are current or obsolete, etc.

I would suggest maybe some separate sections so the information is organized well and can be easily maintained?

MAIN:
- recommend anybody using ZFS have a 64-bit processor and 8GB RAM.
- I don't know, but it seems to me that much of what would go in here is fairly well known now and probably not changing much?

ROOT ON ZFS:
- section just for this.

32-bit AND/OR TINY MEMORY:
- all the tuning needed for the people that aren't following the recommended 64-bit + 8GB RAM setup.
- probably there are enough of these people, even though it seems pretty obvious that in a couple more years nobody will have 32-bit or less than 8GB RAM?

A couple more things for subsections in topic MAIN - lots of stuff to go in there...

PARTITIONING:
I could be misinformed here, but my understanding is that best practice is to use gpart + gnop (rough sketch of the commands below) to:
#1. Ensure proper alignment for 4K-sector drives - the latest Western Digital drives still report 512-byte sectors.
#2. Ensure a little extra space is left on the drive, since if the whole drive is used, a replacement may be a tiny bit smaller and will not work.
#3. Label the disks so you know what is what.
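Something like this is what I have in mind - the device name, label name and partition size are just examples, and someone should double-check the exact flags before anything goes into an official document:

   # 4K-aligned GPT partition, slightly smaller than the disk, with a readable label
   gpart create -s gpt da0
   gpart add -t freebsd-zfs -a 4k -s 930G -l pool-raidz1-disk1 da0

   # gnop trick so the pool is created with ashift=12 even though the drive reports
   # 512-byte sectors (one .nop member per vdev should be enough, since ZFS uses the
   # largest sector size found in the vdev)
   gnop create -S 4096 /dev/gpt/pool-raidz1-disk1
   zpool create pool raidz1 /dev/gpt/pool-raidz1-disk1.nop /dev/gpt/pool-raidz1-disk2 /dev/gpt/pool-raidz1-disk3
   zpool export pool
   gnop destroy /dev/gpt/pool-raidz1-disk1.nop
   zpool import -d /dev/gpt pool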
MAPPING PHYSICAL DRIVES:
Particularly an issue with SATA drives - basically force the mapping so that if the system reboots with a drive missing (or you add drives) you know what is what.
- http://lists.freebsd.org/pipermail/freebsd-fs/2011-March/011039.html
- so you can put a label on the disk caddies, and when the system says 'diskXYZ' died you can just look at the label on the front of the box and change 'diskXYZ'.
- also, without this, if you reboot after adding disks or with a disk missing, all the adaXYZ numbering shifts :(

SPECIFIC TUNABLES:
- there are still a myriad of specific tunables that can be very helpful even with the 8GB+ setup.

ZFS GENERAL BEST PRACTICES:
- address the regular ZFS stuff here
- why the ZIL is a good thing even if you think it kills your NFS performance
- no vdevs > 8 disks, raidz1 best with 5 disks, raidz2 best with 6 disks, etc.
- striping over raidz1/raidz2 pools
- striping over mirrors
- etc...

On Jan 22, 2013, at 3:03 AM, Borja Marcos wrote:

> (Scott, I hope you don't mind being CC'd, I'm not sure you read the -fs mailing list, and this is a SCSI/FS issue)
>
> Hi :)
>
> Hope nobody will hate me too much, but ZFS usage under FreeBSD is still chaotic. We badly need a well proven "doctrine" in order to avoid problems. Especially, we need to avoid the braindead Linux HOWTO-esque crap of endless commands for which no rationale is offered at all, and which mix personal preferences and even misconceptions as "advice" (I saw one of those howtos which suggested disabling checksums "because they are useless").
>
> ZFS is a very different beast from other filesystems, and the setup can involve some non-obvious decisions. Worse, Windows-oriented server vendors insist on bundling servers with crappy RAID controllers which tend to make things worse.
>
> Since I've been using ZFS on FreeBSD (from the first versions) I have noticed several serious problems. I will try to explain some of them, and my suggestions for a solution. We should collect more use cases and issues and try to reach a consensus.
>
> 1- Dynamic disk naming -> We should use static naming (GPT labels, for instance)
>
> ZFS was born in a system with static device naming (Solaris). When you plug in a disk it gets a fixed name. As far as I know, at least from my experience with Sun boxes, c1t3d12 is always c1t3d12. FreeBSD's dynamic naming can be very problematic.
>
> For example, imagine that I have 16 disks, da0 to da15. One of them, say, da5, dies. When I reboot the machine, all the devices from da6 to da15 will be renamed one number down. Potential for trouble, as a minimum.
>
> After several different installations, I am preferring to rely on static naming. Doing it with some care can really help to make pools portable from one system to another. I create a GPT partition on each drive, and label it with a readable name. Thus, imagine I label each big partition (which takes the whole available space) as pool-vdev-disk, for example, pool-raidz1-disk1.
>
> When creating a pool, I use these names instead of dealing with device numbers. For example:
>
> % zpool status
>   pool: rpool
>  state: ONLINE
>   scan: scrub repaired 0 in 0h52m with 0 errors on Mon Jan  7 16:25:47 2013
> config:
>
>         NAME                 STATE     READ WRITE CKSUM
>         rpool                ONLINE       0     0     0
>           mirror-0           ONLINE       0     0     0
>             gpt/rpool-disk1  ONLINE       0     0     0
>             gpt/rpool-disk2  ONLINE       0     0     0
>         logs
>           gpt/zfs-log        ONLINE       0     0     0
>         cache
>           gpt/zfs-cache      ONLINE       0     0     0
>
> Using a unique name for each disk within your organization is important. That way, you can safely move the disks to a different server, which might also be using ZFS, and still be able to import the pool without name collisions. Of course you could use gptids, which, as far as I know, are unique, but they are difficult to use and in case of a disk failure it's not easy to determine which disk to replace.
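[For the handbook, the disk-replacement side of this could be shown with a short sketch like the one below - the label and device names are hypothetical:

   # which physical device is behind the label of the failed disk?
   glabel status | grep pool-raidz1-disk3

   # partition and label the replacement disk the same way, then:
   zpool replace pool gpt/pool-raidz1-disk3

Since the new partition carries the same GPT label, the single-argument form of "zpool replace" should be enough.]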
>
> 2- RAID cards.
>
> Simply: avoid them like the pest. ZFS is designed to operate on bare disks, and it does an amazingly good job. Any additional software layer you add on top will compromise it. I have had bad experiences with "mfi" and "aac" cards.
>
> There are two solutions adopted by RAID card users. Neither of them is good. The first and obvious one is to create a RAID5, taking advantage of the battery-backed cache (if present). It works, but it loses some of the advantages of ZFS. Moreover, trying different cards, I have been forced to reboot whole servers in order to do something trivial like replacing a failed disk. Yes, there are software tools to control some of the cards, but they are at the very least cumbersome and confusing.
>
> The second "solution" is to create a RAID0 volume for each disk (some RAID card manufacturers even dare to call it JBOD). I haven't seen a single instance of this working flawlessly. Again, a replaced disk can be a headache. At the very least, you have to deal with a cumbersome and complicated management program to replace a disk, and you often have to reboot the server.
>
> The biggest reason to avoid these stupid cards, anyway, is plain simple: those cards, at least the ones I have tried bundled by Dell as PERC(insert a random number here) or Sun, isolate the ASC/ASCQ sense codes from the filesystem. Pure crap.
>
> Years ago, fighting this issue, and when ZFS was still rather experimental, I asked for help and Scott Long sent me a "don't try this at home" simple patch, so that the disks become available to the CAM layer, bypassing the RAID card. He warned me of potential issues and lost sense codes, but, so far, so good. And indeed the sense codes are lost when a RAID card creates a volume, even in the misnamed "JBOD" configuration.
>
> http://www.mavetju.org/mail/view_message.php?list=freebsd-scsi&id=2634817&raw=yes
> http://comments.gmane.org/gmane.os.freebsd.devel.scsi/5679
>
> Anyway, even if there might be some issues due to command handling, the end-to-end verification performed by ZFS should ensure that, as a minimum, the data on the disks won't be corrupted and, in case corruption happens, it will be detected. I much prefer to have ZFS deal with it, instead of working on a sort of "virtual" disk implemented on the RAID card.
>
> Another *strong* reason to avoid those cards, even in "JBOD" configurations, is disk portability. The RAID card labels the disks. Moving one disk from one machine to another will result in a funny situation of confusing "import foreign config/ignore" messages when rebooting the destination server (mandatory in order to be able to access the transferred disk). Once again, additional complexity, useless layering and more reboots. That may be acceptable for Metoosoft crap, not for Unix systems.
>
> Summarizing: I would *strongly* recommend avoiding the RAID cards and getting proper host adapters without any fancy functionality instead. The one sold by Dell as the H200 seems to work very well. No need to create any JBOD or fancy thing at all. It will just expose the drives as normal SAS/SATA ones. A host adapter without fancy firmware is the best guarantee against failures caused by fancy firmware.
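[One easy sanity check for the handbook: with a plain HBA the disks should show up directly in CAM, not hidden behind RAID volume devices. Something like:

   % camcontrol devlist        # each disk listed as its own daN/adaN device
   % ls /dev/da* /dev/ada*     # ...rather than only /dev/mfid* or /dev/aacd* volumes

Device names above are only illustrative.]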
> But, in case that's not possible, I am still leaning towards the kludge of bypassing the RAID functionality, and even avoiding the JBOD/RAID0 thing by patching the driver. There is one issue, though. In case of a reboot, the RAID cards freeze, I am not sure why. Maybe that could be fixed; it happens on machines on which I am not using the RAID functionality at all. They should become "transparent", but they don't.
>
> Also, I think that the so-called JBOD thing would impair the correct operation of a ZFS health daemon doing things such as automatic replacement of failed disks with hot spares, etc. And there won't be a real ASC/ASCQ log message for diagnosis.
>
> (See the bottom to read about a problem I have just had with a "JBOD" configuration.)
>
> 3- Installation, boot, etc.
>
> Here I am not sure. Before zfsboot became available, I used to create a zfs-on-root system by doing, more or less, this:
>
> - Install the base system on a pendrive. After the installation, just /boot will be used from the pendrive, and /boot/loader.conf will point the loader at the ZFS root.
>
> - Create the ZFS pool.
>
> - Create and populate the root hierarchy. I used to create something like:
>
>   pool/root
>   pool/root/var
>   pool/root/usr
>   pool/root/tmp
>
> Why pool/root instead of simply "pool"? Because it's easier to understand, snapshot, send/receive, etc. Why a hierarchy? Because, if needed, it's possible to snapshot the whole "system" tree atomically.
>
> I also set the mountpoint of the "system" tree to legacy, and rely on /etc/fstab. Why? In order to avoid an accidental "auto mount" of critical filesystems in case, for example, I boot off a pendrive in order to tinker.
>
> For the last system I installed, I tried zfsboot instead of booting off the /boot directory of a FFS partition.
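[A sketch of that layout, with a hypothetical pool simply called "pool"; the loader.conf and fstab wiring is how I understand this kind of setup is usually done, so double check it before it goes into a best-practices document:

   zfs create pool/root
   zfs create pool/root/var
   zfs create pool/root/usr
   zfs create pool/root/tmp
   zfs set mountpoint=legacy pool/root     # children inherit legacy, nothing auto-mounts
   zpool set bootfs=pool/root pool         # only matters if booting via zfsboot/gptzfsboot

   # /boot/loader.conf (on the pendrive holding /boot)
   zfs_load="YES"
   vfs.root.mountfrom="zfs:pool/root"

   # /etc/fstab on the ZFS root
   pool/root/var   /var   zfs   rw   0   0
   pool/root/usr   /usr   zfs   rw   0   0
   pool/root/tmp   /tmp   zfs   rw   0   0
]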
>
> (*) An example of RAID/JBOD-induced crap and the problem of not using static naming follows.
>
> I am using a Sun server running FreeBSD. It has sixteen 160 GB SAS disks, and one of those cards I worship: this particular example is controlled by the aac driver.
>
> As I was going to tinker a lot, I decided to create a RAID-based mirror for the system, so that I can boot off it and have swap even with a failed disk, and use the other 14 disks as a pool with two raidz vdevs of 6 disks, leaving two disks as hot spares. Later I removed one of the hot spares and installed an SSD with two partitions to try and make it work as L2ARC and log. As I had gone for the JBOD pain, of course replacing that disk meant rebooting the server in order to do something as illogical as creating a "logical" volume on top of it. These cards just love to be rebooted.
>
>   pool: pool
>  state: ONLINE
>   scan: resilvered 7.79G in 0h33m with 0 errors on Tue Jan 22 10:25:10 2013
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         pool             ONLINE       0     0     0
>           raidz1-0       ONLINE       0     0     0
>             aacd1        ONLINE       0     0     0
>             aacd2        ONLINE       0     0     0
>             aacd3        ONLINE       0     0     0
>             aacd4        ONLINE       0     0     0
>             aacd5        ONLINE       0     0     0
>             aacd6        ONLINE       0     0     0
>           raidz1-1       ONLINE       0     0     0
>             aacd7        ONLINE       0     0     0
>             aacd8        ONLINE       0     0     0
>             aacd9        ONLINE       0     0     0
>             aacd10       ONLINE       0     0     0
>             aacd11       ONLINE       0     0     0
>             aacd12       ONLINE       0     0     0
>         logs
>           gpt/zfs-log    ONLINE       0     0     0
>         cache
>           gpt/zfs-cache  ONLINE       0     0     0
>         spares
>           aacd14         AVAIL
>
> errors: No known data errors
>
> The fun began when a disk failed. When it happened, I offlined it, and replaced it with the remaining hot spare. But something had changed, and the pool remained in this state:
>
> % zpool status
>   pool: pool
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>   scan: resilvered 192K in 0h0m with 0 errors on Wed Dec  5 08:31:57 2012
> config:
>
>         NAME                        STATE     READ WRITE CKSUM
>         pool                        DEGRADED     0     0     0
>           raidz1-0                  DEGRADED     0     0     0
>             spare-0                 DEGRADED     0     0     0
>               13277671892912019085  OFFLINE      0     0     0  was /dev/aacd1
>               aacd14                ONLINE       0     0     0
>             aacd2                   ONLINE       0     0     0
>             aacd3                   ONLINE       0     0     0
>             aacd4                   ONLINE       0     0     0
>             aacd5                   ONLINE       0     0     0
>             aacd6                   ONLINE       0     0     0
>           raidz1-1                  ONLINE       0     0     0
>             aacd7                   ONLINE       0     0     0
>             aacd8                   ONLINE       0     0     0
>             aacd9                   ONLINE       0     0     0
>             aacd10                  ONLINE       0     0     0
>             aacd11                  ONLINE       0     0     0
>             aacd12                  ONLINE       0     0     0
>         logs
>           gpt/zfs-log               ONLINE       0     0     0
>         cache
>           gpt/zfs-cache             ONLINE       0     0     0
>         spares
>           2388350688826453610       INUSE     was /dev/aacd14
>
> errors: No known data errors
> %
>
> ZFS was somewhat confused by the JBOD volumes, and it was impossible to get out of this situation. A reboot revealed that the card, apparently, had changed volume numbers. Thanks to the resiliency of ZFS, I didn't lose a single bit of data, but the situation seemed to be risky. Finally I could fix it by replacing the failed disk, rebooting the whole server, of course, and doing a zpool replace. But the card added some confusion, and I still don't know what the disk failure was. No trace of a meaningful error message.
>
> Best regards,
>
> Borja.
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"