Date:      Fri, 29 Jan 2016 16:10:16 -0500
From:      Paul Kraus <paul@kraus-haus.org>
To:        Graham Allan <allan@physics.umn.edu>, FreeBSD Filesystems <freebsd-fs@freebsd.org>
Subject:   Re: quantifying zpool performance with number of vdevs
Message-ID:  <7E3F58C9-94ED-4491-A0FD-7AAB413F2E03@kraus-haus.org>
In-Reply-To: <56ABAA18.90102@physics.umn.edu>
References:  <56ABAA18.90102@physics.umn.edu>

On Jan 29, 2016, at 13:06, Graham Allan <allan@physics.umn.edu> wrote:

> In many of the storage systems I've built to date I was slightly
> conservative (?) in wanting to keep any one pool confined to a single
> JBOD chassis. In doing this I've generally been using the Supermicro
> 45-drive chassis with pools made of 4x (8+2) raidz2, other slots being
> kept for spares, ZIL and L2ARC.

> Obviously theory says that IOPS should scale with the number of vdevs,
> but it would be nice to try to quantify it.
>
> Getting relevant data out of iperf seems problematic on machines with
> 128GB+ RAM - it's hard to blow out the ARC.
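
For reference, a pool of the shape you describe could be created roughly
along these lines (tank and da0-da43 are placeholder pool/device names,
not your actual layout):

    # four 10-disk (8+2) raidz2 vdevs, two hot spares, SLOG and L2ARC
    zpool create tank \
        raidz2 da0  da1  da2  da3  da4  da5  da6  da7  da8  da9  \
        raidz2 da10 da11 da12 da13 da14 da15 da16 da17 da18 da19 \
        raidz2 da20 da21 da22 da23 da24 da25 da26 da27 da28 da29 \
        raidz2 da30 da31 da32 da33 da34 da35 da36 da37 da38 da39 \
        spare  da40 da41 \
        log    da42 \
        cache  da43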

In a previous life, where I was responsible for over 200 TB of storage
(in 2008, back when that was a lot), I did some testing for both
reliability and performance before committing to a configuration for
our new storage system. It was not FreeBSD but Solaris, and we had 5 x
J4400 chassis (each with 24 drives), all dual SAS attached across four
HBA ports.

This link
https://docs.google.com/spreadsheets/d/13sLzYKkmyi-ceuIlUS2q0oxcmRnTE-BRvBYHmEJteAY/edit?usp=sharing
has some of the performance testing I did. I did not look at Sequential
Read as that was not in our workload; in hindsight I should have. By
limiting the ARC, the entire ARC, to 4 GB I was able to get reasonably
accurate results. The number of vdevs made very little difference to
Sequential Writes, but Random Reads and Writes scaled almost linearly
with the number of top-level vdevs.
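
On FreeBSD the equivalent knob is the vfs.zfs.arc_max tunable; a minimal
sketch of capping the ARC for a benchmark run (the 4 GB value is just an
example, size it for your test):

    # /boot/loader.conf - cap the ARC at 4 GB, takes effect at next boot
    vfs.zfs.arc_max="4294967296"

    # on newer releases the same value may also be settable at runtime
    sysctl vfs.zfs.arc_max=4294967296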

Our eventual config was RAIDz2-based because we could not meet the space
requirements with mirrors, especially as we would have had to go with
3-way mirrors to get the same MTTDL as with RAIDz2. The production pool
consisted of 22 top-level vdevs, each a 5-drive RAIDz2 with each drive
in a different disk chassis. So all of the drives in slots 0 and 1 were
hot spares, all of the drives in slot 2 made up one vdev, all of the
drives in slot 3 made up the next, and so on. We were striping data
across 22 vdevs. During pre-production testing we completely lost
connectivity to 2 of the 5 disk chassis with no loss of data or
availability. When those chassis came back, they resilvered and went
along their merry way (just as they should).
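
To make that layout concrete, it amounts to something like this (pool01
and the c#t#d0 device names are illustrative only; just the first two
data vdevs are shown, the remaining twenty followed the same pattern up
through slot 23):

    # slot 2 of each of the five chassis forms one raidz2 vdev,
    # slot 3 forms the next, and so on; slots 0 and 1 are hot spares
    zpool create pool01 \
        raidz2 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0 \
        raidz2 c1t3d0 c2t3d0 c3t3d0 c4t3d0 c5t3d0 \
        spare  c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
               c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0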

Once the system went live we took hourly snapshots and replicated them
both locally and remotely for backup purposes. We estimated that it
would have taken over 3 weeks to restore all the data from tape if we
had to, and that was unacceptable. The only issue we ran into was
related to resilvering after a drive failure: due to the large number
of snapshots and the ongoing snapshot creation, a resilver could take
over a week.
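
A minimal sketch of that kind of hourly snapshot-and-replicate cycle
using ZFS incremental send/receive (dataset names, snapshot names and
the remote host are all placeholders):

    # take the new hourly snapshot
    zfs snapshot -r pool01/data@hour-16

    # ship only the delta since the previous hour to the backup host
    zfs send -R -i pool01/data@hour-15 pool01/data@hour-16 | \
        ssh backuphost zfs receive -d backup01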

--
Paul Kraus
paul@kraus-haus.org



