Date:      Mon, 06 Dec 2010 23:10:38 +0200
From:      Alexander Motin <mav@FreeBSD.org>
To:        John Baldwin <jhb@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Pawel Jakub Dawidek <pjd@freebsd.org>, Ivan Voras <ivoras@freebsd.org>
Subject:   Re: svn commit: r216230 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs
Message-ID:  <4CFD514E.8010103@FreeBSD.org>
In-Reply-To: <201012061518.49835.jhb@freebsd.org>
References:  <201012061218.oB6CI3oW032770@svn.freebsd.org> <AANLkTine9rGq_cM4ruFXYq=-F7cMXcQAr-zKHuWoQs2z@mail.gmail.com> <20101206195327.GD1936@garage.freebsd.pl> <201012061518.49835.jhb@freebsd.org>

On 06.12.2010 22:18, John Baldwin wrote:
> On Monday, December 06, 2010 2:53:27 pm Pawel Jakub Dawidek wrote:
>> On Mon, Dec 06, 2010 at 08:35:36PM +0100, Ivan Voras wrote:
>>> Please persuade me on technical grounds why ashift, a property
>>> intended for address alignment, should not be set in this way. If your
>>> answer is "I don't know but you are still wrong because I say so" I
>>> will respect it and back it out but only until I/we discuss the
>>> question with upstream ZFS developers.
>>
>> No. You persuade me why changing ashift in ZFS, which, as the comment
>> clearly states is "device's minimum transfer size" is better and not
>> hackish than presenting the disk with properly configured sector size.
>> This can not only affect disks that still use 512 bytes sectors, but
>> doesn't fix the problem at all. It just works around the problem in ZFS
>> when configured on top of raw disks.

Both the ATA and SCSI standards have added support for different logical
and physical sector sizes. It is not a hack - it seems to be the way
manufacturers have decided to go, at least according to what they say.
IMHO the hack in this situation would be to report to GEOM some fake
sector size, different from the one reported by the device. In any case,
the logical sector size is the main visible disk characteristic,
independent of what its firmware does internally.

>> What about other file systems? What about other GEOM classes? GELI is
>> great example here, as people use ZFS on top of GELI alot. GELI
>> integrity verification works in a way that not reporting disk sector
>> size properly will have huge negative performance impact. ZFS' ashift
>> won't change that.
>
> I am mostly on your side here, but I wonder if GELI shouldn't prefer the
> stripesize anyway?  For example, if you ran GELI on top of RAID-5 I imagine it
> would be far more performant for it to use stripe-size logical blocks instead
> of individual sectors for the underlying media.
>
> The RAID-5 argument also suggests that other filesystems should probably
> prefer stripe sizes to physical sector sizes when picking block sizes, etc.

Looking further, I can even see a use for several "stripesize" values
along those lines, unrelated to the logical sector size.

Let's take an example: 5 disks with 4K physical sectors in RAID5 with a
64K strip. We'll have three sizes to align to: 4K, 64K and 256K (four
data strips of 64K per stripe). Aligning to 4K avoids read-modify-write
at the disk level; aligning to 64K avoids request splitting and so
increases (up to doubling) parallel random read performance; aligning to
256K significantly increases write speed by avoiding read-modify-write
on RAID5.
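
To make the arithmetic explicit, here is a minimal sketch in Python. The
function name raid5_alignments and its parameters are made up for
illustration; they are not part of GEOM or any RAID driver.

```python
KIB = 1024


def raid5_alignments(disks, phys_sector, strip):
    """Return (sector, strip, full stripe) alignment sizes in bytes.

    Hypothetical helper: the names are illustrative only.
    """
    if disks < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    # One strip in every stripe holds parity, so only disks - 1 carry data.
    full_stripe = strip * (disks - 1)
    return phys_sector, strip, full_stripe


sizes = raid5_alignments(disks=5, phys_sector=4 * KIB, strip=64 * KIB)
print([s // KIB for s in sizes])  # [4, 64, 256]
```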

How can this be used? We can easily align the partition to the biggest
of them - 256K - to give any file system the best chance of aligning
properly. UFS allocates space and writes data at block granularity -
depending on the specific situation we may wish to increase the block
size to 64K, but that is quite a big value, so it depends. We can safely
increase the fragment size to 4K. We could also make the UFS read-ahead
and write-back code align I/Os at run-time to the reported block sizes.
Depending on the situation, both 64K and 256K could be reasonable
candidates for that. Sure, the solution is an engineering compromise
(not an absolute one) in each case, but IMHO a reasonable one.
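
As a rough illustration of both ideas (rounding a partition start up to
the largest alignment, and widening an I/O to whole blocks), consider
the Python sketch below. The helpers align_up and expand_to_blocks are
hypothetical names, and the sector-63 start is only an assumed example
of an unaligned legacy offset; none of this is actual UFS or GEOM code.

```python
KIB = 1024


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (bytes)."""
    return (offset + alignment - 1) // alignment * alignment


def expand_to_blocks(offset, length, block):
    """Widen the range [offset, offset + length) to whole blocks.

    Returns the new (offset, length), both multiples of block.
    """
    start = offset // block * block
    end = align_up(offset + length, block)
    return start, end - start


# Assumed example: a partition that would start at sector 63 (512-byte
# sectors) is moved up to the first 256K boundary.
print(align_up(63 * 512, 256 * KIB))  # 262144

# A 10K read at offset 70K becomes one whole 64K block starting at 64K.
print(expand_to_blocks(70 * KIB, 10 * KIB, 64 * KIB))  # (65536, 65536)
```

Whether widening a request like this pays off depends on the workload;
the sketch only shows the address arithmetic.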

The specific usage of these values (512, 4K, 64K and 256K) depends on
the abilities of the particular partitioning scheme and file system.
Neither the disk driver nor GEOM can know which will be most useful at
each higher level. 512 bytes (the logical sector size) is the only
critically important value in this situation; everything else is only
optimization.
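
The distinction between the one hard requirement and the optional hints
could be sketched like this in Python. check_request is a hypothetical
function, and the default hint values are simply the ones from the
example above; a real consumer would get them from whatever the lower
layer reports.

```python
def check_request(offset, length, logical=512,
                  hints=(4096, 65536, 262144)):
    """Validate a request and list the alignment hints it misses.

    A request that is not a multiple of the logical sector size is an
    error. Missing a hint is legal; it only costs performance.
    """
    if offset % logical or length % logical:
        raise ValueError("request not aligned to logical sector size")
    return [h for h in hints if offset % h or length % h]


print(check_request(0, 262144))   # []
print(check_request(4096, 4096))  # [65536, 262144]
print(check_request(512, 512))    # [4096, 65536, 262144]
```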

-- 
Alexander Motin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4CFD514E.8010103>