Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 13 Nov 2013 23:06:23 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        svn-src-head@FreeBSD.org, John-Mark Gurney <jmg@funkthat.com>, src-committers@FreeBSD.org, svn-src-all@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>
Subject:   Re: svn commit: r258039 - in head/sys: kern vm
Message-ID:  <20131113215529.G1034@besplex.bde.org>
In-Reply-To: <20131113072823.GB59496@kib.kiev.ua>
References:  <201311120847.rAC8lwi8053235@svn.freebsd.org> <20131112215200.Y1480@besplex.bde.org> <20131113002906.GZ2279@funkthat.com> <20131113072823.GB59496@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 13 Nov 2013, Konstantin Belousov wrote:

> On Tue, Nov 12, 2013 at 04:29:06PM -0800, John-Mark Gurney wrote:
>> Bruce Evans wrote this message on Tue, Nov 12, 2013 at 22:13 +1100:
>>> On Tue, 12 Nov 2013, Konstantin Belousov wrote:
>>>
>>>> Log:
>>>> Avoid overflow for the page counts.
>>>>
>>>> Reported and tested by:	pho
>>>> Sponsored by:	The FreeBSD Foundation
>>>> MFC after:	1 week
>>>
>>>> Modified: head/sys/kern/vfs_vnops.c
>>>> ==============================================================================
>>>> --- head/sys/kern/vfs_vnops.c	Tue Nov 12 08:32:10 2013	(r258038)
>>>> +++ head/sys/kern/vfs_vnops.c	Tue Nov 12 08:47:58 2013	(r258039)
>>>> @@ -933,8 +933,9 @@ vn_io_fault(struct file *fp, struct uio
>>>> 	void *rl_cookie;
>>>> 	struct mount *mp;
>>>> 	vm_page_t *prev_td_ma;
>>>> -	int cnt, error, save, saveheld, prev_td_ma_cnt;
>>>> +	int error, save, saveheld, prev_td_ma_cnt;
>>>> 	vm_offset_t addr, end;
>>>> +	vm_size_t cnt;
>>>
>>> int was correct for a count.  You can't possibly have the 8TB
>>> of physical memory needed to overflow a 32-bit int page count.
>>                     ^ today
>>> It is reasonably to assume 32-bit ints.
>>
>> Except that the modern AMD64 arch now allows 52 bits of address for
>> phyiscal memory, which does mean in a few years, we will be eclipsing
>> 8TB in a single machine...
>>
>> Isn't someone running FreeBSD on a 1TB machine?  I'm pretty possitive
>> I remeber someone running 512GB, so only 3/4 doublings away from hitting
>> the limit...

8 TB at $10/GB costs $82K (but would cost a lot more for the quality needed
for that much).  Not very affordable for a few more than 3-4 doublings.

> The variable in question has no relation to the physical memory size.
> I already noted to Bruce, in the private reply, that the variable must
> hold the count of the page frames covering the i/o request.  As such,
> it must be sized to be able to hold the page count for the whole address
> space.
>
> The variable is clamped later, but this is an implementation detail.

An up-front clamp would work well:

% 	while (uio_clone->uio_resid != 0) {
% 		len = uio_clone->uio_iov->iov_len;
% 		if (len == 0) {
% 			KASSERT(uio_clone->uio_iovcnt >= 1,
% 			    ("iovcnt underflow"));
% 			uio_clone->uio_iov++;
% 			uio_clone->uio_iovcnt--;
% 			continue;
% 		}

Clamp len to (io_hold_cnt + 2) * PAGE_SIZE here.

% 		addr = (vm_offset_t)uio_clone->uio_iov->iov_base;
% 		end = round_page(addr + len);

There are still problems with this possibly overflowing...

% 		cnt = howmany(end - trunc_page(addr), PAGE_SIZE);

... and this.  Note that using unsigned types doesn't necessarily
avoid these problems, depending on the relative sizes of the types.
Unsigned types may be promoted to signed types, or vm_size_t might be
too small to hold a difference of addresses.  With the up-front clamp,
overflow is only possible if addr is so high that adding the clamped
len to it overflows.  addr being that high is only slightly more likely
than an unusual type combination that breaks the overflow checking.

These problems can be avoided by checking that end >= addr.  Then
end - trunc_page(addr) >= 0 and is <= (io_hold_cnt + 2 + 2) * PAGE_SIZE
by the clamp.  Equivalently, 0 <= howmany(...) <= io_hold_cnt + 2 + 2.

After rounding addr + len up and addr down, end - trunc_page(addr) is
an exact multiple of PAGE_SIZE, and it is a style bug to not use a
standard macro for converting sizes to page counts on it.  Alternatively,
howmany() is designed to be used with fractional objects, so you could
skip the round_page() on addr + len and use howmany() to round up.  This
avoids using the standard vm APIs more consistently.

% 		/*
% 		 * A perfectly misaligned address and length could cause
% 		 * both the start and the end of the chunk to use partial
% 		 * page.  +2 accounts for such a situation.
% 		 */
% 		if (cnt > io_hold_cnt + 2) {

After clamping and checking more carefully up front, this is unnecessary
unless you want to restrict from io_hold_count + 2 + 2 to io_hold_count.
I added the 2 up-front since I don't want to change the semantics, but
it probably doesn't matter.

% 			len = io_hold_cnt * PAGE_SIZE;
% 			KASSERT(howmany(round_page(addr + len) -
% 			    trunc_page(addr), PAGE_SIZE) <= io_hold_cnt + 2,
% 			    ("cnt overflow"));

The round_page() step is not needed here, as above.  'end' in the above
is only used once now, and is only needed for doing the careful wraparound
check above.  If we let howmany() do the rounding up, then we can't do
this wraparound check so clearly, and if we do a delayed check then we
don't need a variable for 'end', as here.

% 		}

Untested cleanups:

@ 	while (uio_clone->uio_resid != 0) {
@ 		len = uio_clone->uio_iov->iov_len;
@ 		if (len == 0) {
@ 			KASSERT(uio_clone->uio_iovcnt >= 1,
@ 			    ("iovcnt underflow"));
@ 			uio_clone->uio_iov++;
@ 			uio_clone->uio_iovcnt--;
@ 			continue;
@ 		}
@ 		if (len > io_hold_cnt * PAGE_SIZE)
@ 			len = io_hold_cnt * PAGE_SIZE;
@ 		addr = (uintptr_t)uio_clone->uio_iov->iov_base;
@ 		end = round_page(addr + len);
@ 		/* XXX shouldn't panic, but we did before: */
@ 		KASSERT(end >= addr, ("address wrap"));
@ 		cnt = atop(end - trunc_page(addr));
@ 		/*
@ 		 * A perfectly misaligned address and length could cause
@ 		 * both the start and the end of the chunk to use partial
@ 		 * page.  +2 accounts for such a situation.
@ 		 */

The comment is now about the +2 in the vm_fault_quick_hold_pages() call.

Other changes in this:
- fix cast for addr.  Only MD and central vm code should assume that
   vm_offset_t is precisely uintmax_t.
- fix some style bugs (extra blank line).

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20131113215529.G1034>