From owner-freebsd-bugs  Mon Jul  3  7:20: 9 2000
Delivered-To: freebsd-bugs@freebsd.org
Received: from freefall.freebsd.org (freefall.FreeBSD.ORG [204.216.27.21])
	by hub.freebsd.org (Postfix) with ESMTP id A505D37B8FB
	for <freebsd-bugs@FreeBSD.org>; Mon,  3 Jul 2000 07:20:01 -0700 (PDT)
	(envelope-from gnats@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.9.3/8.9.2) id HAA91741;
	Mon, 3 Jul 2000 07:20:01 -0700 (PDT)
	(envelope-from gnats@FreeBSD.org)
Received: from plab.ku.dk (plab.ku.dk [130.225.105.65])
	by hub.freebsd.org (Postfix) with ESMTP id 30DB837B822
	for <FreeBSD-gnats-submit@freebsd.org>; Mon,  3 Jul 2000 07:16:33 -0700 (PDT)
	(envelope-from tobez@plab.ku.dk)
Received: from lion.plab.ku.dk (lion.plab.ku.dk [130.225.105.49])
	by plab.ku.dk (8.9.3/8.9.3) with ESMTP id QAA44028
	for <FreeBSD-gnats-submit@freebsd.org>; Mon, 3 Jul 2000 16:17:58 +0200 (CEST)
	(envelope-from tobez@plab.ku.dk)
Received: (from tobez@localhost)
	by lion.plab.ku.dk (8.9.3/8.9.3) id QAA93991;
	Mon, 3 Jul 2000 16:16:34 +0200 (CEST)
	(envelope-from tobez)
Message-Id: <200007031416.QAA93991@lion.plab.ku.dk>
Date: Mon, 3 Jul 2000 16:16:34 +0200 (CEST)
From: tobez@tobez.org
Reply-To: tobez@tobez.org
To: FreeBSD-gnats-submit@freebsd.org
X-Send-Pr-Version: 3.2
Subject: kern/19672: contigmalloc1() oddity for large alignments (race condition)
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


>Number:         19672
>Category:       kern
>Synopsis:       contigmalloc1() oddity for large alignments (race condition)
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jul 03 07:20:01 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Anton Berezin
>Release:        FreeBSD 5.0-CURRENT i386
>Organization:
tobez.org
>Environment:

Most versions of FreeBSD, as far as I can tell.
File:           src/sys/vm/vm_page.c
Function:       contigmalloc1()

>Description:

If an object is requested with a large alignment, say 1<<24, such that
contigmalloc1() cannot find even a single PQ_FREE or PQ_CACHE page with
that alignment, it proceeds to free inactive pages, one by one, and
then immediately active pages as well, also one by one.

The problem is that after freeing a page (in most cases the routine
pages it out; I inserted some sysctl counters to debug this), it
starts over, rescanning the same queue (either PQ_INACTIVE or
PQ_ACTIVE) from its head.

To me, this looks bad enough for the inactive queue, but for the active
queue it is a disaster unless the box is idle.  In a nutshell, the
following sequence is executed when contigmalloc1() tries to free a
page:

   vm_pageout_flush(page)  which calls
        vm_pager_put_pages(page)  which calls
                swap_pager_putpages(page)
                        which sleeps (swwrt).

When the box is not idle, other processes run while this one is blocked
in the swwrt state.  Their execution may add more inactive pages and
will almost certainly add more active pages, and then contigmalloc1()
starts scanning the queue from its head again!
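
The effect can be illustrated with a toy userland model.  This is a
sketch, not the kernel code: the function name, the `refill' parameter
(pages dirtied by other processes during each sleep) and the restart
cap are all assumptions made for illustration.

```c
#include <assert.h>

/*
 * Toy model, NOT the kernel code.  `dirty' pages sit on a queue.  Each
 * pass flushes one page (which would sleep in swwrt); while the scanner
 * sleeps, other processes dirty `refill' more pages.  After every flush
 * the scan restarts from the queue head.
 *
 * Returns the number of restarts needed to drain the queue, or -1 if
 * `max_restarts' is reached first (no forward progress).
 */
static long
scan_with_restart(long dirty, long refill, long max_restarts)
{
	long restarts = 0;

	assert(dirty >= 0 && refill >= 0 && max_restarts > 0);

	while (dirty > 0) {
		dirty--;		/* flush one page, then sleep */
		dirty += refill;	/* pages dirtied by others meanwhile */
		restarts++;		/* rescan from the queue head */
		if (restarts >= max_restarts)
			return (-1);
	}
	return (restarts);
}
```

With refill == 0 (an idle box) the loop ends after one restart per dirty
page.  With refill >= 1 (a busy box) the queue never drains and only the
cap stops the loop.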

>How-To-Repeat:

A program that issues the METEORSETGEO ioctl to the bktr driver, with a
relatively large number of frames (in my tests I used 14 frames ==
14*768*576*4/4096 == 6048 pages).  The bktr driver did not have
sufficient space preallocated.
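
The page count follows directly from the frame geometry.  A minimal
sketch of the arithmetic, assuming 4096-byte pages (i386) and 4 bytes
per pixel; the helper name and the PAGE_SIZE_BYTES macro are
hypothetical and do not come from the driver:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES	4096	/* assumed i386 page size */

/*
 * Pages needed to hold `frames' frames of cols x rows pixels at `bpp'
 * bytes per pixel, rounded up to a whole page.  Hypothetical helper,
 * not part of the bktr driver.
 */
static size_t
frame_buffer_pages(size_t frames, size_t cols, size_t rows, size_t bpp)
{
	size_t bytes;

	assert(frames > 0 && cols > 0 && rows > 0 && bpp > 0);

	bytes = frames * cols * rows * bpp;
	return ((bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES);
}
```

For the test case above the call would be
frame_buffer_pages(14, 768, 576, 4).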

For some reason, the bktr driver's get_bktr_mem() function
(dev/bktr/bktr_os.c) first tries vm_page_alloc_contig() with an
alignment of 1<<24 and then, if this fails, retries with PAGE_SIZE.

[As a side note, I have no idea why the bktr driver uses such a large
alignment.  Apparently, this piece of code was copied as is from the
meteor driver.]
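
To give a feel for how restrictive a 1<<24 alignment is, the sketch
below counts the physical addresses in a range at which an aligned
allocation could start.  It is only an illustration: the helper name is
hypothetical, and the 128 MiB range in the usage note is an assumed RAM
size, not a measurement.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES	4096u	/* assumed i386 page size */

/*
 * Count the addresses in [low, high) that are multiples of `alignment'.
 * `alignment' must be a power of two and at least one page.
 * Hypothetical helper for illustration only.
 */
static uint64_t
aligned_start_candidates(uint64_t low, uint64_t high, uint64_t alignment)
{
	uint64_t first;

	assert(alignment >= PAGE_SIZE_BYTES);
	assert((alignment & (alignment - 1)) == 0);
	assert(low <= high);

	/* Round `low' up to the next multiple of `alignment'. */
	first = (low + alignment - 1) & ~(alignment - 1);
	if (first >= high)
		return (0);
	return ((high - first + alignment - 1) / alignment);
}
```

Over an assumed 128 MiB range starting at 0, an alignment of 1<<24
leaves 8 possible starting pages, against 32768 for PAGE_SIZE.  Every
one of those 8 pages has to be free or cached for the first scan to
find a starting point.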

On a practically idle box the allocation fails after 4 to 8 seconds.
The number of jumps from the vm_pageout_flush() call point in the
inactive scan code to the PQ_INACTIVE rescan is about 110.  The number
of jumps from the vm_pageout_flush() call point in the active scan code
to the PQ_INACTIVE rescan is about 4400.

On a busy box (nice -20 perl -e 'for(;;){}') this takes forever, or at
least I was not patient enough to wait for completion.  The number of
jumps increases at a steady rate, and most of them come from the
`active' piece.  In top(1), I observed things like this (please pay
attention to the Ks and Ms here):

Mem: 348K Active, 180K Inact, 21M Wired, 38M Cache, 9899K Buf, 64M Free
Swap: 525M Total, 21M Used, 504M Free, 3% Inuse, 1552K Out

>Fix:

The first obvious thing to do is to remove the 1<<24 alignment
allocation from the bktr (and meteor) code.

This helps in my particular case.

However, I think that the internal workings of contigmalloc1() are
seriously broken for large alignments.  My understanding is that the
page freeing code is something of a last resort for the routine, and it
probably should not be used in this case.  The assumption
contigmalloc1() makes is that if the very first loop was not able to
find even the starting page, then there is a severe memory shortage or
something similar.  That is not necessarily so.

To me, the code simply `does not look right'.

And I have no idea what the proper fix might look like.

Cheers,
Anton.


>Release-Note:
>Audit-Trail:
>Unformatted:

