From owner-freebsd-stable@FreeBSD.ORG Fri Mar 24 20:38:29 2006
From: Mikhail Teterin
Organization: Virtual Estates, Inc.
To: Bakul Shah
Date: Fri, 24 Mar 2006 15:18:00 -0500
User-Agent: KMail/1.8.3
References: <200603232352.k2NNqPS8018729@gate.bitblocks.com>
In-Reply-To: <200603232352.k2NNqPS8018729@gate.bitblocks.com>
MIME-Version: 1.0
Content-Type: Multipart/Mixed; boundary="Boundary-00=_5PFJETjmUpqqCYh"
Message-Id: <200603241518.01027.mi+mx@aldan.algebra.com>
Cc: alc@freebsd.org, Peter Jeremy, stable@freebsd.org
Subject: Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)
List-Id: Production branch of FreeBSD source code

--Boundary-00=_5PFJETjmUpqqCYh
Content-Type: text/plain; charset="koi8-u"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline

Matthew Dillon wrote:
> It is possible that the kernel believes the VM system to be too loaded
> to issue read-aheads, as a consequence of your blowing out of the system
> caches.

See the attached snapshot of `systat 1 -vm' -- it stays like that for most of the compression run, with only occasional flushes to the amrd0 device (the destination for the compressed output).

Bakul Shah followed up:
> May be the OS needs "reclaim-behind" for the sequential case?
> This way you can mmap many many pages and use a much smaller
> pool of physical pages to back them. The idea is for the VM
> to reclaim pages N-k..N-1 when page N is accessed and allow
> the same process to reuse this page.

Although it may be hard for the kernel to guess which pages it can reclaim efficiently in the general case, my call to madvise with MADV_SEQUENTIAL should have given it a strong hint.

It is for this reason that I very much prefer the mmap API to read/write (against Matt's repeated advice) -- mmap offers a way to advise the kernel, which read does not.
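For readers following along, here is a minimal sketch of the pattern under discussion: map a whole file read-only, hint sequential access with madvise, and touch every byte. It is not taken from mzip.c; the function name sum_file_mmap and the byte-summing loop are hypothetical stand-ins for real processing, and the _DEFAULT_SOURCE define is only an assumption for glibc builds with a strict -std= mode.

```c
#define _DEFAULT_SOURCE	/* glibc only: expose madvise() under strict -std= modes */

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from mzip.c): sum every byte of `path`
 * through a single read-only mapping.  Returns 0 on success, -1 on failure.
 */
int
sum_file_mmap(const char *path, uint64_t *sum)
{
	struct stat st;
	const unsigned char *p;
	void *map;
	size_t len, i;
	int fd;

	*sum = 0;
	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (-1);
	if (fstat(fd, &st) == -1) {
		close(fd);
		return (-1);
	}
	if (st.st_size == 0) {		/* mmap() rejects zero-length mappings */
		close(fd);
		return (0);
	}
	len = (size_t)st.st_size;
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);			/* the mapping remains valid after close() */
	if (map == MAP_FAILED)
		return (-1);

	/* Advisory only: the kernel is free to ignore it, so failure is not fatal. */
	if (madvise(map, len, MADV_SEQUENTIAL) == -1)
		perror("madvise");

	p = map;
	for (i = 0; i < len; i++)	/* page faults pull the data in on demand */
		*sum += p[i];

	return (munmap(map, len));
}
```

A caller would pass a path and a uint64_t to receive the total. Whether the hint actually triggers read-ahead or early page reclamation is up to the kernel, which is exactly the point being argued in this thread.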
Read also requires fairly large buffers in user space to be efficient -- *in addition* to the buffers in the kernel. Managing such buffers properly makes the program far messier _and_ more OS-dependent than one using the mmap interface has to be.

I totally agree with Matt that FreeBSD's (and probably DragonFly's too) mmap interface is better than others', but it seems to me there is plenty of room for improvement. Reading via mmap should never be slower than via read -- it should be just a notch faster, in fact...

I'm also quite certain that fulfilling my "demands" would add quite a bit of complexity to the mmap support in the kernel, but hey, that's what the kernel is there for :-)

Unlike grep, which seems to use only 32k buffers anyway (and does not use madvise -- see attachment), my program mmaps gigabytes of the input file at once, trusting the kernel to do a better job of reading the data in the most efficient manner :-)

Peter Jeremy wrote:
> On an amd64 system running about 6-week old -stable, both ['grep' and 'grep
> --mmap' -mi] behave pretty much identically.

Peter, I read grep's source -- it is not using madvise (because it hurts performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care to look at my program instead? Thanks:

	http://aldan.algebra.com/mzip.c

(link with -lz and -lbz2).

Matthew Dillon wrote:
[...]
> If the times for the mmap case do not blow up, we are back to square
> one and I would start investigating the disk driver that Mikhail is
> using.

On the machine where both mzip and the disk run at only 50%, the disk is a plain SATA drive (mzip's state goes from "RUN" to "vnread" and back).

Thanks, everyone!

	-mi

--Boundary-00=_5PFJETjmUpqqCYh
Content-Type: text/x-diff; charset="koi8-u"; name="grep.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="grep.diff"

Index: grep.c
===================================================================
RCS file: /home/ncvs/src/gnu/usr.bin/grep/grep.c,v
retrieving revision 1.31.2.1
diff -U2 -r1.31.2.1 grep.c
--- grep.c	26 Oct 2005 21:13:30 -0000	1.31.2.1
+++ grep.c	24 Mar 2006 19:52:05 -0000
@@ -427,9 +427,8 @@
 		    PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED,
 		    bufdesc, bufoffset)
-	      != (caddr_t) -1))
+	      != MAP_FAILED))
 	{
-	  /* Do not bother to use madvise with MADV_SEQUENTIAL or
-	     MADV_WILLNEED on the mmapped memory.  One might think it
-	     would help, but it slows us down about 30% on SunOS 4.1.  */
+	  if (madvise(readbuf, mmapsize, MADV_SEQUENTIAL))
+		warn("madvise");
 	  fillsize = mmapsize;
 	}
@@ -441,4 +440,6 @@
 	     other process has an advisory read lock on the file.
 	     There's no point alarming the user about this misfeature.  */
+	  if (mmapsize)
+		warn("mmap");
 	  bufmapped = 0;
 	  if (bufoffset != initial_bufoffset

--Boundary-00=_5PFJETjmUpqqCYh
Content-Type: text/plain; charset="koi8-u"; name="vmstat.txt"
Content-Transfer-Encoding: 8bit
Content-Disposition: attachment; filename="vmstat.txt"

18 users Load 0.46 0.53 0.60 24 Mar 15:15
Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER
Tot Share Tot Share Free in out in out
Act 1833864 5880 27758552 45268 92216 count 240
All 1881188 5992 1432466k 52864 pages 3413
Interrupts
Proc:r p d s w Csw Trp Sys Int Sof Flt cow 2252 total
1 2101 1605 2025 197 422 2 2018 251432 wire irq1: atkb
506156 act irq6: fdc0
3.0%Sys 0.0%Intr 45.2%User 0.0%Nice 51.9%Idl 1038216 inact irq15: ata
| | | | | | | | | | 89252 cache irq17: fwo
=>>>>>>>>>>>>>>>>>>>>>>> 2964 free irq20: nve
daefr irq21: ohc
Namei Name-cache Dir-cache prcfr 241 irq22: ehc
Calls hits % hits % 951 react 11 irq25: em0
pdwak irq29: amr
618 zfod pdpgs 2000 cpu0: time
Disks ad4 amrd0 ofod intrn
KB/t 56.79 0.00 %slo-z 200816 buf
tps 241 0 5143 tfree 8 dirtybuf
MB/s 13.38 0.00 100000 desiredvnodes
% busy 47 0 34717 numvnodes
24991 freevnodes

--Boundary-00=_5PFJETjmUpqqCYh--