From owner-freebsd-stable@FreeBSD.ORG Thu Mar 23 20:48:21 2006
Date: Thu, 23 Mar 2006 12:48:04 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200603232048.k2NKm4QL067644@apollo.backplane.com>
To: Mikhail Teterin
References: <200603211607.30372.mi+mx@aldan.algebra.com>
	<200603211717.34348.mi+mx@aldan.algebra.com>
	<200603212248.k2LMmTMj006791@apollo.backplane.com>
	<200603231403.36136.mi+mx@aldan.algebra.com>
Cc: alc@freebsd.org, stable@freebsd.org
Subject: Re: Reading via mmap stinks (Re: weird bugs with mmap-ing via NFS)

:Actually, I cannot agree here -- quite the opposite seems true. When
:running my compressor locally (no NFS involved) with the `-1' flag
:(fast, least effective compression), the program easily compresses
:faster than it can read.
:
:The Opteron CPU is about 50% idle, *and so is the disk*, producing only
:15Mb/s. I guess, despite the noise I raised on this subject a year ago,
:reading via mmap continues to ignore MADV_SEQUENTIAL and has no other
:adaptability.
:
:Unlike read, which uses buffering, mmap-reading still does not pre-fault
:the file's pieces in efficiently :-(
:
:Although the program was written to compress files that are _likely_
:still in memory, when used with regular files it exposes the lack of
:mmap optimization.
:
:This should be even more obvious if you time searching for a string in
:a large file using grep vs. 'grep --mmap'.
:
:Yours,
:
:	-mi
:
:http://aldan.algebra.com/~mi/mzip.c

Well, I don't know about FreeBSD, but both grep cases work just fine on
DragonFly. I can't test mzip.c because I don't see the compression
library you are calling (maybe that's a FreeBSD thing). The results of
the grep test ought to be similar for FreeBSD, since the heuristic used
by both OSes is the same. If they aren't, something might have gotten
nerfed accidentally in the FreeBSD tree.
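For reference, the access pattern at issue looks roughly like this -- a
generic sketch, not the actual mzip.c source (the checksum loop just
stands in for the compressor):

    /*
     * Minimal sketch of the mmap read pattern under discussion: map the
     * file, hint sequential access, touch every page in order. Generic
     * illustration only -- not the actual mzip.c source.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file\n", argv[0]);
            return 1;
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell the VM system the scan will be sequential. */
        if (madvise(base, st.st_size, MADV_SEQUENTIAL) < 0)
            perror("madvise");

        /*
         * Touch every byte in order. Each non-resident page costs a VM
         * fault, which is where read-ahead either kicks in or doesn't.
         */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += (unsigned char)base[i];
        printf("checksum %lu\n", sum);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }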
Here is the cache case test. mmap is clearly faster (though I would
again caution that this should not be an implicit assumption, since VM
fault overheads can rival read() overheads, depending on the
situation). The 'x1' file in all the tests below is simply
/usr/share/dict/words concatenated over and over again to produce a
large file.

crater# ls -la x1
-rw-r--r--  1 root  wheel  638228992 Mar 23 11:36 x1

[ machine has 1GB of ram ]

crater# time grep --mmap asdfasf x1
1.000u 0.117s 0:01.11 100.0% 10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.976u 0.132s 0:01.13 97.3% 10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.984u 0.140s 0:01.11 100.9% 10+41k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.601u 0.781s 0:01.40 98.5% 10+42k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.507u 0.867s 0:01.39 97.8% 10+40k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.562u 0.812s 0:01.43 95.8% 10+41k 0+0io 0pf+0w
crater# iostat 1

[ iostat 1 was run while grep was running, to verify that no I/O is
occurring once the data has been cached ]

The disk I/O case, which I can test by unmounting and remounting the
partition containing the file in question and then running grep, seems
to be well optimized on DragonFly. It should be similarly optimized on
FreeBSD, since the code that does this optimization is nearly the same.
In my test it is clear that the page-fault overhead in the uncached
case is greater than the copying overhead of a read(), though not by
much. And I would expect that, too.

test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.382u 0.351s 0:10.23 7.1% 55+141k 42+0io 4pf+0w
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.390u 0.367s 0:10.16 7.3% 48+123k 42+0io 0pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.539u 0.265s 0:10.53 7.5% 36+93k 42+0io 19518pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.617u 0.289s 0:10.47 8.5% 41+105k 42+0io 19518pf+0w
test28#

[ iostat 1 during the test showed ~60MBytes/sec for all four runs ]

Perhaps you should post the specifics of the test you are running, as
well as the specifics of the results you are getting -- the actual
timing output rather than a human interpretation of it. For that
matter, since this is an Opteron system, were you running the tests on
a UP or an SMP system? grep is single-threaded, so on a 2-cpu system it
will show 50% cpu utilization: one cpu saturated, the other idle.

With those specifics, a FreeBSD person can try to reproduce your
results. A grep vs grep --mmap test is pretty straightforward and
should be a good test of the VM read-ahead code, but there can always
be some unknown circumstance specific to a machine's configuration that
is the cause of the problem. Repeatability and reproducibility by third
parties are important when diagnosing any problem.
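If it helps, something like the following would do as a minimal,
self-contained harness (the 64K buffer size and program name are
arbitrary). It runs one mode per invocation, so the filesystem holding
the test file can be unmounted and remounted between runs to get the
uncached case, exactly as in the transcripts above:

    /*
     * Hypothetical harness: times a plain read() scan vs. an mmap scan
     * of the same file, summing bytes so the compiler can't optimize
     * the touch loop away.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void scan_read(const char *path)
    {
        static char buf[65536];
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); exit(1); }
        double t0 = now();
        unsigned long sum = 0;
        ssize_t n;
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            for (ssize_t i = 0; i < n; i++)
                sum += (unsigned char)buf[i];
        printf("read(): %.2fs (sum %lu)\n", now() - t0, sum);
        close(fd);
    }

    static void scan_mmap(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); exit(1); }
        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }
        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); exit(1); }
        double t0 = now();
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += (unsigned char)base[i];
        printf("mmap:   %.2fs (sum %lu)\n", now() - t0, sum);
        munmap(base, st.st_size);
        close(fd);
    }

    int main(int argc, char **argv)
    {
        if (argc == 3 && strcmp(argv[1], "read") == 0)
            scan_read(argv[2]);
        else if (argc == 3 && strcmp(argv[1], "mmap") == 0)
            scan_mmap(argv[2]);
        else {
            fprintf(stderr, "usage: %s read|mmap file\n", argv[0]);
            return 1;
        }
        return 0;
    }

Build with 'cc -O2 -o scan scan.c', then run './scan read /home/x1' and
'./scan mmap /home/x1', unmounting and remounting /home before each run
for the uncached numbers.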
Insofar as MADV_SEQUENTIAL goes, you shouldn't need it on FreeBSD.
Unless someone ripped it out since I committed it many years ago, which
I doubt, FreeBSD's VM heuristic will figure out that the accesses are
sequential and start issuing read-aheads. It should pre-fault, and it
should do read-ahead. That isn't to say there isn't a bug, just that
everyone interested in the problem has to be able to reproduce it and
help each other track down the source. Simply making an assumption and
an accusation about the cause of the problem doesn't solve it.

The VM system is rather fragile when it comes to read-ahead, because
the only way to do read-ahead on mapped memory is to issue the
read-ahead and then mark some prior (already cached) page as
inaccessible, in order to take a VM fault and issue the NEXT read-ahead
before the program exhausts the currently cached data. It is, in fact,
rather complex code, not as straightforward as you might expect.

But I can only caution you, again, against assuming that the operating
system should optimize your particular test case intuitively, the way a
human would. Operating systems generally optimize the most common
cases; it would be pretty dumb to try to make them optimize every
conceivable case. You would wind up with hundreds of thousands of lines
of barely exercised and likely buggy code.

						-Matt