From owner-freebsd-stable@FreeBSD.ORG  Sun Feb  3 17:01:32 2013
Message-ID: <510E97DC.2010701@FreeBSD.org>
Date: Sun, 03 Feb 2013 19:01:16 +0200
From: Andriy Gapon
To: Hiroki Sato
Cc: kostikbel@gmail.com, alc@FreeBSD.org, stable@FreeBSD.org, rmacklem@uoguelph.ca
Subject: Re: NFS-exported ZFS instability
References: <1914428061.1617223.1357133079421.JavaMail.root@erie.cs.uoguelph.ca>
 <20130102174044.GB82219@kib.kiev.ua>
 <20130104.023244.472910818423317661.hrs@allbsd.org>
 <20130130.064459.2572086065267072.hrs@allbsd.org>
 <510850D1.3090700@FreeBSD.org>
In-Reply-To: <510850D1.3090700@FreeBSD.org>

on 30/01/2013 00:44 Andriy Gapon said the following:
> on 29/01/2013 23:44 Hiroki Sato said the following:
>>  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
>>  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> [snip]
> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid 100639
> (nfsd in kmem_back).

I decided to write a few more words about this issue.

I think that the root cause of the problem is that the ZFS ARC code performs
memory allocations with M_WAITOK while holding some ARC lock(s).  If a thread
runs into such an allocation while the system is very low on memory (even for
a very short period of time), then the thread is going to be blocked (to
sleep, in more exact terms) in VM_WAIT until a certain amount of memory is
freed; to be more precise, until v_free_count + v_cache_count goes above
v_free_min.
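To make the pattern concrete, here is a minimal sketch of what such an
allocation under a lock boils down to.  The names (sketch_lock, M_SKETCH,
sketch_get_buf) are made up for illustration; this is deliberately simplified
and is not the actual arc_get_data_buf() code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>

static MALLOC_DEFINE(M_SKETCH, "sketch", "illustration only");

static struct mtx sketch_lock;		/* stands in for an ARC lock */
MTX_SYSINIT(sketch_lock, &sketch_lock, "sketch lock", MTX_DEF);

static void *
sketch_get_buf(size_t size)
{
	void *buf;

	mtx_lock(&sketch_lock);
	/*
	 * With M_WAITOK, malloc(9) may end up sleeping in VM_WAIT (via
	 * kmem_back(), as in the trace below) until enough pages are
	 * freed -- and while this thread sleeps, everybody else who
	 * needs sketch_lock queues up behind it.
	 */
	buf = malloc(size, M_SKETCH, M_WAITOK);
	/* ... manipulate the state that the lock protects ... */
	mtx_unlock(&sketch_lock);
	return (buf);
}

Once the thread goes to sleep on that malloc() line, the lock stays held for
as long as the memory shortage lasts.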
And quoting from the report:

db> show page
cnt.v_free_count: 8842
cnt.v_cache_count: 0
cnt.v_inactive_count: 0
cnt.v_active_count: 169
cnt.v_wire_count: 6081952
cnt.v_free_reserved: 7981
cnt.v_free_min: 38435
cnt.v_free_target: 161721
cnt.v_cache_min: 161721
cnt.v_inactive_target: 242581

In this case tid 100639 is the thread:

Tracing command nfsd pid 961 tid 100639 td 0xfffffe0027038920
sched_switch() at sched_switch+0x17a/frame 0xffffff86ca5c9c80
mi_switch() at mi_switch+0x1f8/frame 0xffffff86ca5c9cd0
sleepq_switch() at sleepq_switch+0x123/frame 0xffffff86ca5c9d00
sleepq_wait() at sleepq_wait+0x4d/frame 0xffffff86ca5c9d30
_sleep() at _sleep+0x3d4/frame 0xffffff86ca5c9dc0
kmem_back() at kmem_back+0x1a3/frame 0xffffff86ca5c9e50
kmem_malloc() at kmem_malloc+0x1f8/frame 0xffffff86ca5c9ea0
uma_large_malloc() at uma_large_malloc+0x4a/frame 0xffffff86ca5c9ee0
malloc() at malloc+0x14d/frame 0xffffff86ca5c9f20
arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xffffff86ca5c9f60
arc_read_nolock() at arc_read_nolock+0x208/frame 0xffffff86ca5ca010
arc_read() at arc_read+0x93/frame 0xffffff86ca5ca090
dbuf_read() at dbuf_read+0x452/frame 0xffffff86ca5ca150
dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x16a/frame 0xffffff86ca5ca1e0
dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame 0xffffff86ca5ca240
dmu_read_uio() at dmu_read_uio+0x3f/frame 0xffffff86ca5ca2a0
zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xffffff86ca5ca3b0
nfsvno_read() at nfsvno_read+0x2db/frame 0xffffff86ca5ca490
nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xffffff86ca5ca710
nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xffffff86ca5ca910
nfssvc_program() at nfssvc_program+0x5da/frame 0xffffff86ca5caaa0
svc_run_internal() at svc_run_internal+0x5fb/frame 0xffffff86ca5cabd0
svc_thread_start() at svc_thread_start+0xb/frame 0xffffff86ca5cabe0

Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
operations may get blocked.  And since pretty much all ZFS I/O goes through
the ARC, that is why we see all those stuck nfsd threads.

Another factor greatly contributing to the problem is that currently the page
daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook) waiting for the ARC
reclaim thread to make a pass.  This happens before the page daemon makes its
own pageout pass.  But because tid 100639 holds the ARC lock(s), the ARC
reclaim thread is blocked and cannot make any forward progress.  Thus the page
daemon is also blocked, and so it cannot free up any pages.

So this situation is not a true deadlock: it is theoretically possible that
some other threads would free some memory of their own accord and the
condition would clear up.  In practice, however, that is highly unlikely.

Some possible resolutions that I can think of:

1. The best one is probably doing ARC memory allocations without holding any
   locks.
2. Maybe we should also make a rule that no vm_lowmem hooks should sleep;
   that is, arc_lowmem should signal the ARC reclaim thread to do some work,
   but should not wait on it.
3. Perhaps we could also provide a mechanism to mark certain memory
   allocations as "special" and use that mechanism for ARC allocations, so
   that VM_WAIT unblocks sooner: in this case we had 8842 free pages (~35MB),
   but thread 100639 was not woken up.

I think that ideally we should do something about all three directions, but
even one of them might turn out to be sufficient.  As I've said, the first one
seems to be the most promising, but it would require some tricky programming
(flags and retries?) to move memory allocations out of locked sections.
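For what it's worth, a rough sketch of the "flags and retries" shape of the
first option, reusing the hypothetical sketch_lock and M_SKETCH names from the
fragment above (this is not a proposed patch):

/*
 * Sketch only: try a non-sleeping allocation under the lock and fall
 * back to dropping the lock for the sleeping allocation.
 */
static void *
sketch_get_buf_nosleep(size_t size)
{
	void *buf;

	mtx_lock(&sketch_lock);
	/* Never sleep in VM_WAIT while an ARC-like lock is held. */
	buf = malloc(size, M_SKETCH, M_NOWAIT);
	if (buf == NULL) {
		mtx_unlock(&sketch_lock);
		/* Safe to wait for free pages here: no lock is held. */
		buf = malloc(size, M_SKETCH, M_WAITOK);
		mtx_lock(&sketch_lock);
		/*
		 * The state protected by the lock may have changed while
		 * it was dropped; the real code would have to re-check its
		 * invariants here (and possibly free the buffer and retry),
		 * which is where the trickiness comes from.
		 */
	}
	/* ... manipulate the lock-protected state and hand out buf ... */
	mtx_unlock(&sketch_lock);
	return (buf);
}

The M_NOWAIT attempt keeps the common case cheap; all the complexity lands in
the slow path that has to re-validate after re-acquiring the lock.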
-- 
Andriy Gapon