From owner-freebsd-stable@FreeBSD.ORG  Sun Feb  3 17:01:32 2013
Message-ID: <510E97DC.2010701@FreeBSD.org>
Date: Sun, 03 Feb 2013 19:01:16 +0200
From: Andriy Gapon
To: Hiroki Sato
Cc: kostikbel@gmail.com, alc@FreeBSD.org, stable@FreeBSD.org, rmacklem@uoguelph.ca
Subject: Re: NFS-exported ZFS instability
References: <1914428061.1617223.1357133079421.JavaMail.root@erie.cs.uoguelph.ca>
 <20130102174044.GB82219@kib.kiev.ua>
 <20130104.023244.472910818423317661.hrs@allbsd.org>
 <20130130.064459.2572086065267072.hrs@allbsd.org>
 <510850D1.3090700@FreeBSD.org>
In-Reply-To: <510850D1.3090700@FreeBSD.org>

on 30/01/2013 00:44 Andriy Gapon said the following:
> on 29/01/2013 23:44 Hiroki Sato said the following:
>>  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130.txt
>>  http://people.allbsd.org/~hrs/FreeBSD/pool-20130130-info.txt
> [snip]
> See tid 100153 (arc reclaim thread), tid 100105 (pagedaemon) and tid 100639
> (nfsd in kmem_back).

I decided to write a few more words about this issue.

I think that the root cause of the problem is that the ZFS ARC code performs
memory allocations with M_WAITOK while holding some ARC lock(s).  If a thread
runs into such an allocation while the system is very low on memory (even for
a very short period of time), then the thread is going to be blocked (to
sleep, in more exact terms) in VM_WAIT until a certain amount of memory is
freed; to be more precise, until v_free_count + v_cache_count goes above
v_free_min.
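To make the pattern concrete, here is a minimal sketch of what such an
allocation under a lock boils down to.  The names (sketch_lock, M_SKETCH,
sketch_get_buf) are made up for illustration; this is deliberately simplified
and is not the actual arc_get_data_buf() code:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>

static MALLOC_DEFINE(M_SKETCH, "sketch", "illustration only");

static struct mtx sketch_lock;		/* stands in for an ARC lock */
MTX_SYSINIT(sketch_lock, &sketch_lock, "sketch lock", MTX_DEF);

static void *
sketch_get_buf(size_t size)
{
	void *buf;

	mtx_lock(&sketch_lock);
	/*
	 * With M_WAITOK, malloc(9) may end up sleeping in VM_WAIT (via
	 * kmem_back(), as in the trace below) until enough pages are
	 * freed -- and while this thread sleeps, everybody else who
	 * needs sketch_lock queues up behind it.
	 */
	buf = malloc(size, M_SKETCH, M_WAITOK);
	/* ... manipulate the state that the lock protects ... */
	mtx_unlock(&sketch_lock);
	return (buf);
}

Once the thread goes to sleep on that malloc() line, the lock stays held for
as long as the memory shortage lasts.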
And quoting from the report:

db> show page
cnt.v_free_count: 8842
cnt.v_cache_count: 0
cnt.v_inactive_count: 0
cnt.v_active_count: 169
cnt.v_wire_count: 6081952
cnt.v_free_reserved: 7981
cnt.v_free_min: 38435
cnt.v_free_target: 161721
cnt.v_cache_min: 161721
cnt.v_inactive_target: 242581

In this case tid 100639 is the thread:

Tracing command nfsd pid 961 tid 100639 td 0xfffffe0027038920
sched_switch() at sched_switch+0x17a/frame 0xffffff86ca5c9c80
mi_switch() at mi_switch+0x1f8/frame 0xffffff86ca5c9cd0
sleepq_switch() at sleepq_switch+0x123/frame 0xffffff86ca5c9d00
sleepq_wait() at sleepq_wait+0x4d/frame 0xffffff86ca5c9d30
_sleep() at _sleep+0x3d4/frame 0xffffff86ca5c9dc0
kmem_back() at kmem_back+0x1a3/frame 0xffffff86ca5c9e50
kmem_malloc() at kmem_malloc+0x1f8/frame 0xffffff86ca5c9ea0
uma_large_malloc() at uma_large_malloc+0x4a/frame 0xffffff86ca5c9ee0
malloc() at malloc+0x14d/frame 0xffffff86ca5c9f20
arc_get_data_buf() at arc_get_data_buf+0x1f4/frame 0xffffff86ca5c9f60
arc_read_nolock() at arc_read_nolock+0x208/frame 0xffffff86ca5ca010
arc_read() at arc_read+0x93/frame 0xffffff86ca5ca090
dbuf_read() at dbuf_read+0x452/frame 0xffffff86ca5ca150
dmu_buf_hold_array_by_dnode() at dmu_buf_hold_array_by_dnode+0x16a/frame 0xffffff86ca5ca1e0
dmu_buf_hold_array() at dmu_buf_hold_array+0x67/frame 0xffffff86ca5ca240
dmu_read_uio() at dmu_read_uio+0x3f/frame 0xffffff86ca5ca2a0
zfs_freebsd_read() at zfs_freebsd_read+0x3e9/frame 0xffffff86ca5ca3b0
nfsvno_read() at nfsvno_read+0x2db/frame 0xffffff86ca5ca490
nfsrvd_read() at nfsrvd_read+0x3ff/frame 0xffffff86ca5ca710
nfsrvd_dorpc() at nfsrvd_dorpc+0xc9/frame 0xffffff86ca5ca910
nfssvc_program() at nfssvc_program+0x5da/frame 0xffffff86ca5caaa0
svc_run_internal() at svc_run_internal+0x5fb/frame 0xffffff86ca5cabd0
svc_thread_start() at svc_thread_start+0xb/frame 0xffffff86ca5cabe0

Sleeping in VM_WAIT while holding the ARC lock(s) means that other ARC
operations may get blocked.  And since pretty much all ZFS I/O goes through
the ARC, that is why we see all those stuck nfsd threads.

Another factor greatly contributing to the problem is that currently the page
daemon blocks (sleeps) in arc_lowmem (a vm_lowmem hook) waiting for the ARC
reclaim thread to make a pass.  This happens before the page daemon makes its
own pageout pass.  But because tid 100639 holds the ARC lock(s), the ARC
reclaim thread is blocked and cannot make any forward progress.  Thus the page
daemon is also blocked, and so it cannot free up any pages.

So this situation is not a true deadlock: it is theoretically possible that
some other threads would free some memory of their own accord and the
condition would clear up.  In practice, however, that is highly unlikely.

Some possible resolutions that I can think of:

1. The best one is probably doing ARC memory allocations without holding any
   locks.
2. Maybe we should also make a rule that no vm_lowmem hooks should sleep;
   that is, arc_lowmem should signal the ARC reclaim thread to do some work,
   but should not wait on it.
3. Perhaps we could also provide a mechanism to mark certain memory
   allocations as "special" and use that mechanism for ARC allocations, so
   that VM_WAIT unblocks sooner: in this case we had 8842 free pages (~35MB),
   but thread 100639 was not woken up.

I think that ideally we should do something about all three directions, but
even one of them might turn out to be sufficient.  As I've said, the first one
seems to be the most promising, but it would require some tricky programming
(flags and retries?) to move memory allocations out of locked sections.
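For what it's worth, a rough sketch of the "flags and retries" shape of the
first option, reusing the hypothetical sketch_lock and M_SKETCH names from the
fragment above (this is not a proposed patch):

/*
 * Sketch only: try a non-sleeping allocation under the lock and fall
 * back to dropping the lock for the sleeping allocation.
 */
static void *
sketch_get_buf_nosleep(size_t size)
{
	void *buf;

	mtx_lock(&sketch_lock);
	/* Never sleep in VM_WAIT while an ARC-like lock is held. */
	buf = malloc(size, M_SKETCH, M_NOWAIT);
	if (buf == NULL) {
		mtx_unlock(&sketch_lock);
		/* Safe to wait for free pages here: no lock is held. */
		buf = malloc(size, M_SKETCH, M_WAITOK);
		mtx_lock(&sketch_lock);
		/*
		 * The state protected by the lock may have changed while
		 * it was dropped; the real code would have to re-check its
		 * invariants here (and possibly free the buffer and retry),
		 * which is where the trickiness comes from.
		 */
	}
	/* ... manipulate the lock-protected state and hand out buf ... */
	mtx_unlock(&sketch_lock);
	return (buf);
}

The M_NOWAIT attempt keeps the common case cheap; all the complexity lands in
the slow path that has to re-validate after re-acquiring the lock.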
-- 
Andriy Gapon