From owner-freebsd-fs@FreeBSD.ORG Wed Mar 28 09:52:18 2012 Return-Path: Delivered-To: freebsd-fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 01A2B1065680 for ; Wed, 28 Mar 2012 09:52:18 +0000 (UTC) (envelope-from peter.maloney@brockmann-consult.de) Received: from moutng.kundenserver.de (moutng.kundenserver.de [212.227.17.9]) by mx1.freebsd.org (Postfix) with ESMTP id 7D6668FC08 for ; Wed, 28 Mar 2012 09:52:17 +0000 (UTC) Received: from [10.3.0.26] ([141.4.215.32]) by mrelayeu.kundenserver.de (node=mreu0) with ESMTP (Nemesis) id 0Lr6Rz-1ShCDy39pj-00efGj; Wed, 28 Mar 2012 11:52:09 +0200 Message-ID: <4F72DF48.6040108@brockmann-consult.de> Date: Wed, 28 Mar 2012 11:52:08 +0200 From: Peter Maloney User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2 MIME-Version: 1.0 To: freebsd-fs@freebsd.org Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:6mMhQujBIhb1pKponF+3Om1Wgn1EsVCdYSui6ekb2rX rDFbE2gZ176b9E33VZHBkT+kaRuEgy97bY3QROn0I/LO2JfcrF sYwjGzr0HaVQWPitvModSZIzUTh3Uz8mL7MGgcS1plcFaQN9ww gvviWoYyBzVHC2A0CMaY+DQHjar+CKfNRqetZTf+2RD2QocQnZ Jk6joMEp0Nix3ynV26UIw/1yMjeinodqc2P5c9BZMSY+tdZwyV PKOexegIbKKjQjEo5aXAp1HGhsUrvBqmIwwAe3Jya1erdGoHZr mYz22tQoR0Mx3bRn7kw+aOFCSVqBhjPiXgFmmatQhR+cwGoVqw FE4h2TuRsD5ztL4l3nwOksnjoGK0OXbIZfeXq4Pry Subject: Strange ZFS deadlock today X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 28 Mar 2012 09:52:18 -0000 I believe it is caused by ZFS, since the state looks like the code in this page about ARC: [url]http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/[/url] To cause/reveal this problem, first, I tried to run something with OpenJDK 6. In one window: [CODE]# ./CopyFromHd.sh load: 0.00 cmd: java 51239 [ucond] 44.56r 0.18u 0.06s 0% 19588k load: 0.00 cmd: java 51239 [ucond] 46.22r 0.18u 0.06s 0% 19588k load: 0.00 cmd: java 51239 [ucond] 51.14r 0.18u 0.06s 0% 19588k load: 0.00 cmd: java 51239 [ucond] 52.90r 0.18u 0.06s 0% 19588k ^C load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 58.35r 0.18u 0.06s 0% 19872k load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 61.73r 0.18u 0.06s 0% 19872k ^C load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 89.67r 0.18u 0.06s 0% 19872k load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 141.07r 0.18u 0.06s 0% 19872k ^C[/CODE] Then since it was hung, I thought maybe it was zfs' fault rather than OpenJDK, so tried with du. Second window: [CODE] # du -shx /data/archive2/2011/09/11/x 3.1G /data/archive2/2011/09/11/x # du -shx /data/archive2/2011/09/11 24G /data/archive2/2011/09/11 # du -shx /data/archive2/2011/ load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 13.75r 0.00u 0.00s 0% 1012k ^C^C^C^Z^C load: 0.00 cmd: du 72503 [buf_hash_table.ht_locks[i].ht_lock] 221.97r 0.00u 0.00s 0% 1012k [/CODE] Then I thought I could just run the simplest version of the program (which does pretty much no IO) Third window: [CODE]# ./CopyFromHd.sh --help ^C^C^C^C^C^C load: 0.00 cmd: java 52339 [suspended] 26.33r 0.15u 0.04s 0% 25644k load: 0.00 cmd: java 52339 [suspended] 27.38r 0.15u 0.04s 0% 25644k ^C load: 0.00 cmd: java 52339 [suspended] 285.23r 0.15u 0.04s 0% 25644k ^Z [1]+ Stopped ./CopyFromHd.sh # jobs -p 51988 # kill -9 51988 # jobs -p 51988 [1]+ Killed: 9 ./CopyFromHd.sh # jobs -p [/CODE] Back to the first window: [CODE]load: 0.00 cmd: java 51239 [buf_hash_table.ht_locks[i].ht_lock] 459.38r 0.18u 0.06s 0% 19872k ^C^Z [1]+ Stopped ./CopyFromHd.sh ... # jobs -p 51128 # kill -9 51128 # jobs -p 51128 [1]+ Interrupt: 2 ./CopyFromHd.sh ... # jobs -p # ps axl | grep java 0 51239 1 0 44 0 1264940 19904 - T 0 0:00.25 [java] 0 76933 77797 0 44 0 9124 1180 piperd S+ 0 0:00.00 grep java 0 52339 1 0 44 0 1266988 25676 - T 1 0:00.20 /usr/local/openjdk6 //bin/java -Xmx1024M -classpath ... [/CODE] The java above is the "--help" one from the 2nd window. So I guess the first one ended. But "dd" commands I ran are all stuck, and so is the 2nd java, and also "zdb" "zdb zroot" "zdb tank" and "zdb data" are all stuck. Also running "find" on either tank or data hangs. [CMD="#"]zpool iostat[/CMD] [CMD="#"]zpool status [-v][/CMD] and [CMD="#"]zfs list[/CMD] all run without hanging. Then I thought maybe I would disable primarycache to see what happens. [CODE]# zfs set primarycache=none data # zfs set primarycache=none tank load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 5.73r 0.00u 0.00s 0% 1636k (hang) ^Z^C load: 0.00 cmd: zfs 80750 [tx->tx_sync_done_cv)] 87.28r 0.00u 0.00s 0% 1636k [/CODE] Another window: [CODE]# zfs get primarycache data NAME PROPERTY VALUE SOURCE data primarycache none local # zfs get primarycache tank NAME PROPERTY VALUE SOURCE tank primarycache none local[/CODE] On a Linux server that nfs mounts /data on this FreeBSD server, [CMD="#"]df[/CMD] hangs at the point where the nfs mount should be listed. (So I have to reboot now rather than later) I have not run memtest on this machine. I have no l2arc. Uptime is 22 days. It is normally perfectly stable. I ran CopyFromHd to process over 20 TB of data so far (2 weeks ago), so it is not normal for this to happen. Other similar servers don't hang this way. (most hangs I've caused in the past are not in this "buf_hash_table.ht_locks[i].ht_lock" state. The last thing I did before this was running "zdb -S tank" yesterday. (tank is not the same pool as data) FreeBSD version is 8.2-STABLE csupped on 2012-02-04. And then on restart, the console showed: "some processes would not die; ps axl advised" and "shutdown terminated abnormally"