From: Pete French
Date: Thu, 21 Nov 2013 11:57:02 +0000
To: freebsd-stable@freebsd.org
Subject: Hast locking up under 9.2

I have had to (hopefully temporarily) disable hast on our systems, as under 9.2 I am finding that it locks up under high disc load. This has only started being a problem since we moved from 8-STABLE to 9-STABLE; there was no locking up before. I have a zpool on top of the hast devices. I did have two hast devices, but the problem still occurs with a single device.

The symptoms are that I see the "dirty" count on the master side stick at 2.0 megs and not change, the number of writes does not change, and if I issue a "sync" command at the command line it never returns. There is no disc activity on either the primary or the secondary side. If I leave it like this it will eventually freeze the whole machine, but usually if I see this happening I reboot the stuck machine.

This only happens under high levels of disc activity (in this case converting a MySQL table from MyISAM to InnoDB, which causes a few gig of copies). However, it is not simply high disc activity, as I can resilver the ZFS pool quite happily without problems. Frustratingly, I have a similar setup on a test pair of machines, but I cannot reproduce the problem there.

I don't have any useful debugging unfortunately, and I do realise that "it locks up" is unhelpful! The only things I see in the syslog are statements like this:

Nov 14 13:51:59 serpentine-active hastd[1258]: [serp1] (primary) Worker process killed (pid=1520, signal=6).
Nov 14 13:51:59 serpentine-passive hastd[14307]: [serp1] (secondary) Worker process exited ungracefully (pid=14638, exitcode=75).

That's about all the info I have. Currently I have taken hast out of the stack and am trying to cobble something together manually using iSCSI, but I would prefer to go back to hast if possible.

Has anyone seen anything similar, or have any suggestions?

thanks,

-pete.
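
For anyone collecting similar reports, the two syslog lines quoted above can be decoded mechanically. The following is only a minimal sketch: the regular expression is written against those two lines alone and is an assumption about the wider hastd log format, and the function name `decode_hastd_line` is made up for illustration. Signal 6 is SIGABRT on FreeBSD, and exit code 75 is EX_TEMPFAIL in sysexits.h; every other value is reported as unknown rather than guessed.

```python
import re

# Small lookup tables covering only the values seen in the quoted log lines.
SIGNAL_NAMES = {6: "SIGABRT"}        # signal 6 on FreeBSD
EXIT_NAMES = {75: "EX_TEMPFAIL"}     # from sysexits.h

# Pattern modelled on the two quoted lines; an assumption, not the
# full hastd log grammar.
LINE_RE = re.compile(
    r"hastd\[\d+\]: \[(?P<resource>[^\]]+)\] \((?P<role>primary|secondary)\) "
    r"Worker process (?:killed|exited ungracefully) "
    r"\(pid=(?P<pid>\d+), (?P<kind>signal|exitcode)=(?P<value>\d+)\)"
)


def decode_hastd_line(line):
    """Return a dict describing a worker-death log line, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    value = int(match.group("value"))
    table = SIGNAL_NAMES if match.group("kind") == "signal" else EXIT_NAMES
    return {
        "resource": match.group("resource"),
        "role": match.group("role"),
        "pid": int(match.group("pid")),
        "kind": match.group("kind"),
        "value": value,
        "meaning": table.get(value, "unknown"),
    }


if __name__ == "__main__":
    sample = (
        "Nov 14 13:51:59 serpentine-active hastd[1258]: [serp1] (primary) "
        "Worker process killed (pid=1520, signal=6)."
    )
    print(decode_hastd_line(sample))
```

Read this way, the primary worker was killed by an abort signal and the secondary worker exited with a temporary-failure code. That narrows where to look, but it does not by itself identify the cause of the lockup.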