From: John Nielsen
Date: Fri, 21 Feb 2014 10:15:15 -0700
Subject: Re: recovering from or increasing timeouts on virtio block device
To: Bryan Venteicher
Cc: "freebsd-stable@freebsd.org Stable"
List-Id: Production branch of FreeBSD source code

On Feb 18, 2014, at 10:14 AM, Bryan Venteicher wrote:

> On Tue, Feb 18, 2014
> at 10:57 AM, John Nielsen wrote:
>> On Feb 18, 2014, at 3:32 AM, Edward Tomasz Napierała wrote:
>>
>> > Message written by John Nielsen on 17 Feb 2014, at 21:21:
>> >> I run several FreeBSD virtual machines in a Linux KVM environment
>> >> with a SAN. The VMs use virtio block storage, and the KVM hosts
>> >> map the virtual volumes to targets on the SAN. Occasionally,
>> >> failover or other maintenance events on the SAN cause it to be
>> >> unavailable for 30+ seconds. When this happens, the FreeBSD VMs
>> >> have hard failures on the vtbd* devices, and thereafter any
>> >> attempted reads or writes return immediately with an error (even
>> >> after the SAN is responsive again). The only way to recover a VM
>> >> once that happens is to hard boot it.
>> >>
>> >> Is there any way to adjust the timeouts or enable some kind of
>> >> retry for the virtio block devices? It would be nice to be able
>> >> to recover gracefully after a SAN event without needing to reboot
>> >> the VMs.
>> >
>> > Use gmountver(8) perhaps?
>>
>> Thanks for the tip (and for writing it :), I haven't encountered that
>> one before. I will experiment with it but I'm not sure it's a fit for
>> this particular scenario (at least not by itself). When a SAN event
>> happens the virtual machine's vtbd0 device doesn't disappear, the
>> underlying hardware just fails to respond for a long-ish time. I
>> suspect that the driver gives up after either a certain length of
>> time or number of errors, but my C driver-fu isn't up to figuring it
>> out exactly. Once it gives up, any I/O requests to the (still
>> "present") device fail immediately, and I can't see a way to get the
>> driver to actually try any (new or old) I/O again.
>
> The vtbd driver has no internal retry mechanism, pays no attention
> to errors other than to report them, and never gives up :)
>
> It is not clear to me whether IO is getting turned around in FreeBSD
> before it reaches the driver, or within the host.
> Do you continue to see "hard error ..." messages on the console?

Thanks for chiming in. I was in too much of a hurry to get the VM
running again last time the issue appeared to capture any useful log
messages, and of course none of them were committed to disk so nothing
was available following a reboot.

I will see what I can get next time it happens and follow up on this
thread again.

JN