From: Garrett Cooper
To: Robert Watson
Date: Fri, 15 Oct 2010 12:39:11 -0700
Cc: FreeBSD
Current, freebsd-net@freebsd.org, Attilio Rao, Sergey Kandaurov, Jack F Vogel, Ryan Stone, Ed Maste
Subject: Re: [PATCH] Netdump for review and testing -- preliminary version

On Fri, Oct 15, 2010 at 12:25 PM, Robert Watson wrote:
> On Thu, 14 Oct 2010, Attilio Rao wrote:
>
>>> No, what I'm saying is: UMA needs to not call its drain handlers, and
>>> ideally not call into VM to fill slabs, from the dumping context. That's
>>> easy to implement and will cause the dump to fail rather than causing the
>>> system to hang.
>>
>> My point is, however, still the same: that should not happen just for the
>> netdump specific case but for all the dumping/KDB/panic cases (I know it is
>> unlikely current code !netdump calls into UMA but it is not an established
>> pre-requisite and may still happen that some added code does). I still see
>> this as a weakness on the infrastructure, independently from netdump. I can
>> see that your point is that it is vital to netdump correct behaviour though,
>> so I'd wonder if it is worth fixing it now or later.
>
> Quite a bit of our kernel and dumping infrastructure special-cases debugging
> and dumping behavior to avoid sources of non-robustness.  For example,
> serial drivers avoid locking, and for disk dumps we bypass GEOM to avoid the
> memory allocation, freeing, and threading that it depends on.
>
> The goal here is to be robust when handling dumps: hanging is worse than not
> dumping, since you won't get the dump either way, and if you don't reboot
> then the system requires manual intervention to recover.
> Examples of things that are critical to avoid include:
>
> - The dumping thread tripping over locks held by the panicked thread, or by
>   another now-suspended thread, leading to deadlock against a suspended
>   thread.
>
> - Corrupting dumps by increasing concurrency in the panic case.  We ran into
>   a case a year or two ago where changing VM state during the dump on amd64
>   caused file system corruption as the dump code assumed that the space
>   required for a dump didn't change while dumping took place.
>
> Any code dependency we add in the panic / KDB / dump path is one more risk
> that we don't successfully dump and reboot, so we need to minimize that
> code.

But there are already some cases in the ddb area dealing with dumping
that aren't handled properly today. Take for instance the following two
scenarios:

1. Call doadump twice from the debugger.
2. Call doadump, exit the debugger, reenter the debugger, and call
   doadump again.

Both of these scenarios hang reliably for me. I'm not saying that we
should regress things further, but I'm just noting that there is most
likely a chunk of edge cases that aren't being handled properly when
doing dumps and that could be handled better / fixed.

Thanks,
-Garrett