From owner-freebsd-net@FreeBSD.ORG Thu Oct 9 21:31:35 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 2EFFFD9A; Thu, 9 Oct 2014 21:31:35 +0000 (UTC) Received: from mail-wg0-x22b.google.com (mail-wg0-x22b.google.com [IPv6:2a00:1450:400c:c00::22b]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 994A2E85; Thu, 9 Oct 2014 21:31:34 +0000 (UTC) Received: by mail-wg0-f43.google.com with SMTP id m15so2388074wgh.26 for ; Thu, 09 Oct 2014 14:31:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=VI5z8XPfoURwbE9Go+dXCUDzwx6YvyyXMQ33fdnpx8Q=; b=AD6WXMroqK895JrLhRkLPhayzplWznC1kxufDOIiYb9aK/kgAYZvi98assK6miLtox /0WhA/GOb2Uc7KAK8vPME4WkmsH71mSR+Qg6OhSdUAHH03MzdKzeddw95JE2fwR2IYQf xDjVJYh9CCcpffGj37ITTtyoA679lyJgozyfluOAhsGewC/yxnMJoQDlG612oqJP+MBP XQCqae5GH207wbUWFSpuUQgG7gJbu5p8NLB9Uh+Ev2RbcI9Tt7CBaTktoC/1OVHf6LvK kV5xBahPKlrm1+z5IxXXHkkhKs2s9zbu/aRFosg3WQDYEay7uH1PQFqGaOi3yznHDPWl pDBA== MIME-Version: 1.0 X-Received: by 10.180.79.41 with SMTP id g9mr479780wix.75.1412890292641; Thu, 09 Oct 2014 14:31:32 -0700 (PDT) Received: by 10.217.67.201 with HTTP; Thu, 9 Oct 2014 14:31:32 -0700 (PDT) In-Reply-To: <2077446.sYcZo9xEXb@ralph.baldwin.cx> References: <1410203348.1343.1.camel@bruno> <201410071428.15753.jhb@freebsd.org> <2077446.sYcZo9xEXb@ralph.baldwin.cx> Date: Thu, 9 Oct 2014 14:31:32 -0700 Message-ID: Subject: Re: ixgbe(4) spin lock held too long From: Jason Wolfe To: John Baldwin Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.18-1 Cc: Sean Bruno , freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 09 Oct 2014 21:31:35 -0000 On Wed, Oct 8, 2014 at 12:29 PM, John Baldwin wrote: > My only other thought is if a direct timeout routine ran for a long time. > > I just committed a change to current that can let you capture KTR traces of > callout routines for use with schedgraph (r272757). Unfortunately, > enabling KTR_SCHED can be slow. If you are up for trying it, I do think > that > building a kernel with KTR and KTR_SCHED enabled (KTR_COMPILE=KTR_SCHED and > KTR_MASK=KTR_SCHED) with the kernel part of the commit I referenced above > applied is the best bet to figure out why it is spinning so long. If you > can > try that (and if it doesn't slow things down too much) and get a panic with > those options enabled, then capture the output of > 'ktrdump -e /path/to/kernel -m /var/crash/vmcore.X -ct', we can use > Src/tools/sched/schedgraph.py to look at that output to get a picture of > what > was going on just prior to the crash. > > -- > John Baldwin > As luck would have it.. caught one of the boxes with the new KTR code/options after only an hour. Crashed in the same way w tid 100003 and looking the same in callout_process spin lock 0xffffffff81262d00 (callout) held by 0xfffff800151fe000 (tid 100003) too long spin lock 0xffffffff81262d00 (callout) held by 0xfffff800151fe000 (tid 100003) too long #4 0xffffffff8070d6fa in callout_process (now=7915682202423) at /usr/src/sys/kern/kern_ timeout.c:490 The ktrdump oddly only seems to have the last 702, though debug.ktr.entries is set to 1024. It appears the buffer may also start after 100003 had already hung? I've bumped debug.ktr.entries up in case we don't have enough history here. http://nitrology.com/ktrdump-spinlock.txt Jason