From owner-freebsd-stable@FreeBSD.ORG Fri Jun 4 04:52:24 2010
Delivered-To: freebsd-stable@freebsd.org
Mime-Version: 1.0 (Apple Message framework v1078)
Content-Type: text/plain; charset=us-ascii
From: Nikolay Denev
In-Reply-To: <20100604003502.GF13502@michelle.cdnetworks.com>
Date: Fri, 4 Jun 2010 07:52:19 +0300
References: <77DFF2E5-7A1E-4063-A852-2C7AD9BC3DD4@gmail.com>
 <201005240948.33555.jhb@freebsd.org>
 <20100524171210.GA1418@michelle.cdnetworks.com>
 <87BA8EDC-BE95-4C84-94CD-5CA12961708A@gmail.com>
 <20100604003502.GF13502@michelle.cdnetworks.com>
To: pyunyh@gmail.com
Cc: freebsd-stable@freebsd.org
Subject: Re: if_sge related panics
List-Id: Production branch of FreeBSD source code

On Jun 4, 2010, at 3:35 AM, Pyun YongHyeon wrote:

> On Thu, Jun 03, 2010 at 09:29:20AM +0300, Nikolay Denev wrote:
>> On May 24, 2010, at 8:12 PM, Pyun YongHyeon wrote:
>>
>>> On Mon, May 24, 2010 at 09:48:33AM -0400, John Baldwin wrote:
>>>> On Monday 24 May 2010 6:35:01 am Nikolay Denev wrote:
>>>>> On May 24, 2010, at 8:57 AM, Nikolay Denev wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Recently I started to experience a if_sge(4) related panic.
>>>>>> It happens almost every time I try to download a torrent file for example.
>>>>>> Copying of large files over NFS seem not to trigger it, but I haven't tested extensively.
>>>>>>
>>>>>> Here is the panic message :
>>>>>>
>>>>>> Fatal trap 12: page fault while in kernel mode
>>>>>> cpuid = 0; apic id = 00
>>>>>> fault virtual address   = 0x8
>>>>>> fault code              = supervisor write data, page not present
>>>>>> instruction pointer     = 0x20:0xffffffff80230413
>>>>>> stack pointer           = 0x28:0xffffff80001e9280
>>>>>> frame pointer           = 0x28:0xffffff80001e9510
>>>>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>>>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>>>>>> processor eflags        = interrupt enabled, resume, IOPL = 0
>>>>>> current process         = 12 (irq19: sge0)
>>>>>> trap number             = 12
>>>>>> panic: page fault
>>>>>> cpuid = 0
>>>>>> Uptime: 1d20h56m20s
>>>>>> Cannot dump. Device not defined or unavailable
>>>>>> Automatic reboot in 15 seconds - press a key on the console to abort
>>>>>> Sleeping thread (tid 100039, pid 12) owns a non-sleepable lock
>>>>>>
>>>>>> My swap is on a zvol, so I don't have dump. I'll try to attach a disk on the eSATA port and dump there if needed.
>>>>>
>>>>> Here is some info from the crashdump :
>>>>>
>>>>> (kgdb) #0  doadump () at pcpu.h:223
>>>>> #1  0xffffffff802fb149 in boot (howto=260)
>>>>>     at /usr/src/sys/kern/kern_shutdown.c:416
>>>>> #2  0xffffffff802fb57c in panic (fmt=0xffffffff8055d564 "%s")
>>>>>     at /usr/src/sys/kern/kern_shutdown.c:590
>>>>> #3  0xffffffff805055b8 in trap_fatal (frame=0xffffff000288a3e0, eva=Variable "eva" is not available.
>>>>> )
>>>>>     at /usr/src/sys/amd64/amd64/trap.c:777
>>>>> #4  0xffffffff805059dc in trap_pfault (frame=0xffffff80001e91d0, usermode=0)
>>>>>     at /usr/src/sys/amd64/amd64/trap.c:693
>>>>> #5  0xffffffff805061c5 in trap (frame=0xffffff80001e91d0)
>>>>>     at /usr/src/sys/amd64/amd64/trap.c:451
>>>>> #6  0xffffffff804eb977 in calltrap ()
>>>>>     at /usr/src/sys/amd64/amd64/exception.S:223
>>>>> #7  0xffffffff80230413 in sge_start_locked (ifp=0xffffff000270d800)
>>>>>     at /usr/src/sys/dev/sge/if_sge.c:1591
>>>>
>>>> Try this. sge_encap() can sometimes return an error with m_head set to NULL:
>>>>
>>>
>>> Thanks John. Committed in r208512.
>>>
>>>> Index: if_sge.c
>>>> ===================================================================
>>>> --- if_sge.c (revision 208375)
>>>> +++ if_sge.c (working copy)
>>>> @@ -1588,7 +1588,8 @@
>>>>          if (m_head == NULL)
>>>>              break;
>>>>          if (sge_encap(sc, &m_head)) {
>>>> -            IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
>>>> +            if (m_head != NULL)
>>>> +                IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
>>>>              ifp->if_drv_flags |= IFF_DRV_OACTIVE;
>>>>              break;
>>>>          }
>>>>
>>>> --
>>>> John Baldwin
>>
>> After the patch I experienced several network outages (ping reporting "no buffer space available")
>> that were resolved by ifconfig down/up of the sge(4) interface.
>>
>
> Because I don't have access to sge(4) controllers I never had chance
> to run it. Does ping(8) generates "no buffer space available" when
> the system is in idle state? Could you show me more information on
> how you checked network outages?
>

It happened 4-5 times recently. I didn't do extensive investigation, but
yes, ping returned "no buffer space avail" when I tried pinging from the
machine itself. It was unreachable from other hosts on the network.
I'm not sure what you mean by idle state, but there was a torrent client
running on the machine, which printed errors about inability to reach peers.

>> I can see that most of the other drivers that handle XXX_encap() returning m_head pointing NULL, break when this condition
>
> Yes, most drivers written/touched by me behaves like that.
>
>> is hit: i.e. :
>>
>> Index: if_sge.c
>> ===================================================================
>> --- if_sge.c (revision 208375)
>> +++ if_sge.c (working copy)
>> @@ -1588,7 +1588,8 @@
>>          if (m_head == NULL)
>>              break;
>>          if (sge_encap(sc, &m_head)) {
>> -            IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
>> +            if (m_head == NULL)
>> +                break;
>>              IFQ_DRV_PREPEND(&ifp->if_snd, m_head);
>>              ifp->if_drv_flags |= IFF_DRV_OACTIVE;
>>              break;
>>          }
>>
>> But here in sge(4) we always set IFF_DRV_OACTIVE.
>> Do you think this can be the source of the problem ?
>>
>
> More correct way to set IFF_DRV_OACTIVE would be check the number
> of queued frames or just exit the transmit loop. If there is no
> queued frames, IFF_DRV_OACTIVE would never be cleared which in turn
> cause ENOBUFS in ping(8). I think your change looks more reasonable
> to me. Do you still see the same issue with the change you suggested?

I've been running with this change for about a day now without any issues.

Thanks,
Niki
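
For anyone reading this in the archive, here is a rough user-space model of the failure mode discussed above. It is only a sketch, not driver code: struct txq, encap(), start_locked(), txeof() and wedges() are hypothetical stand-ins for the ifnet send queue, sge_encap(), sge_start_locked() and the TX completion path, and the queue length and ring size are made-up numbers. It assumes the busy flag is cleared only when a TX completion finds frames in flight, which is the behaviour described in the reply quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real FreeBSD definitions. */
struct pkt { int id; };

#define QLEN 8                  /* made-up software queue depth */

struct txq {
    struct pkt *slot[QLEN];     /* software send queue, head at index 0 */
    int len;
    bool oactive;               /* models IFF_DRV_OACTIVE */
    int ring_free;              /* free descriptors in the fake TX ring */
    int ring_used;              /* frames handed to the fake hardware */
};

/* Append to the queue; -1 models ENOBUFS when the queue is full. */
static int enqueue(struct txq *q, struct pkt *p)
{
    if (q->len == QLEN)
        return -1;
    q->slot[q->len++] = p;
    return 0;
}

/* Remove and return the head; callers check q->len > 0 first. */
static struct pkt *dequeue(struct txq *q)
{
    struct pkt *p = q->slot[0];
    for (int i = 1; i < q->len; i++)
        q->slot[i - 1] = q->slot[i];
    q->len--;
    return p;
}

/* Put a packet back at the head, as IFQ_DRV_PREPEND would. */
static void prepend(struct txq *q, struct pkt *p)
{
    assert(q->len < QLEN);
    for (int i = q->len; i > 0; i--)
        q->slot[i] = q->slot[i - 1];
    q->slot[0] = p;
    q->len++;
}

/* Models an encap routine that can fail after consuming the packet,
 * leaving *p set to NULL (fail_and_free), or fail because the ring is full. */
static int encap(struct txq *q, struct pkt **p, bool fail_and_free)
{
    if (fail_and_free) {
        *p = NULL;
        return -1;
    }
    if (q->ring_free == 0)
        return -1;              /* caller still owns the packet */
    q->ring_free--;
    q->ring_used++;
    return 0;
}

/* Transmit loop. break_on_null selects the second patch in the thread;
 * otherwise the flag is set on every encap failure. */
static void start_locked(struct txq *q, bool break_on_null, bool fail_and_free)
{
    if (q->oactive)
        return;
    while (q->len > 0) {
        struct pkt *p = dequeue(q);
        if (encap(q, &p, fail_and_free) != 0) {
            if (p == NULL) {
                if (break_on_null)
                    break;      /* packet is gone; leave the flag alone */
            } else {
                prepend(q, p);
            }
            q->oactive = true;
            break;
        }
    }
}

/* TX completion: only runs its cleanup if frames were actually in flight. */
static void txeof(struct txq *q)
{
    if (q->ring_used == 0)
        return;
    q->ring_free += q->ring_used;
    q->ring_used = 0;
    q->oactive = false;
}

/* Returns true if senders start seeing the ENOBUFS-like error. */
bool wedges(bool break_on_null)
{
    struct txq q = { .ring_free = 4 };
    struct pkt pkts[2 * QLEN] = {{0}};
    int enobufs = 0;

    enqueue(&q, &pkts[0]);
    start_locked(&q, break_on_null, true);  /* encap fails, packet freed */
    txeof(&q);                              /* nothing in flight */

    for (int i = 1; i < 2 * QLEN; i++) {
        if (enqueue(&q, &pkts[i]) != 0)
            enobufs++;
        start_locked(&q, break_on_null, false);
        txeof(&q);
    }
    return enobufs > 0;
}
```

In this model, wedges(false) leaves the flag set with nothing in flight, so the queue fills up and later sends fail, while wedges(true) keeps transmitting. That matches what I saw with the two patches, but the real driver may differ in details this sketch leaves out.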