From: Rick Macklem <rmacklem@uoguelph.ca>
Date: Thu, 25 Aug 2011 21:24:28 -0400 (EDT)
To: Artem Belevich
Cc: freebsd-net@freebsd.org, Martin Birgmeier
Subject: Re: amd + NFS reconnect = ICMP storm + unkillable process.

Artem Belevich wrote:
> On Wed, Jul 6, 2011 at 4:50 AM, Martin Birgmeier wrote:
> > Hi Artem,
> >
> > I have exactly the same problem as you are describing below, also
> > with quite a number of amd mounts.
> >
> > In addition to the scenario you describe, another way this happens
> > here is when downloading a file via firefox to a directory currently
> > open in dolphin (the KDE file manager). This will almost surely
> > trigger the symptoms you describe.
> >
> > I've had 7.4 running on the box before; now with 8.2 this has
> > started to happen.
> >
> > Alas, I don't have a solution.
>
> I may be on to something. Here's what seems to be happening in my
> case:
>
> * A process that's in the middle of a syscall accessing an amd
> mountpoint gets interrupted.
> * If the syscall was restartable, the msleep at the beginning of the
> get_reply: loop in clnt_dg_call() would return ERESTART.
> * ERESTART will result in clnt_dg_call() returning with RPC_CANTRECV.
> * clnt_reconnect_call() then will try to reconnect, msleep will
> eventually fail with ERESTART in clnt_dg_call() again, and the whole
> thing will keep repeating for a while.

Btw, I fixed exactly the same issue for the TCP code (clnt_vc.c) in
r221127, so I wouldn't be surprised if the UDP code suffers the same
problem. I'll take a look at your patch tomorrow.

You could also try a TCP mount and see if the problem goes away. (For
TCP on a pre-r221127 system, the symptom would be a client thread
looping in the kernel in "R" state.)

I'll look tomorrow, but it sounds like you've figured it out.
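To spell out why the ERESTART case loops, here's a deliberately
boiled-down userland model. It is not the actual sys/rpc code:
fake_msleep(), dg_call() and reconnect_call() are made-up stand-ins for
msleep(), clnt_dg_call() and clnt_reconnect_call(), kept just to show
why mapping ERESTART to RPC_CANTRECV instead of RPC_INTR keeps the
retry loop spinning.

#include <errno.h>
#include <stdio.h>

#ifndef ERESTART                /* kernel-private on FreeBSD, so define it here */
#define ERESTART        (-1)
#endif

enum clnt_stat { RPC_SUCCESS, RPC_INTR, RPC_CANTRECV };

/* Pretend msleep(): the waiting thread was hit by a restartable signal. */
static int
fake_msleep(void)
{
        return (ERESTART);
}

/* Model of the UDP transport call: map the sleep error to an RPC status. */
static enum clnt_stat
dg_call(int erestart_is_intr)
{
        int error = fake_msleep();

        if (error == EINTR || (erestart_is_intr && error == ERESTART))
                return (RPC_INTR);      /* caller gives up instead of retrying */
        return (RPC_CANTRECV);          /* caller thinks the transport broke */
}

/* Model of clnt_reconnect_call(): keep retrying while receives "fail". */
static void
reconnect_call(const char *label, int erestart_is_intr, int cap)
{
        enum clnt_stat stat = RPC_CANTRECV;
        int tries = 0;

        while (tries < cap && stat == RPC_CANTRECV) {
                stat = dg_call(erestart_is_intr);
                tries++;
        }
        printf("%s: stopped after %d tries with %s\n", label, tries,
            stat == RPC_INTR ? "RPC_INTR" : "RPC_CANTRECV");
}

int
main(void)
{
        /* Old behaviour: every retry sees RPC_CANTRECV, so it spins up to the cap. */
        reconnect_call("unpatched", 0, 1000);
        /* With the patch: ERESTART is treated like EINTR and we stop right away. */
        reconnect_call("patched", 1, 1000);
        return (0);
}

The "patched" case is what your one-liner below gives clnt_dg.c, and it
matches what r221127 did for clnt_vc.c on the TCP side.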
Looks like a good catch to me at this point, rick

> I'm not familiar enough with the RPC code, but looking at clnt_vc.c
> and other RPC places, it appears that both EINTR and ERESTART should
> translate into an RPC_INTR error. However, in clnt_dg.c that's not
> the case, and that's what seems to make amd-mounted accesses hang.
>
> The following patch (against RELENG-8 @ r225118) seems to have fixed
> the issue for me. With the patch I no longer see the hangs or ICMP
> storms on the test case that could reliably reproduce the issue
> within minutes. Let me know if it helps in your case.
>
> --- a/sys/rpc/clnt_dg.c
> +++ b/sys/rpc/clnt_dg.c
> @@ -636,7 +636,7 @@ get_reply:
>       */
>      if (error != EWOULDBLOCK) {
>          errp->re_errno = error;
> -        if (error == EINTR)
> +        if (error == EINTR || error == ERESTART)
>              errp->re_status = stat = RPC_INTR;
>          else
>              errp->re_status = stat = RPC_CANTRECV;
>
> --Artem
>
> > We should probably file a PR, but I don't even know where to assign
> > it to. Amd does not seem much maintained; it's probably using some
> > old-style mounts (it never mounts anything via IPv6, for example).
> >
> > Regards,
> >
> > Martin
> >
> >> Hi,
> >>
> >> I wonder if someone else ran into this issue before and, maybe,
> >> has a solution.
> >>
> >> I've been running into a problem where access to filesystems
> >> mounted with amd wedges processes in an unkillable state and
> >> produces an ICMP storm on the loopback interface. I've managed to
> >> narrow it down to the NFS reconnect, but that's when I ran out of
> >> ideas.
> >>
> >> Usually the problem happens when I abort a parallel build job in
> >> an i386 jail on FreeBSD-8/amd64 (r223055). When the build job is
> >> killed, now and then I end up with one process consuming 100% of
> >> CPU time on one of the cores. At the same time I get a lot of
> >> messages on the console saying "Limiting icmp unreach response
> >> from 49837 to 200 packets/sec" and the loopback traffic goes way
> >> up.
> >>
> >> As far as I can tell, here's what's happening:
> >>
> >> * My setup uses a lot of filesystems mounted by amd.
> >> * amd itself pretends to be an NFS server running on localhost and
> >> serving requests for the amd mounts.
> >> * Now and then amd seems to change the ports it uses. Beats me why.
> >> * The problem seems to happen when some process is about to access
> >> an amd mountpoint just as the amd instance disappears from the
> >> port it used to listen on. In my case it does correlate with
> >> interrupted builds, but I have no clue why.
> >> * The NFS client detects the disconnect and tries to reconnect
> >> using the same destination port.
> >> * That generates an ICMP response because the port is unreachable,
> >> and the reconnect call returns almost immediately.
> >> * We try to reconnect again, and again, and again....
> >> * The process in this state is unkillable.
> >>
> >> Here's what the stack of the 'stuck' process looks like in those
> >> rare moments when it gets to sleep:
> >> 18779 100511 collect2 - mi_switch+0x176
> >> turnstile_wait+0x1cb _mtx_lock_sleep+0xe1 sleepq_catch_signals+0x386
> >> sleepq_timedwait_sig+0x19 _sleep+0x1b1 clnt_dg_call+0x7e6
> >> clnt_reconnect_call+0x12e nfs_request+0x212 nfs_getattr+0x2e4
> >> VOP_GETATTR_APV+0x44 nfs_bioread+0x42a VOP_READLINK_APV+0x4a
> >> namei+0x4f9 kern_statat_vnhook+0x92 kern_statat+0x15
> >> freebsd32_stat+0x2e syscallenter+0x23d
> >>
> >> * Usually some timeout expires in a few minutes, the process dies,
> >> the ICMP storm stops, and the system is usable again.
> >> * On occasion the process is stuck forever and I have to reboot
> >> the box.
> >>
> >> I'm not sure who's to blame here.
> >>
> >> Is the automounter at fault for disappearing from the port it was
> >> supposed to listen on?
> >> Is NFS guilty of blindly trying to reconnect on the same port and
> >> not giving up sooner?
> >> Should I flog the operator (a.k.a. myself) for misconfiguring
> >> something (what?) in amd or NFS?
> >>
> >> More importantly -- how do I fix it?
> >> Any suggestions on fixing/debugging this issue?
> >>
> >> --Artem
> >
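ps: The ICMP side of this is easy to see in isolation. The little
userland test below is purely illustrative (it has nothing to do with
the kernel RPC code, and the port number is just an arbitrary one I'm
assuming has no listener), but it does what the reconnecting client
effectively does: fire a UDP datagram at a dead localhost port. The
ICMP port unreachable that comes back over lo0 is reported on the
connected socket as ECONNREFUSED, which is why each reconnect attempt
fails almost immediately and the retry loop turns into the rate-limited
"icmp unreach response" storm quoted above.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        struct sockaddr_in sin;
        char buf[32];
        int s;

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
                perror("socket");
                return (1);
        }
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(45678);    /* arbitrary port assumed to be unused */
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        /* Connect so the kernel reports the ICMP error back on this socket. */
        if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
                perror("connect");
                return (1);
        }
        if (send(s, "ping", 4, 0) == -1)
                perror("send");
        sleep(1);       /* give the ICMP port unreachable time to come back */
        if (recv(s, buf, sizeof(buf), MSG_DONTWAIT) == -1)
                printf("recv: %s\n", strerror(errno));
        close(s);
        return (0);
}

On a default setup the recv() should report "Connection refused";
hammer the dead port fast enough and the console shows the same
"Limiting icmp unreach response ... to 200 packets/sec" messages, since
those ICMP replies are rate-limited.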