From owner-freebsd-stable@FreeBSD.ORG Fri Oct 31 01:31:36 2014
Date: Thu, 30 Oct 2014 21:31:34 -0400 (EDT)
From: Rick Macklem
To: Garrett Wollman
Cc: freebsd-fs@freebsd.org, rmacklem@freebsd.org, freebsd-stable@freebsd.org
Message-ID: <1902145956.2676513.1414719094052.JavaMail.root@uoguelph.ca>
In-Reply-To: <21586.48982.64913.250497@khavrinen.csail.mit.edu>
Subject: Re: Definite NFS bug
List-Id: Production branch of FreeBSD source code

Garrett Wollman wrote:
> Like many other users, I
> upgrade my FreeBSD servers by NFS-mounting /usr/src and /usr/obj from
> a shared build server.[1]  Since I upgraded the build server to 9.3,
> clients running 9.3 kernels have been randomly erroring out during
> installkernel and installworld.  Today I had some time to look more
> closely into this and found that the error is definitely coming from
> the server: at some point, it just randomly starts returning errors
> to client ACCESS and GETATTR operations.  The errors are a mix of
> NFS3ERR_IO and NFS3ERR_ACCES, but there is nothing on the server to
> indicate any kind of error, and restarting the operation on the
> client causes it to fail in a different place.  With enough patience
> and restarts, it's possible to complete the installation in just four
> or five passes.
>
> Needless to say this is a bit worrying.  Strangely, 9.1 and 9.2
> clients don't see this issue at all; it's only 9.3 clients that
> break.
>
> It's easy to reproduce, just
> 'cd /usr/src && find . -type f >/dev/null'.
> It does not seem to depend on the client NFS version (3 or 4) or
> implementation ("old" or "new").  I haven't tried the "old" server
> yet -- I'll need to figure out how to do that first.
>
Well, I took a quick look and, if I got it correct, there is a
single-line change in the "old" client between 9.2 and 9.3, which
defined an otherwise unused mount flag called NFSMNT_NONCONTIGWR.
(It is only used by the new client when "nocontigwr" is specified.)

However, there were some fairly extensive changes (mostly by mav@) to
the kernel rpc (sys/rpc), which is used by both clients and both
servers. Most of these changes were committed to stable/9 as r261057
and r261058.

If you could build a kernel from stable/9 just prior to r261057 and see
if that client runs into the problem, it could help determine if these
changes are causing the problem. Alternatively, running the 9.3 system
with a 9.2 sys/rpc (if it links/runs) could also help show whether the
kernel rpc is the culprit.
(You can load the kernel rpc as a module, but it's linked into most
kernels.)

If it doesn't turn out to be in the kernel rpc, my next guess would be
changes to the net device driver. (To check for this, you could maybe
use a different type of hardware device, or the 9.2 driver on the 9.3
system.)

The "new" client has some changes 9.2->9.3, but since nothing changed
for the "old" client and you see the problem with the "old" one, I
think the NFS client is not the culprit.

rick

> If anyone is willing to help debug this, I can share a packet trace,
> but I don't think it's very informative.  Also, if anyone has a good
> dtrace script that I could run on the server that would report what's
> going on when that first NFS3ERR_IO is returned, that would be great.
>
> -GAWollman
>
> [1] I'd run my own freebsd-update server but unfortunately it is too
> tied to building things that look like official FreeBSD security
> updates, and isn't really designed for (e.g.) updating kernels when
> we change a configuration option.  It also doesn't have any obvious
> knobs for building with anything other than a default
> {make,src}.conf.  And with a pkg-able base just around the corner I
> don't really want to put much effort into making freebsd-update do
> what I want.  NFS, on the other hand, is a big deal and so I need to
> track down and fix these bugs.
>
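
Editorial sketch (not from the thread): when comparing a test kernel
against 9.3 using the 'find . -type f' reproduction quoted above, it may
help to record which paths fail and with which errno, since the failure
is reported to move between runs. The helper below is only an
illustration; the name scan_tree is hypothetical, and the assumption
that each lstat() results in a GETATTR on the wire depends on the
client's attribute caching.

```python
import errno
import os


def scan_tree(root):
    """Walk `root` and lstat() every file name found.

    Returns a list of (path, errno_name) tuples, one per failure.
    Hypothetical helper for illustration; roughly mirrors what
    'find . -type f' touches, but keeps going after an error.
    """
    failures = []

    def note(path, exc):
        # Translate the numeric errno (e.g. 5) to its symbolic name (EIO).
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        failures.append((path, name))

    def on_walk_error(exc):
        # Called by os.walk when listing a directory fails.
        note(exc.filename, exc)

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            try:
                # Assumption: over NFS this normally needs the file's
                # attributes, unless the client serves them from cache.
                os.lstat(path)
            except OSError as exc:
                note(path, exc)
    return failures
```

Calling scan_tree("/usr/src") on an affected client and printing the
result on each pass would show whether the EIO/EACCES failures cluster
in particular directories or are spread at random across the tree.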