From: Garrett Wollman
To: freebsd-stable@freebsd.org, freebsd-fs@freebsd.org
Cc: rmacklem@freebsd.org
Subject: Definite NFS bug
Date: Thu, 30 Oct 2014 18:44:38 -0400
Message-ID: <21586.48982.64913.250497@khavrinen.csail.mit.edu>

Like many other users, I upgrade my FreeBSD servers by NFS-mounting /usr/src and /usr/obj from a shared build server.[1] Since I upgraded the build server to 9.3, clients running 9.3 kernels have been randomly erroring out during installkernel and installworld.

Today I had some time to look more closely into this and found that the error is definitely coming from the server: at some point, it just randomly starts returning errors to client ACCESS and GETATTR operations. The errors are a mix of NFS3ERR_IO and NFS3ERR_ACCES, but there is nothing on the server to indicate any kind of error, and restarting the operation on the client causes it to fail in a different place. With enough patience and restarts, it's possible to complete the installation in just four or five passes. Needless to say, this is a bit worrying.

Strangely, 9.1 and 9.2 clients don't see this issue at all; it's only 9.3 clients that break. It's easy to reproduce: just 'cd /usr/src && find . -type f >/dev/null'. It does not seem to depend on the client NFS version (3 or 4) or implementation ("old" or "new"). I haven't tried the "old" server yet -- I'll need to figure out how to do that first.

If anyone is willing to help debug this, I can share a packet trace, but I don't think it's very informative. Also, if anyone has a good dtrace script that I could run on the server that would report what's going on when that first NFS3ERR_IO is returned, that would be great.

-GAWollman

[1] I'd run my own freebsd-update server but unfortunately it is too tied to building things that look like official FreeBSD security updates, and isn't really designed for (e.g.) updating kernels when we change a configuration option. It also doesn't have any obvious knobs for building with anything other than a default {make,src}.conf. And with a pkg-able base just around the corner I don't really want to put much effort into making freebsd-update do what I want.
NFS, on the other hand, is a big deal and so I need to track down and fix these bugs.
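
For anyone who wants to see where a client run first goes wrong, the find-based reproduction can be roughly sketched in Python so that each failure is recorded instead of scrolling past. This is only an illustration, not part of the original report: the names scan_tree and summarize are made up here, and it assumes the client surfaces NFS3ERR_IO and NFS3ERR_ACCES as EIO and EACCES. Because of client-side attribute caching, not every lstat() call will necessarily go to the server.

```python
import errno
import os
import stat
from collections import Counter


def scan_tree(root):
    """Walk root roughly like `find . -type f`.

    Returns (regular_files_seen, failures), where failures is a list of
    (path, errno_name) for every directory listing or lstat() call that
    failed. scan_tree is a hypothetical helper, not an existing tool.
    """
    failures = []
    regular_files = 0

    def on_walk_error(exc):
        # os.walk calls this when it cannot list a directory.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        failures.append((exc.filename, name))

    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_walk_error):
        for entry in filenames:
            path = os.path.join(dirpath, entry)
            try:
                # Attribute lookup; on an NFS mount this may trigger GETATTR.
                st = os.lstat(path)
            except OSError as exc:
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                failures.append((path, name))
                continue
            if stat.S_ISREG(st.st_mode):
                regular_files += 1
    return regular_files, failures


def summarize(failures):
    """Count failures per errno name, e.g. how many EIO vs. EACCES."""
    return Counter(code for _path, code in failures)
```

Calling scan_tree("/usr/src") on an affected client and printing summarize() of the failures, together with the first few failing paths, shows whether the errors cluster in one part of the tree or move around between runs, as described above.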