From: Rick Macklem
To: Martin Cracauer
Cc: freebsd-current@freebsd.org, Stefan Bethke
Date: Fri, 13 Jan 2012 11:46:04 -0500 (EST)
Subject: Re: Data corruption over NFS in -current
Message-ID: <443595541.203994.1326473164783.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <20120113143711.GA62486@cons.org>

Martin Cracauer wrote:
> More findings.
>
> Reminder, with the original report I found:
> - files for no reason changing ownership and group to
>   root/
> - data corruption as in inserting binary junk obviously from ports
> - data corruption as in malformed ascii text that might be a bug I
>   have in my code that is only exposed in FreeBSD
>
> I ran the script on a Linux machine in the same situation against the
> same NFS server, it worked fine. I haven't looked at blocksizes, NFS
> versions etc in play yet.
>
> I ran with oldnfs (reboot), which showed only the third problem.
>
> I re-ran with newnfs (reboot) which worked (all three problems absent).
>
Since this test worked, it suggests that problem #3 is not a bug in
your software, unless your runs aren't processing the same data.
However, a test using a local disk to confirm this would be nice.

> I then started building ports/land/gcc47 at the same time as I
> re-started my crazy script and it took only a few seconds for an
> unexpected ownership change to root to occur.
>
Well, from my experience, isolating a problem like this is much easier
if you can reproduce it reliably. I'd try this a few times and, if
doing ports/land/gcc47 concurrently reproduces the problem reliably,
then I'd use that for all the testing. (I'd suggest you re-do the above
tests doing ports/land/gcc47 concurrently with the script.)

Also, I'd run "systat -vmstat" or similar (others may have better
suggestions than "systat -vmstat"?) while running the tests, to see if
there might be a memory exhaustion issue. (Daniel mentioned he had seen
this, if I understood his post correctly. Maybe he can elaborate on how
he spotted the memory exhaustion?)

> My next steps are:
> - trying block sizes and other parameters, maybe use a different NFS
>   version with the Linux client. My NFS server is newly upgraded to
>   Linux kernel 3.1.5
or go back to the old version of the NFS server, if that is feasible.
Two changes (new Linux NFS server and new FreeBSD version) at about the
same time make it harder to point your finger at the problem.

> - running my script on a FreeBSD host with local disk to see whether
>   problem #3 is a general problem that appears or is exposed only on
>   FreeBSD
It might also be useful to run this FreeBSD host with local disk using
the NFS mount and having a swap partition on the disk. (Again, related
to what Daniel mentioned.)

> - capture tcpdump as mentioned earlier
>
If the combination of running the script and ports/land/gcc47 reproduces
the problem reliably, then doing a tcpdump should be straightforward.

Good luck with it. I'll admit I doubt this will be resolved quickly or
easily, but pursuing it as far as you can find the time to do so will be
appreciated by others who might run into the same problem.

rick

> I will probably have to turn debug off since this script run is
> dominated by system time now and gets 10x slower as it is now.
>
> Martin
> --
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Martin Cracauer http://www.cons.org/cracauer/
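
For anyone wanting to check the ownership and corruption symptoms in a
repeatable way between runs, something along these lines might help. It
is only a rough sketch, not what Martin's script does: it assumes you
have a known-good copy of the output tree (for example from a run on
local disk) and a second copy written over the NFS mount. The function
names (file_digest, snapshot, compare) are made up for illustration.

```python
import hashlib
import os
import stat


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(root):
    """Map each regular file under root (relative path) to (uid, gid, sha256)."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, and other non-regular files
            rel = os.path.relpath(full, root)
            result[rel] = (st.st_uid, st.st_gid, file_digest(full))
    return result


def compare(reference, candidate):
    """Return (ownership_changes, content_changes, missing) as sorted path lists."""
    owner, content, missing = [], [], []
    for rel, (uid, gid, digest) in sorted(reference.items()):
        if rel not in candidate:
            missing.append(rel)
            continue
        c_uid, c_gid, c_digest = candidate[rel]
        if (uid, gid) != (c_uid, c_gid):
            owner.append(rel)    # e.g. a file that unexpectedly became root-owned
        if digest != c_digest:
            content.append(rel)  # contents differ from the reference copy
    return owner, content, missing
```

Calling compare(snapshot(local_dir), snapshot(nfs_dir)) after each run
(both directory names are placeholders) gives three lists: files whose
owner or group differs, files whose contents differ, and files missing
from the NFS copy. That separates the ownership symptom from the
corruption symptoms without inspecting files by hand.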