Date:      Fri, 06 Jul 2012 23:02:14 +0100
From:      Vincent Hoffman <vince@unsane.co.uk>
To:        freebsd-stable@freebsd.org
Subject:   Re: nfs-bug when server for 9-Stable becomes client as well ?
Message-ID:  <4FF76066.1000401@unsane.co.uk>
In-Reply-To: <wp8vewsrqn.fsf@heho.snv.jussieu.fr>
References:  <wpy5mxxc1f.fsf@heho.snv.jussieu.fr> <4FF7055D.9000507@unsane.co.uk> <wp8vewsrqn.fsf@heho.snv.jussieu.fr>

On 06/07/2012 18:51, Arno J. Klaassen wrote:
> Vincent Hoffman <vince@unsane.co.uk> writes:
>
>> On 06/07/2012 14:19, Arno J. Klaassen wrote:
>>> Hello,
>>>
>>> looks like I discovered a probable bug in the nfs-code, very
>>> easy to reproduce in my setup :
>>>
>>>
>>>    Machine-1 : Today's 9-stable, exporting /files (ufs) and /z2 (zfs)
>>>
>>>    Machine-2 : 8-stable as of April the 10th exporting /raid1
>>>
>>> On Machine-1 I mount /raid1 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
>>> and start a script on this mount looping something like :
>>>
>>>   dd if=/dev/random of=BIG bs=1048576 count=${SIZE}
>>>   cp -fp BIG BIG2
>>>   cmp -x BIG BIG2
>>>
>>> I let this run for 24 hours (from time to time stressing Machine-1 with
>>> other scripts, including provoking heavy swapping), no problem at all.
>>>
>>> However, then I mount /z2 (rw,nfsv3,intr,tcp,rsize=32768,wsize=32768)
>>> on Machine-2, and *immediately* the above loop on Machine-1 fails :
>>>
>>>   Copying file ...cp: BIG: Permission denied
>>>
>>> No console messages this time, last time I got 
>>>
>>>   kernel: nfs_getpages: error 13
>>>   kernel: vm_fault: pager read error, pid 87803 (cmp)
>>>
>>> on Machine-1.
>>>
>>> I repeated this scenario by replacing Machine-2 with a good old
>>> 6-4-stable one, same outcome.
>>>
>>> Please tell me what I could do to nail this down a bit more.
>> It's possible (although not definite) that you have hit a mountd bug
>> as documented in PRs
>>
>> kern/131342
>> kern/136865
> especially kern/131342 looks similar and quite old; funny I never hit
> this before, I basically do the same tests since 'ages' on each new box.
> Could be that faster network/cpu reveals some race condition; I notice
> as well that this server is the first (IIRC) that uses 3 different IRQs
> for network interrupts (em(4) Intel(R) PRO/1000).
Certainly possible and seems reasonable enough.
>
>> I've recently asked on -CURRENT about this and had a patch to try from
>> Rick. I'm testing it now but it doesn't seem to fix it for me, just
>> improve it, although I'm trying to get enough runs to be a valid sample.
>> (see
>> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=377627+0+archive/2012/freebsd-current/20120701.freebsd-current
>> )
>>
>> What I did for my production nas was edit mount.c so it didn't send a
>> SIGHUP to mountd as suggested by Rick, as it was easy to do and non
>> intrusive.
> hmm, this means I should patch each fbsd-client, no? Maybe easier to
> patch mountd to ignore SIGHUP and use some non-standard signal to force
> re-init?
No, just patch /sbin/mount on the NFS server so it doesn't send the SIGHUP
to mountd.
You can manually HUP mountd if needed.
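
For reference, a minimal sketch of sending that HUP by hand, assuming
mountd writes its pid to /var/run/mountd.pid (the path listed in
mountd(8)). The function name hup_mountd and its optional pid-file
argument are hypothetical conveniences, not part of any FreeBSD tool:

```shell
#!/bin/sh
# Send SIGHUP to a running mountd so it re-reads its exports.
# The default pid file path is taken from mountd(8); the optional
# argument is a hypothetical override, mainly useful for testing.
hup_mountd() {
    pidfile="${1:-/var/run/mountd.pid}"
    if [ ! -r "$pidfile" ]; then
        echo "no readable pid file at $pidfile; is mountd running?" >&2
        return 1
    fi
    # kill returns non-zero if the pid is stale or we lack permission.
    kill -HUP "$(cat "$pidfile")"
}
```

Run as root on the NFS server, calling hup_mountd with no arguments
after editing the exports should be enough; with /sbin/mount patched
as described, this is the only time mountd gets the signal.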
>
> Arno
>
>
>> Vince
>>
>>> Thanx in advance,
>>>
>>> Best, Arno




