Date:      Mon, 21 Apr 2008 08:43:33 -0700
From:      Jeremy Chadwick <koitsu@freebsd.org>
To:        "Arno J. Klaassen" <arno@heho.snv.jussieu.fr>
Cc:        Clayton Milos <clay@milos.co.za>, Kris Kennaway <kris@FreeBSD.ORG>, stable@FreeBSD.ORG, net@FreeBSD.ORG
Subject:   Re: nfs-server silent data corruption
Message-ID:  <20080421154333.GA96237@eos.sc1.parodius.com>
In-Reply-To: <wp63ubp8e0.fsf@heho.snv.jussieu.fr>
References:  <wpmyno2kqe.fsf@heho.snv.jussieu.fr> <20080421094718.GY25623@hub.freebsd.org> <wp63ubp8e0.fsf@heho.snv.jussieu.fr>

On Mon, Apr 21, 2008 at 04:52:55PM +0200, Arno J. Klaassen wrote:
> Kris Kennaway <kris@FreeBSD.ORG> writes:
> > Uh, you're getting server-side data corruption, it could definitely be
> > because of the memory you added.
> 
> yop, though I'm still not convinced the memory is bad (the very same
> Kingston ECC as the 2*1G in use for about half a year already) :

Can you download and run memtest86 on this system, with the added 2G ECC
installed?  memtest86 doesn't guarantee showing signs of memory problems,
but in most cases it'll start spewing errors almost immediately.

One thing I did notice in the motherboard manual below is something
called "Hammer Configuration".  It appears to default to 800MHz, but
there's an "Auto" choice.  Does using Auto fix anything?

> I added it directly to the 2nd CPU (diagram on page 9 of
>  http://www.tyan.com/manuals/m_s2895_101.pdf) and the problem
> seems to be the interaction between nfe0 and powerd .... :

That board is the weirdest thing I've seen in years.

Two separate CPUs using a single (shared) memory controller, two
separate (and different!) nVidia chipsets, a SMSC I/O controller
probably used for serial and parallel I/O, two separate nVidia NICs with
Marvell PHYs (yet somehow you can bridge the two NICs and PHYs?), two
separate PCI-e busses (each associated with a separate nVidia chipset),
two separate PCI-X busses... the list continues.

I know you don't need opinions at this point, but what a behemoth.  I
can't imagine that thing running reliably.

>  - if I stop powerd, problems go away

This would imply that clock frequency stepping is somehow contributing
to the corruption.  I don't see any BIOS options for controlling
things related to AMD's Cool-n-Quiet or PowerNow! feature, which is
usually what handles this.

>  - I let run powerd but turn of txcsum and tso4 on the interface,
>    the problem is a lot harder to produce (if ever this gives
>    a hint to anyone)

Possibly shared interrupts are causing problems?  MSI/MSI-X doing
something odd?  Have you tried disabling MSI/MSI-X to see if it makes a
difference?

Can you boot the machine in verbose mode, and put the dmesg up
somewhere?

> Device is :
> 
> nfe0@pci0:0:10:0:       class=0x068000 card=0x289510f1 chip=0x005710de rev=0xa3 hdr=0x00
>     vendor     = 'Nvidia Corp'
>     device     = 'nForce4 Ultra NVidia Network Bus Enumerator'
>     class      = bridge
>     cap 01[44] = powerspec 2  supports D0 D1 D2 D3  current D0
> 
> (this is with the default BIOS setting " LAN Bridge Enabled", disabling
>  that setting makes pciconf say "class = network" but does not influence
>  my problem)

I think you mean "MAC LAN Bridge", according to the motherboard manual.
I'm not even sure what that really does; does it somehow trunk the two
NICs together to give you the equivalent of 2000Mbit of throughput?  I
don't know.

Does the corruption you see go away if you install a separate NIC (e.g.
an Intel NIC) in a PCI or PCI-e slot, and disable the onboard NICs
(should be "MAC LAN: Disable" on both the primary and slave)?
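
Since you'll be comparing several configurations (powerd on/off,
txcsum/tso4 on/off, onboard vs. add-in NIC), a repeatable check would
help.  Below is a minimal sketch, not something tested against your
setup: it writes random data to a directory on the NFS mount, reads it
back, and compares SHA-256 digests.  The function names, file names,
round count and file size are all my own placeholders.  Note that a
read-back on the same client may be served from the client's cache, so
a clean run here doesn't prove the server stored the data correctly;
hashing the same files on the server or from a second client is the
stronger test.

```python
import hashlib
import os
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_test_file(path, size_bytes, chunk_size=1 << 20):
    """Write `size_bytes` of random data to `path`.

    Returns the hex SHA-256 digest of the data as it was handed to write().
    """
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            fh.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # push the data out before reading it back
    return digest.hexdigest()


def count_mismatches(directory, rounds=10, size_bytes=64 * 1024 * 1024):
    """Write and re-read `rounds` files in `directory`; return how many differ.

    Files that read back correctly are deleted; mismatching files are left
    in place so they can be inspected afterwards.
    """
    mismatches = 0
    for i in range(rounds):
        # Hypothetical file name, chosen only for this sketch.
        path = os.path.join(directory, "corruption-test-%d.bin" % i)
        expected = write_test_file(path, size_bytes)
        actual = sha256_of(path)
        if actual == expected:
            os.remove(path)
        else:
            mismatches += 1
            print("round %d: MISMATCH, kept %s" % (i, path))
    return mismatches


if __name__ == "__main__":
    # Usage: python nfscheck.py /path/on/nfs/mount
    bad = count_mismatches(sys.argv[1])
    print("%d mismatching file(s)" % bad)
    sys.exit(1 if bad else 0)
```

Run it once per configuration and note the mismatch count; even a
single mismatch is enough to say that configuration is still affected.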

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



