Date:      26 Apr 2008 17:14:46 +0200
From:      "Arno J. Klaassen" <arno@heho.snv.jussieu.fr>
To:        Mike Tancsa <mike@sentex.net>
Cc:        stable@freebsd.org, pluknet@gmail.com
Subject:   Re: nfs-server silent data corruption
Message-ID:  <wp7iek4pi1.fsf@heho.snv.jussieu.fr>
In-Reply-To: <200804222155.m3MLtoKt093783@lava.sentex.ca>
References:  <wpmyno2kqe.fsf@heho.snv.jussieu.fr> <20080421094718.GY25623@hub.freebsd.org> <wp63ubp8e0.fsf@heho.snv.jussieu.fr> <200804211537.m3LFbaZA086977@lava.sentex.ca> <wpy77650s0.fsf@heho.snv.jussieu.fr> <200804221501.m3MF1guW092221@lava.sentex.ca> <wpzlrlu6w7.fsf@heho.snv.jussieu.fr> <200804221741.m3MHfYjO092795@lava.sentex.ca> <wpabjln518.fsf@heho.snv.jussieu.fr> <200804221807.m3MI73bN092981@lava.sentex.ca> <wpk5ipkaaa.fsf@heho.snv.jussieu.fr> <200804222155.m3MLtoKt093783@lava.sentex.ca>



Hello,


Mike Tancsa <mike@sentex.net> writes:

> At 02:35 PM 4/22/2008, Arno J. Klaassen wrote:
> 
> > > Also, you are using ULE or the 4BSD scheduler ?  I
> > > still have 4BSD on the box I am testing on.
> >
> >Interesting, this is with ULE. I didn't really test 4BSD on this
> >box (I believed those who said SMP needs ULE *and* am quite
> >satisfied with overall performance). I'll try 4BSD though time
> >is getting short; I promised to deliver this box next thursday but will
> >still have some days for on-site testing.
> 
> 
> I have recompiled the kernel with ULE, and it seems fine as well.  I
> ran 160 iterations of a 300MB file and there was no corruption.  Same
> process - copy a junk random file over nfs mount, unmount the nfs
> mount, remount it copy it back, compare the files.
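
The round trip Mike describes (copy over NFS, unmount, remount, copy back, compare) can be sketched roughly as below. This is only an illustration, not the script either of us actually ran: the function names, the `/mnt/nfs` mount point and the reliance on an fstab entry for that mount point are all assumptions.

```python
import filecmp
import os
import shutil
import subprocess


def make_junk(path, size):
    """Write `size` random bytes to `path` (the 'junk random file')."""
    with open(path, "wb") as f:
        f.write(os.urandom(size))


def remount_nfs(mountpoint="/mnt/nfs"):
    """Unmount and remount so the read-back cannot come from the client cache.

    Hypothetical mount point; assumes an fstab entry exists for it and that
    the caller is allowed to mount and unmount.
    """
    subprocess.run(["umount", mountpoint], check=True)
    subprocess.run(["mount", mountpoint], check=True)


def roundtrip_ok(src, nfs_dir, back, remount=None):
    """Copy src into nfs_dir, optionally remount, copy it back, compare.

    Returns True when the copy that came back is byte-for-byte identical.
    """
    remote = os.path.join(nfs_dir, os.path.basename(src))
    shutil.copyfile(src, remote)
    if remount is not None:
        remount()
    shutil.copyfile(remote, back)
    filecmp.clear_cache()  # never reuse the verdict of an earlier iteration
    return filecmp.cmp(src, back, shallow=False)
```

A loop such as `for i in range(160): assert roundtrip_ok("BIG", "/mnt/nfs", "BIG2", remount_nfs)` would then correspond to the 160 iterations mentioned above, with a 300MB file produced by `make_junk`.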


Let me summarise my investigations so far:


- in all failing cases just *one* byte is corrupted: 4 or all 8 bits
  are set to zero *and* the original value is one out of the limited
  subset {1, 8, 9} ....

  here is the output of `cmp -x $i/BIG $i/BIG2` for some failing
  cases I saved :


  03869a48 09 00
  05209d88 09 00
  01777148 09 00
  00f10f88 09 00
  01f4c4c8 11 00
  06c3d6c8 11 00
  0725ca48 18 00
  01608008 09 00
  00f3b888 18 00

  07aa45c8 29 20


- it does *not* seem to depend on :

   - the interface : I could reproduce it using nfe0, nfe1 and
     re0 (the latter on some Netgear PCI card)

   - the distribution of the 4Gig memory : installing 4G at
     CPU1 or 1G at CPU1 and 2G at CPU2 produces the same results
     (NB, all memory passed memtest.iso in both situations
      for a complete run)

   - the frequency control method : easier to produce with
     cpufreq/powerd, but in the end I can reproduce the corruption
     using acpi_ppc as well

   - the nfs-client and options (not exhaustively tested, but the
     different tests include i386-releng6, amd64-releng6 and linux, and
     quite a set of different try-and-see mount_nfs options)
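
To check the "bits set to zero" observation mechanically, the saved `cmp -x` lines listed above can be fed through a small script. This is just a sketch: it assumes the three columns are hexadecimal (offset, byte in BIG, byte in BIG2), and the helper names are made up for illustration.

```python
# The ten differing bytes saved from `cmp -x $i/BIG $i/BIG2`.
SAMPLE = """\
03869a48 09 00
05209d88 09 00
01777148 09 00
00f10f88 09 00
01f4c4c8 11 00
06c3d6c8 11 00
0725ca48 18 00
01608008 09 00
00f3b888 18 00
07aa45c8 29 20
"""


def parse_cmp_x(text):
    """Return (offset, old, new) tuples from hex 'offset old new' lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        offset, old, new = (int(p, 16) for p in parts)
        rows.append((offset, old, new))
    return rows


def cleared_and_set(old, new):
    """Bit masks of the bits that went 1->0 and 0->1 in a single byte."""
    cleared = old & ~new & 0xFF
    newly_set = ~old & new & 0xFF
    return cleared, newly_set


ROWS = parse_cmp_x(SAMPLE)
for offset, old, new in ROWS:
    cleared, newly_set = cleared_and_set(old, new)
    print(f"{offset:08x} {old:02x}->{new:02x} "
          f"cleared={cleared:08b} set={newly_set:08b}")
```

For these ten cases the "set" mask is zero in every row, i.e. bits are only ever cleared, never turned on.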

I am testing right now with a fixed frequency of 1 GHz.

I am not so inclined to test 4BSD, since reboot possibilities are
limited for me now on this box, but next week I will set up a similar
board (S3992e) (iff I can find quad-core socket F over here ...)
and in a certain sense I hope I can reproduce it on that board as well.

Best, Arno


