Date:      Tue, 13 May 2003 15:16:37 -0700
From:      Peter Wemm <peter@wemm.org>
To:        Don Lewis <truckman@FreeBSD.org>
Cc:        current@FreeBSD.org
Subject:   Re: 5.1-RELEASE TODO 
Message-ID:  <20030513221637.3B7422A8AC@canning.wemm.org>
In-Reply-To: <200305132101.h4DL1QM7051267@gw.catspoiler.org> 

Don Lewis wrote:
> On 13 May, Robert Watson wrote:
> > 
> > On Tue, 13 May 2003, Heiko Schaefer wrote:
> > 
> >> > That said, we are actively discussing what, if any, workarounds are
> >> > appropriate, including some alternative workarounds from the ones
> >> > currently present.
> >> 
> >> bosko (who was mentioned here various time, regarding a patch to work
> >> around this) has contacted me, and i am looking forward to try his
> >> patch.  assuming that the patch is correct (whatever that would mean in
> >> this context), and there is some chance of accepting it anytime soon,
> >> maybe it would be sensible to try to get that into the release - or
> >> delay the release until this is sorted out ?! 
> >> 
> >> wouldn't a release that corrupts data in many, relevant, cases (i
> >> consider the box i had the trouble with entirely mainstream) be worse
> >> than no release at all? 
> > 
> > You don't need to argue to me that we need stability (I'm a fan of it
> > myself): what I need is evidence that some set of changes is actually
> > solving the problem, not masking it.  If there exists a patch that
> > substantially improves stability on some set of systems (and not at the
> > cost of another set), I think you can rest assured that we'll get it into
> > the release.  As with you, we're very concerned by the recent spate of
> > instability, especially in the beta cycle, and how to address that is very
> > much on our minds. 
> 
> Both my AMD system running -current and PII system running -stable are
> afflicted with these data corruption problems.  The limited amount of
> information that I've seen about these problems leads me to believe that
> in order to use the 4 MB page feature without danger to system integrity
> is to relocate the kernel.  If this is the case, then it would seem to
> make sense to disable the use of 4 MB pages by adding the DISABLE_PSE
> option until the system is patched.

The thing is, we only use 4MB pages for two things:
1) The first 4MB of KVM is mapped as a 4MB page.
2) Large device mappings, e.g. the X server mmap()ing /dev/mem for the frame
buffer.  However, these 4MB pages are not mapped with PG_G.

Are you running X?  Are you using the broadcom ethernet driver?

Also of note:  I recently saw a brand new P4 system with a genuine Intel
motherboard, running RELENG_4.  It had shocking data corruption
problems.  The memory was swapped - no change.  The motherboard and CPU were
swapped (same motherboard model, much newer P4 cpu stepping) - no change.
It was simply unreliable.  Backporting DISABLE_PG_G to RELENG_4 and turning
both it and DISABLE_PSE on greatly reduced the problem, but it still happened.
In the end, the Intel motherboard was replaced with a P4 Xeon system
motherboard and the problem instantly went away.  The trouble appeared
to be a generic problem with the Intel 845 chipset motherboard.

Remember, this was RELENG_4 as of a few months ago.  It isn't a 5.x-only
problem.

The bge driver has been firmly implicated in one of the cases of data
corruption.  Paul's recent if_bge fixes completely solved one person's
long-standing problems.  There are DMA bugs in the earlier chipsets that
we didn't have the prescribed workarounds for.  And even though the compiles
were happening on local disks, all it took to trigger the corruption was
running the build in an xterm so that the make output was going over the
network, or doing a tail -f etc.

> PG_G is probably different.  A better case can be made that using this
> option is only masking software bugs that should be fixable.  The
> problem is that these bugs are only rarely triggered, look a lot like
> flakey hardware, and it's just about impossible for most FreeBSD users
> to track the problem to its root cause.

For what it's worth, we have #ifdef'ed code in i386/pmap.c:
#ifdef I686_CPU_not     /* Problem seems to have gone away */
        /* Deal with un-resolved Pentium4 issues */
        if (cpu_class == CPUCLASS_686 &&
            strcmp(cpu_vendor, "GenuineIntel") == 0 &&
            (cpu_id & 0xf00) == 0xf00) {
                printf("Warning: Pentium 4 cpu: PG_G disabled (global flag)\n");
                pgeflag = 0;
        }
#endif

I really do not want DISABLE_PSE and DISABLE_PG_G turned on for a problem
that appears to have a hardware component.  I'd much rather have the above
#ifdef turned on.
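
As a rough illustration of what that test looks at: the 0xf00 mask selects
the family field (bits 8-11) of the CPUID signature, and family 15 is what
the Pentium 4 reports.  The sketch below is standalone userland C, not kernel
code.  The helper names cpu_family() and is_intel_family_15() are made up for
this example, and it leaves out the cpu_class check from the kernel snippet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Family field of the CPUID signature: bits 8..11, i.e. the 0xf00 mask. */
static unsigned cpu_family(uint32_t cpu_id)
{
        return (cpu_id & 0xf00) >> 8;
}

/*
 * Hypothetical helper mirroring the vendor and family parts of the
 * #ifdef'ed test above.  Returns nonzero for a GenuineIntel CPU whose
 * family field is 0xf.
 */
static int is_intel_family_15(const char *vendor, uint32_t cpu_id)
{
        assert(vendor != NULL);
        return strcmp(vendor, "GenuineIntel") == 0 &&
            (cpu_id & 0xf00) == 0xf00;
}
```

With an illustrative signature such as Id = 0xf27, cpu_family() gives 0xf and
the helper matches; a family 6 part such as Id = 0x673 does not match, and
neither does a non-Intel vendor string.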

For the folks having problems, here's what I'd like to know:

- Are you running X?  (standard XFree86 or do you have the agp and drm drivers
enabled?)
- What ethernet card?  (particularly if bge)
- Is there any network traffic at the time?  i.e. if you remove the network
card entirely and do the compile tests on a /dev/ttyv0 console, does it still
happen?
- What hardware do you have?  (cpuid line showing the Id = 0xNNN number,
memory size/type and whether it has ECC or not, motherboard chipset, etc)
- Have you replaced any hardware?  If so, which parts?

Cheers,
-Peter
--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5