From owner-freebsd-current@FreeBSD.ORG Tue May 13 15:41:55 2003
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
    by hub.freebsd.org (Postfix) with ESMTP id 0ADDA37B401;
    Tue, 13 May 2003 15:41:55 -0700 (PDT)
Received: from canning.wemm.org (canning.wemm.org [192.203.228.65])
    by mx1.FreeBSD.org (Postfix) with ESMTP id 3B34A43F3F;
    Tue, 13 May 2003 15:41:54 -0700 (PDT)
    (envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
    by canning.wemm.org (Postfix) with ESMTP id 205E02A7EA;
    Tue, 13 May 2003 15:41:54 -0700 (PDT)
    (envelope-from peter@wemm.org)
X-Mailer: exmh version 2.5 07/13/2001 with nmh-1.0.4
To: Don Lewis, re@FreeBSD.org, rwatson@FreeBSD.org, hschaefer@fto.de,
    current@FreeBSD.org
In-Reply-To: <20030513221637.3B7422A8AC@canning.wemm.org>
Date: Tue, 13 May 2003 15:41:54 -0700
From: Peter Wemm
Message-Id: <20030513224154.205E02A7EA@canning.wemm.org>
Subject: Re: 5.1-RELEASE TODO
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Tue, 13 May 2003 22:41:55 -0000

Peter Wemm wrote:
> Don Lewis wrote:
> > On 13 May, Robert Watson wrote:
> > >
> > > On Tue, 13 May 2003, Heiko Schaefer wrote:
> > >
> > >> > That said, we are actively discussing what, if any, workarounds are
> > >> > appropriate, including some alternative workarounds from the ones
> > >> > currently present.
> > >>
> > >> bosko (who was mentioned here various time, regarding a patch to work
> > >> around this) has contacted me, and i am looking forward to try his
> > >> patch.
> > >> assuming that the patch is correct (whatever that would mean in
> > >> this context), and there is some chance of accepting it anytime soon,
> > >> maybe it would be sensible to try to get that into the release - or
> > >> delay the release until this is sorted out ?!
> > >>
> > >> wouldn't a release that corrupts data in many, relevant, cases (i
> > >> consider the box i had the trouble with entirely mainstream) be worse
> > >> than no release at all?
> > >
> > > You don't need to argue to me that we need stability (I'm a fan of it
> > > myself): what I need is evidence that some set of changes is actually
> > > solving the problem, not masking it.  If there exists a patch that
> > > substantially improves stability on some set of systems (and not at the
> > > cost of another set), I think you can rest assured that we'll get it into
> > > the release.  As with you, we're very concerned by the recent spate of
> > > instability, especially in the beta cycle, and how to address that is very
> > > much on our minds.
> >
> > Both my AMD system running -current and PII system running -stable are
> > afflicted with these data corruption problems.  The limited amount of
> > information that I've seen about these problems leads me to believe that
> > the only way to use the 4 MB page feature without danger to system
> > integrity is to relocate the kernel.  If this is the case, then it would
> > seem to make sense to disable the use of 4 MB pages by adding the
> > DISABLE_PSE option until the system is patched.
>
> The thing is, we only use 4MB pages for two things.
> 1) The first 4MB of KVM is mapped as a 4MB page.
> 2) Large device mappings, e.g. the X server mmapping /dev/mem for the frame
> buffer.  The thing is though, these 4MB pages are not mapped with PG_G.
>
> Are you running X?  Are you using the Broadcom ethernet driver?
>
> Also of note: I recently saw a brand new P4 system with a genuine Intel
> motherboard, for a RELENG_4 system.
> It had shocking data corruption
> problems.  The memory was swapped - no change.  The motherboard and CPU were
> swapped (same motherboard model, much newer P4 cpu stepping) - no change.
> It was simply unreliable.  Backporting DISABLE_PG_G to RELENG_4 and turning
> it and DISABLE_PSE on greatly reduced the problem, but it still happened.
> In the end, the Intel motherboard was replaced with a P4 Xeon system
> motherboard and the problem instantly went away.  The trouble appeared
> to be a generic problem with the Intel 845 chipset motherboard.
>
> Remember, this was RELENG_4 as of a few months ago.  It isn't a 5.x-only
> problem.
>
> The bge driver has been firmly implicated in one of the cases of data
> corruption.  Paul's recent if_bge fixes completely solved one person's
> long-standing problems.  There are DMA bugs in the earlier chipsets that
> we didn't have the prescribed workarounds for.  And even though the compiles
> were happening on local disks, all it took was running the build in an Xterm
> so that the make output was going over the network, or doing a tail -f etc.
>
> > PG_G is probably different.  A better case can be made that using this
> > option is only masking software bugs that should be fixable.  The
> > problem is that these bugs are only rarely triggered, look a lot like
> > flakey hardware, and it's just about impossible for most FreeBSD users
> > to track the problem to its root cause.
>
> For what it's worth, we have #ifdef'ed code in i386/pmap.c:
> #ifdef I686_CPU_not    /* Problem seems to have gone away */
>         /* Deal with un-resolved Pentium4 issues */
>         if (cpu_class == CPUCLASS_686 &&
>             strcmp(cpu_vendor, "GenuineIntel") == 0 &&
>             (cpu_id & 0xf00) == 0xf00) {
>                 printf("Warning: Pentium 4 cpu: PG_G disabled (global flag)\n");
>                 pgeflag = 0;
>         }
> #endif
>
> I really do not want DISABLE_PSE and DISABLE_PG_G turned on for what appears
> to have a hardware component.  I'd much rather have the above ifdef turned on.
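As a reading aid for the quoted pmap.c fragment: the test
(cpu_id & 0xf00) == 0xf00 looks only at bits 11..8 of the CPUID signature,
which is the family field, and matches family 0xF (the family a Pentium 4
reports).  The sketch below is a standalone userland illustration, not
kernel code.  The names cpu_sig, decode_cpu_id and is_family_f are
hypothetical, and the sample values used when trying it out (0xf24, 0x681)
are illustrative only.  It decodes the same "Id = 0xNNN" number that the
cpuid line in dmesg shows.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Basic fields of the CPUID signature (the "Id = 0xNNN" value).
 * Hypothetical helper type for illustration; not from the FreeBSD tree.
 */
struct cpu_sig {
    unsigned stepping;  /* bits 3..0  */
    unsigned model;     /* bits 7..4  */
    unsigned family;    /* bits 11..8 */
};

/* Split a signature into its stepping, model and family fields. */
struct cpu_sig
decode_cpu_id(uint32_t cpu_id)
{
    struct cpu_sig sig;

    sig.stepping = cpu_id & 0xf;
    sig.model = (cpu_id >> 4) & 0xf;
    sig.family = (cpu_id >> 8) & 0xf;
    return (sig);
}

/*
 * Same mask-and-compare as the quoted fragment: true when the family
 * field is 0xF.  The vendor and cpu_class checks are left out here.
 */
bool
is_family_f(uint32_t cpu_id)
{
    return ((cpu_id & 0xf00) == 0xf00);
}
```

With an Id of 0xf24 the family field is 0xF, so the check is true; with an
Id of 0x681 the family field is 6 and the check is false.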
>
> For the folks having problems, here's what I'd like to know:
>
> - Are you running X? (standard XFree86 or do you have the agp and drm drivers
>   enabled?)
> - What ethernet card? (particularly if bge)
> - Is there any network traffic at the time?  i.e. if you remove the network
>   card entirely and do the compile tests on a /dev/ttyv0 console, does it still
>   happen?
> - What hardware do you have? (cpuid line showing the Id = 0xNNN number,
>   memory size/type and whether it has ECC or not, motherboard chipset, etc)
> - Have you replaced any hardware?  If so, which parts?

Oh, and two more things:
- Do DISABLE_PG_G and/or DISABLE_PSE actually affect the stability?
- Are you seeing application faults (segfault etc) or kernel instability
  (fatal trap, panic etc)?

Cheers,
-Peter
--
Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5