Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 11 Aug 2003 10:05:49 +0000
From:      Bosko Milekic <bmilekic@technokratis.com>
To:        freebsd-current@freebsd.org
Subject:   5.1, Data Corruption, Intel, Oh my! [patch]
Message-ID:  <20030811100549.GA33392@technokratis.com>

next in thread | raw e-mail | index | archive | help

--RnlQjJ0d97Da+TV1
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


[If you're not interested in the history and technical information,
 and just want to try this out, please scroll down to after the [*]].

Hi,

  Over the past few months, it seems that some of our 5.x users have
  been plagued by various random data corruption problems on some
  version(s) of the Intel Pentium IV chip (and in another form, on
  some version(s) of the Intel Pentium Pro and Pentium II chips).

  While in several instances the problem has been traced to faulty
  memory, in some others a work-around solution has been in use.  That
  workaround was to include the DISABLE_PSE and DISABLE_PG_G options in
  the kernel.  It appears to have "solved" the problem.

  Although not documented (at least not publically from what I could
  find), there does appear to be [a] bug[s] on some Intel chips that has
  to do with PSE, PG, and how their co-existence with legacy 4K
  page mappings.  While the exact details of the bug are to me unknown,
  some others have come up with reasonable hypothesis and detailed
  accounts (over my head, but Terry has a fairly consistent one).

  Regardless, I have come up with a patch that hopefully solves the data
  corruption problem for those of you who have been running with
  DISABLE_PSE and DISABLE_PG_G in your kernels.  The patch is attached
  to this Email.

  What the patch does does not directly uncover the true nature of the
  bug (as I don't know that myself), but it does appear to solve the
  problem at least for a couple of users I've been confirming with.  The
  first thing it does is load the kernel at 0x400000 instead of 0x100000
  (this part was implemented by Peter Wemm who sent it to me
  a little over a year ago now).  The next thing it does is write a page
  table for the kernel pages that would - given the right pde - map the
  entire kernel in 4K pages.  However, if PSE is enabled, the page
  table is not used[*] and instead a pde for a 4M page (or several)
  mapping the kernel in its entirety is[are] written before paging is
  ever turned on (that's it, in a nutshell).  I don't want to go into
  much detail on the rationale of all this unless you are really
  interested, at even then only to a point (please Email privately).

[*] In some cases it's still used for a P==V mapping required to boot
    the APs.

  Ok, so now that all that is out of the way, I'd like to request a
  massive movement to help test this change. :-)

What has been done:
  
  Georg-W. Koltermann <g.w.k [at] web.de> - who has experienced the
  above-mentionned data corruption problems on his P4, was able to
  reproduce them fairly reliably, and has been running with DISABLE_PSE
  and DISABLE_PG_G - has tested the patch and confirmed that it solves
  his data corruption problem without having to require the two DISABLE
  options.

  I have tested this patch (and have been running with it for a week
  now) on my -current workstation at the office (although I never
  experienced the aformentionned data corruption problems).

  Robert Watson has tested this patch on one of his SMP machines and it
  has now been confirmed to work there as well.

  We have not yet tested this patch with PAE turned on.

  
What needs to be done:

  If you've had random data corruption problems while running -current
  on Intel, please try this patch WITHOUT the DISABLE_PSE _nor_ the
  DISABLE_PG_G options in your kernel and let me know if the problem
  goes away.  (If it does not go away, then suspect bad RAM...)

  If you have -current running on Intel, PLEASE TEST THIS patch.  This
  includes UP, SMP, PAE, and any combination of said features.  With
  regards to PAE in particular, I'd like to know if you can boot this
  thing as I have not yet attempted it.

  The information I'd like to see, if possible (private mail is OK):

  1) Your hardware, UP or SMP.
  
  2) Whether you have PAE turned on.
  
  3) If you had the data corruption problem, does the patch solve it?
     Please make sure you do NOT include the DISABLE_PSE and
     DISABLE_PG_G options when you test for this, as they should no
     longer be needed.

  If for some reason you continue to see data corruption problems with
  this patch, then make sure that they really do go away with
  DISABLE_PSE and DISABLE_PG_G before you contact me; it could just be
  bad RAM.

  If I commit this, I want it to have the longest "Tested By" list ever
  written in a commit message to date. Please help. :-)
  
Regards,
-- 
Bosko Milekic  *  bmilekic@technokratis.com  *  bmilekic@FreeBSD.org
TECHNOkRATIS Consulting Services  *  http://www.technokratis.com/

--RnlQjJ0d97Da+TV1
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="4mb_new.diff"

--- //depot/vendor/freebsd/src/sys/conf/ldscript.i386	2003/01/05 23:40:18
+++ //depot/user/bmilekic/intelsux/src/sys/conf/ldscript.i386	2003/08/01 10:19:29
@@ -6,7 +6,7 @@
 SECTIONS
 {
   /* Read-only sections, merged into text segment: */
-  . = kernbase + 0x00100000 + SIZEOF_HEADERS;
+  . = kernbase + kernload + SIZEOF_HEADERS;
   .interp     : { *(.interp) 	}
   .hash          : { *(.hash)		}
   .dynsym        : { *(.dynsym)		}
--- //depot/vendor/freebsd/src/sys/i386/i386/genassym.c	2003/06/27 14:55:39
+++ //depot/user/bmilekic/intelsux/src/sys/i386/i386/genassym.c	2003/08/07 12:15:39
@@ -111,13 +111,16 @@
 ASSYM(NPGPTD, NPGPTD);
 ASSYM(PDESIZE, sizeof(pd_entry_t));
 ASSYM(PTESIZE, sizeof(pt_entry_t));
+ASSYM(PDESHIFT, PDESHIFT);
 ASSYM(PTESHIFT, PTESHIFT);
 ASSYM(PAGE_SHIFT, PAGE_SHIFT);
 ASSYM(PAGE_MASK, PAGE_MASK);
 ASSYM(PDRSHIFT, PDRSHIFT);
+ASSYM(PDRMASK, PDRMASK);
 ASSYM(USRSTACK, USRSTACK);
 ASSYM(VM_MAXUSER_ADDRESS, VM_MAXUSER_ADDRESS);
 ASSYM(KERNBASE, KERNBASE);
+ASSYM(KERNLOAD, KERNLOAD);
 ASSYM(MCLBYTES, MCLBYTES);
 ASSYM(PCB_CR3, offsetof(struct pcb, pcb_cr3));
 ASSYM(PCB_EDI, offsetof(struct pcb, pcb_edi));
--- //depot/vendor/freebsd/src/sys/i386/i386/locore.s	2003/07/29 15:00:47
+++ //depot/user/bmilekic/intelsux/src/sys/i386/i386/locore.s	2003/08/08 09:26:56
@@ -87,10 +87,12 @@
 #endif /* SMP */
 
 /*
- * Compiled KERNBASE location
+ * Compiled KERNBASE location and the kernel load address
  */
 	.globl	kernbase
 	.set	kernbase,KERNBASE
+	.globl	kernload
+	.set	kernload,KERNLOAD
 
 /*
  * Globals
@@ -731,9 +733,9 @@
 	je	no_kernend
 	movl	%edi,%esi
 no_kernend:
-	
-	addl	$PAGE_MASK,%esi
-	andl	$~PAGE_MASK,%esi
+
+	addl	$PDRMASK,%esi		/* Play conservative for now, and */
+	andl	$~PDRMASK,%esi		/*   ... wrap to next 4M. */
 	movl	%esi,R(KERNend)		/* save end of kernel */
 	movl	%esi,R(physfree)	/* next free page is at end of kernel */
 
@@ -783,16 +785,46 @@
 	movl	%esi, R(SMPpt)		/* relocated to KVM space */
 #endif	/* SMP */
 
-/* Map read-only from zero to the end of the kernel text section */
+/* Map read-only from zero to the beginning of the kernel text section */
 	xorl	%eax, %eax
 	xorl	%edx,%edx
-	movl	$R(etext),%ecx
+	movl	$R(btext),%ecx
 	addl	$PAGE_MASK,%ecx
 	shrl	$PAGE_SHIFT,%ecx
 	fillkptphys(%edx)
 
-/* Map read-write, data, bss and symbols */
-	movl	$R(etext),%eax
+/*
+ * Enable PSE and PGE.
+ */
+#ifndef DISABLE_PSE
+	testl	$CPUID_PSE, R(cpu_feature)
+	jz	1f
+	movl	$PG_PS, R(pseflag)
+	movl	%cr4, %eax
+	orl	$CR4_PSE, %eax
+	movl	%eax, %cr4
+1:
+#endif
+#ifndef DISABLE_PG_G
+	testl	$CPUID_PGE, R(cpu_feature)
+	jz	2f
+	movl	$PG_G, R(pgeflag)
+	movl	%cr4, %eax
+	orl	$CR4_PGE, %eax
+	movl	%eax, %cr4
+2:
+#endif
+
+/*
+ * Write page tables for the kernel starting at btext and
+ * until the end.  Make sure to map read+write.  We do this even
+ * if we've enabled PSE above, we'll just switch the corresponding kernel
+ * PDEs before we turn on paging.
+ *
+ * XXX: We waste some pages here in the PSE case!  DON'T BLINDLY REMOVE
+ * THIS!  SMP needs the page table to be there to map the kernel P==V.
+ */
+	movl	$R(btext),%eax
 	addl	$PAGE_MASK, %eax
 	andl	$~PAGE_MASK, %eax
 	movl	$PG_RW,%edx
@@ -881,12 +913,33 @@
 	movl	$NKPT, %ecx
 	fillkpt(R(IdlePTD), $PG_RW)
 
-/* install pde's for pt's */
+/*
+ * For the non-PSE case, install PDEs for PTs covering the kernel.
+ * For the PSE case, do the same, but clobber the ones corresponding
+ * to the kernel (from btext to KERNend) with 4M ('PS') PDEs immediately
+ * after.
+ */
 	movl	R(KPTphys), %eax
 	movl	$KPTDI, %ebx
 	movl	$NKPT, %ecx
 	fillkpt(R(IdlePTD), $PG_RW)
+	cmpl	$0,R(pseflag)
+	je	done_pde
 
+	movl	R(KERNend), %ecx
+	movl	$KERNLOAD, %eax
+	subl	%eax, %ecx
+	shrl	$PDRSHIFT, %ecx
+	movl	$(KPTDI+(KERNLOAD/(1 << PDRSHIFT))), %ebx
+	shll	$PDESHIFT, %ebx
+	addl	R(IdlePTD), %ebx
+	orl	$(PG_V|PG_RW|PG_PS), %eax
+1:	movl	%eax, (%ebx)
+	addl	$(1 << PDRSHIFT), %eax
+	addl	$PDESIZE, %ebx
+	loop	1b
+
+done_pde:
 /* install a pde recursively mapping page directory as a page table */
 	movl	R(IdlePTD), %eax
 	movl	$PTDPTDI, %ebx
--- //depot/vendor/freebsd/src/sys/i386/i386/machdep.c	2003/07/30 18:30:31
+++ //depot/user/bmilekic/intelsux/src/sys/i386/i386/machdep.c	2003/08/01 10:19:29
@@ -1798,7 +1798,7 @@
 			/*
 			 * block out kernel memory as not available.
 			 */
-			if (pa >= 0x100000 && pa < first)
+			if (pa >= KERNLOAD && pa < first)
 				continue;
 	
 			page_bad = FALSE;
--- //depot/vendor/freebsd/src/sys/i386/i386/mp_machdep.c	2003/07/23 12:05:27
+++ //depot/user/bmilekic/intelsux/src/sys/i386/i386/mp_machdep.c	2003/08/11 06:31:16
@@ -533,8 +533,6 @@
 	cr0 = rcr0();
 	cr0 &= ~(CR0_CD | CR0_NW | CR0_EM);
 	load_cr0(cr0);
-
-	pmap_set_opt();
 }
 
 
@@ -923,7 +921,6 @@
 	int     type;
 	int     apic, bus, cpu, intr;
 	int	i, j;
-	int	pgeflag;
 
 	POSTCODE(MPTABLE_PASS2_POST);
 
@@ -932,8 +929,6 @@
 	proc.type = 0;
 	proc.cpu_flags = PROCENTRY_FLAG_EN;
 
-	pgeflag = 0;		/* XXX - Not used under SMP yet.  */
-
 	MALLOC(io_apic_versions, u_int32_t *, sizeof(u_int32_t) * mp_napics,
 	    M_DEVBUF, M_WAITOK);
 	MALLOC(ioapic, volatile ioapic_t **, sizeof(ioapic_t *) * mp_napics,
@@ -958,7 +953,7 @@
 			/* use this slot if available */
 			if (((vm_offset_t)SMPpt[NPTEPG-2-j] & PG_FRAME) == 0) {
 				SMPpt[NPTEPG-2-j] = (pt_entry_t)(PG_V | PG_RW |
-				    pgeflag | (io_apic_address[i] & PG_FRAME));
+				    (io_apic_address[i] & PG_FRAME));
 				ioapic[i] = (ioapic_t *)((u_int)SMP_prvspace
 					+ (NPTEPG-2-j) * PAGE_SIZE
 					+ (io_apic_address[i] & PAGE_MASK));
@@ -2174,7 +2169,6 @@
 
 	for (x = 0; x < NKPT; x++)
 		PTD[x] = 0;
-	pmap_set_opt();
 
 	/* number of APs actually started */
 	return mp_ncpus - 1;
@@ -2282,7 +2276,7 @@
 	/* setup common fields for subsequent IPIs */
 	icr_lo = lapic.icr_lo & APIC_ICRLO_RESV_MASK;
 	icr_lo |= APIC_DESTMODE_PHY;
-	
+
 	/* do an INIT IPI: assert RESET */
 	lapic.icr_lo = icr_lo | APIC_DEST_DESTFLD | APIC_TRIGMOD_EDGE |
 	    APIC_LEVEL_ASSERT | APIC_DELMODE_INIT;
--- //depot/vendor/freebsd/src/sys/i386/i386/mpboot.s	2003/03/29 21:25:29
+++ //depot/user/bmilekic/intelsux/src/sys/i386/i386/mpboot.s	2003/08/08 10:29:41
@@ -86,6 +86,22 @@
 	movl	R(IdlePTD), %eax
 	movl	%eax,%cr3	
 #endif
+#ifndef DISABLE_PSE
+	cmpl	$0, R(pseflag)
+	je	1f
+	movl	%cr4, %eax
+	orl	$CR4_PSE, %eax
+	movl	%eax, %cr4
+1:
+#endif
+#ifndef DISABLE_PGE
+	cmpl	$0, R(pgeflag)
+	je	2f
+	movl	%cr4, %eax
+	orl	$CR4_PGE, %eax
+	movl	%eax, %cr4
+2:
+#endif
 	movl	%cr0,%eax
 	orl	$CR0_PE|CR0_PG,%eax		/* enable paging */
 	movl	%eax,%cr0			/* let the games begin! */
--- //depot/vendor/freebsd/src/sys/i386/i386/pmap.c	2003/08/10 14:55:32
+++ //depot/user/bmilekic/intelsux/src/sys/i386/i386/pmap.c	2003/08/11 06:12:50
@@ -195,8 +195,8 @@
 vm_offset_t virtual_avail;	/* VA of first avail page (after kernel bss) */
 vm_offset_t virtual_end;	/* VA of last avail page (end of kernel AS) */
 static boolean_t pmap_initialized = FALSE;	/* Has pmap_init completed? */
-static int pgeflag;		/* PG_G or-in */
-static int pseflag;		/* PG_PS or-in */
+int pgeflag = 0;		/* PG_G or-in */
+int pseflag = 0;		/* PG_PS or-in */
 
 static int nkpt;
 vm_offset_t kernel_vm_end;
@@ -261,8 +261,6 @@
 static void *pmap_pdpt_allocf(uma_zone_t zone, int bytes, u_int8_t *flags, int wait);
 #endif
 
-static pd_entry_t pdir4mb;
-
 CTASSERT(1 << PDESHIFT == sizeof(pd_entry_t));
 CTASSERT(1 << PTESHIFT == sizeof(pt_entry_t));
 
@@ -286,7 +284,7 @@
 #endif
 #ifndef DISABLE_PSE
 	if (cpu_feature & CPUID_PSE)
-		newaddr = (addr + (NBPDR - 1)) & ~(NBPDR - 1);
+		newaddr = (addr + PDRMASK) & ~PDRMASK;
 #endif
 	return newaddr;
 }
@@ -398,11 +396,6 @@
 	for (i = 0; i < NKPT; i++)
 		PTD[i] = 0;
 
-	pgeflag = 0;
-#ifndef DISABLE_PG_G
-	if (cpu_feature & CPUID_PGE)
-		pgeflag = PG_G;
-#endif
 #ifdef I686_CPU_not	/* Problem seems to have gone away */
 	/* Deal with un-resolved Pentium4 issues */
 	if (cpu_class == CPUCLASS_686 &&
@@ -412,21 +405,7 @@
 		pgeflag = 0;
 	}
 #endif
-	
-/*
- * Initialize the 4MB page size flag
- */
-	pseflag = 0;
-/*
- * The 4MB page version of the initial
- * kernel page mapping.
- */
-	pdir4mb = 0;
 
-#ifndef DISABLE_PSE
-	if (cpu_feature & CPUID_PSE)
-		pseflag = PG_PS;
-#endif
 #ifdef I686_CPU_not	/* Problem seems to have gone away */
 	/* Deal with un-resolved Pentium4 issues */
 	if (cpu_class == CPUCLASS_686 &&
@@ -436,26 +415,10 @@
 		pseflag = 0;
 	}
 #endif
-#ifndef DISABLE_PSE
-	if (pseflag) {
-		pd_entry_t ptditmp;
-		/*
-		 * Note that we have enabled PSE mode
-		 */
-		ptditmp = *(PTmap + i386_btop(KERNBASE));
-		ptditmp &= ~(NBPDR - 1);
-		ptditmp |= PG_V | PG_RW | PG_PS | PG_U | pgeflag;
-		pdir4mb = ptditmp;
-	}
-#endif
-#ifndef SMP
-	/*
-	 * Turn on PGE/PSE.  SMP does this later on since the
-	 * 4K page tables are required for AP boot (for now).
-	 * XXX fixme.
-	 */
-	pmap_set_opt();
-#endif
+
+	/* Turn on PG_G on kernel page(s) */
+	pmap_set_pg();
+
 #ifdef SMP
 	if (cpu_apic_address == 0)
 		panic("pmap_bootstrap: no local apic! (non-SMP hardware?)");
@@ -468,55 +431,41 @@
 }
 
 /*
- * Enable 4MB page mode for MP startup.  Turn on PG_G support.
- * BSP will run this after all the AP's have started up.
+ * Set PG_G on kernel pages.  Only the BSP calls this when SMP is turned on.
  */
 void
-pmap_set_opt(void)
+pmap_set_pg(void)
 {
+	pd_entry_t pdir;
 	pt_entry_t *pte;
 	vm_offset_t va, endva;
+	int i; 
+
+	if (pgeflag == 0)
+		return;
 
-	if (pgeflag && (cpu_feature & CPUID_PGE)) {
-		load_cr4(rcr4() | CR4_PGE);
-		invltlb();		/* Insurance */
-	}
-#ifndef DISABLE_PSE
-	if (pseflag && (cpu_feature & CPUID_PSE)) {
-		load_cr4(rcr4() | CR4_PSE);
-		invltlb();		/* Insurance */
-	}
-#endif
-	if (PCPU_GET(cpuid) == 0) {
-#ifndef DISABLE_PSE
-		if (pdir4mb) {
-			kernel_pmap->pm_pdir[KPTDI] = PTD[KPTDI] = pdir4mb;
-			invltlb();	/* Insurance */
+	i = KERNLOAD/NBPDR;
+	endva = KERNBASE + KERNend;
+
+	if (pseflag) {
+		va = KERNBASE + KERNLOAD;
+		while (va  < endva) {
+			pdir = kernel_pmap->pm_pdir[KPTDI+i];
+			pdir |= pgeflag;
+			kernel_pmap->pm_pdir[KPTDI+i] = PTD[KPTDI+i] = pdir;
+			invltlb();	/* Play it safe, invltlb() every time */
+			i++;
+			va += NBPDR;
 		}
-#endif
-		if (pgeflag) {
-			/* Turn on PG_G for text, data, bss pages. */
-			va = (vm_offset_t)btext;
-#ifndef DISABLE_PSE
-			if (pseflag && (cpu_feature & CPUID_PSE)) {
-				if (va < KERNBASE + (1 << PDRSHIFT))
-					va = KERNBASE + (1 << PDRSHIFT);
-			}
-#endif
-			endva = KERNBASE + KERNend;
-			while (va < endva) {
-				pte = vtopte(va);
-				if (*pte)
-					*pte |= pgeflag;
-				va += PAGE_SIZE;
-			}
-			invltlb();	/* Insurance */
+	} else {
+		va = (vm_offset_t)btext;
+		while (va < endva) {
+			pte = vtopte(va);
+			if (*pte)
+				*pte |= pgeflag;
+			invltlb();	/* Play it safe, invltlb() every time */
+			va += PAGE_SIZE;
 		}
-		/*
-		 * We do not need to broadcast the invltlb here, because
-		 * each AP does it the moment it is released from the boot
-		 * lock.  See ap_init().
-		 */
 	}
 }
 
@@ -3180,7 +3129,7 @@
 		return addr;
 	}
 
-	addr = (addr + (NBPDR - 1)) & ~(NBPDR - 1);
+	addr = (addr + PDRMASK) & ~PDRMASK;
 	return addr;
 }
 
--- //depot/vendor/freebsd/src/sys/i386/include/pmap.h	2003/04/28 13:37:59
+++ //depot/user/bmilekic/intelsux/src/sys/i386/include/pmap.h	2003/08/08 10:29:41
@@ -331,6 +331,8 @@
 extern vm_offset_t clean_eva;
 extern vm_offset_t clean_sva;
 extern vm_paddr_t phys_avail[];
+extern int pseflag;
+extern int pgeflag;
 extern char *ptvmmap;		/* poor name! */
 extern vm_offset_t virtual_avail;
 extern vm_offset_t virtual_end;
@@ -341,7 +343,7 @@
 void	*pmap_mapdev(vm_paddr_t, vm_size_t);
 void	pmap_unmapdev(vm_offset_t, vm_size_t);
 pt_entry_t *pmap_pte_quick(pmap_t, vm_offset_t) __pure2;
-void	pmap_set_opt(void);
+void	pmap_set_pg(void);
 void	pmap_invalidate_page(pmap_t, vm_offset_t);
 void	pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t);
 void	pmap_invalidate_all(pmap_t);
--- //depot/vendor/freebsd/src/sys/i386/include/vmparam.h	2003/04/30 17:11:31
+++ //depot/user/bmilekic/intelsux/src/sys/i386/include/vmparam.h	2003/08/01 10:19:29
@@ -84,6 +84,13 @@
 
 
 /*
+ * Kernel physical load address.
+ */
+#ifndef KERNLOAD
+#define	KERNLOAD		0x400000
+#endif
+
+/*
  * Virtual addresses of things.  Derived from the page directory and
  * page table indexes from pmap.h for precision.
  * Because of the page that is both a PD and PT, it looks a little

--RnlQjJ0d97Da+TV1--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20030811100549.GA33392>