Date:      Wed, 10 Mar 2010 13:27:53 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Kevin Day <toasty@dragondata.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Extremely slow boot on VMWare with Opteron 2352 (acpi?)
Message-ID:  <20100310112753.GW2489@deviant.kiev.zoral.com.ua>
In-Reply-To: <207B4180-B8AF-4C93-8BC7-7F1FFEEBB713@dragondata.com>
References:  <2C7A849F-2571-48E7-AA75-B6F87C2352C1@dragondata.com> <201003091727.09188.jhb@freebsd.org> <207B4180-B8AF-4C93-8BC7-7F1FFEEBB713@dragondata.com>


On Tue, Mar 09, 2010 at 06:42:02PM -0600, Kevin Day wrote:
>
> On Mar 9, 2010, at 4:27 PM, John Baldwin wrote:
>
> > On Tuesday 09 March 2010 3:40:26 pm Kevin Day wrote:
> >>
> >>
> >> If I boot up on an Opteron 2218 system, it boots normally. If I boot the
> >> exact same VM moved to a 2352, I get:
> >>
> >> acpi0: <INTEL 440BX> on motherboard
> >> PCIe: Memory Mapped configuration base @ 0xe0000000
> >>   (very long pause)
> >> ioapic0: routing intpin 9 (ISA IRQ 9) to lapic 0 vector 48
> >> acpi0: [MPSAFE]
> >> acpi0: [ITHREAD]
> >>
> >> then booting normally.
> >
> > It's probably worth adding some printfs to narrow down where the pause is
> > happening.  This looks to be all during the acpi_attach() routine, so maybe
> > you can start there.
>
> Okay, good pointer. This is what I've narrowed down:
>
> acpi_enable_pcie() calls pcie_cfgregopen(). It's called here with
> pcie_cfgregopen(0xe0000000, 0, 255). Inside pcie_cfgregopen, the pause
> starts here:
>
>         /* XXX: We should make sure this really fits into the direct map. */
>         pcie_base = (vm_offset_t)pmap_mapdev(base, (maxbus + 1) << 20);
>
> pmap_mapdev calls pmap_mapdev_attr, and in there this evaluates to true:
>
>         /*
>          * If the specified range of physical addresses fits within the direct
>          * map window, use the direct map.
>          */
>         if (pa < dmaplimit && pa + size < dmaplimit) {
>
> so we call pmap_change_attr, which calls pmap_change_attr_locked. It's
> changing 0x10000000 bytes starting at 0xffffff00e0000000.  The very last
> line before returning from pmap_change_attr_locked is:
>
>                 pmap_invalidate_cache_range(base, tmpva);
>
> And this is where the delay is. This is calling MFENCE/CLFLUSH in a loop
> 8 million times. We actually had a problem with CLFLUSH causing panics on
> these same CPUs under Xen, which is partially why we're looking at VMware
> now (see kern/138863). I'm wondering if VMware didn't encounter the same
> problem and replace CLFLUSH with a software-emulated version that is far
> slower... based on the speed, it is probably invalidating the entire cache.
> A quick change to pmap_invalidate_cache_range to just clear the entire
> cache if the area being cleared is over 8MB seems to have fixed it, i.e.:
>
>         else if (cpu_feature & CPUID_CLFSH)  {
>
> to
>
>         else if ((cpu_feature & CPUID_CLFSH) && ((eva-sva) < (2<<22))) {
>
>
> However, I'm a little blurry on whether everything leading to this point is
> correct. It's ending up with 256MB of memory for the PCI area, which seems
> really excessive. Is the problem just that it wants room for 256 buses,
> or...? Anyone know this code path well enough to know if this is deviating
> from the norm?

I think the idea of not issuing CLFLUSH in a loop for large regions is
good. We do not extract the L2/L3 cache size now; I suppose that a 2MB
estimate is good for most situations.

commit bbac1632d349d68b905df644656ce9a8e4aed094
Author: Konstantin Belousov <kostik@pooma.home>
Date:   Wed Mar 10 13:07:51 2010 +0200

    Fall back to wbinvd when region for CLFLUSH is >= 2MB.

    Submitted by:	Kevin Day <toasty@dragondata.com>

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 07db5d1..4361be0 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -994,7 +994,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
-	else if (cpu_feature & CPUID_CLFSH) {
+	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
+		 eva - sva < 2 * 1024 * 1024) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
@@ -1011,7 +1012,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
-		 * globally invalidate cache as a last resort.
+		 * or the supplied range is bigger than 2MB.
+		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
diff --git a/sys/i386/i386/pmap.c b/sys/i386/i386/pmap.c
index 4b2e34f..f448071 100644
--- a/sys/i386/i386/pmap.c
+++ b/sys/i386/i386/pmap.c
@@ -996,7 +996,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
-	else if (cpu_feature & CPUID_CLFSH) {
+	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
+		 eva - sva < 2 * 1024 * 1024) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
@@ -1013,7 +1014,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
-		 * globally invalidate cache as a last resort.
+		 * or the supplied range is bigger than 2MB.
+		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
