From owner-svn-src-all@FreeBSD.ORG Wed Aug 12 12:20:08 2009
Date: Wed, 12 Aug 2009 12:18:52 +0000 (UTC)
From: "Bjoern A. Zeeb"
To: src-committers@freebsd.org, svn-src-all@freebsd.org,
	svn-src-stable@freebsd.org, svn-src-stable-other@freebsd.org
In-Reply-To: <200908121032.n7CAWKEM058850@svn.freebsd.org>
Message-ID: <20090812121814.M93661@maildrop.int.zabbadoz.net>
References: <200908121032.n7CAWKEM058850@svn.freebsd.org>
Subject: Re: svn commit: r196119 - in stable/8/sys: .
 amd64/include/xen cddl/contrib/opensolaris contrib/dev/acpica contrib/pf
 dev/ata dev/cxgb dev/sound/usb dev/usb dev/usb/controller dev/usb/input
 dev/usb/mis...

On Wed, 12 Aug 2009, Bjoern A. Zeeb wrote:

> Author: bz
> Date: Wed Aug 12 10:32:20 2009
> New Revision: 196119
> URL: http://svn.freebsd.org/changeset/base/196119
>
> Log:
>   MFC r196118:
>     Put minimum alignment on the dpcpu and vnet section so that ld
>     when adding the __start_ symbol knows the expected section alignment
>     and can place the __start_ symbol correctly.
>
>     These sections will not support symbols with super-cache line alignment
>     requirements.
>
>     For full details, see posting to freebsd-current, 2009-08-10,
>     Message-ID: <20090810133111.C93661@maildrop.int.zabbadoz.net>.
>
>     Debugging and testing patches by:
>       Kamigishi Rei (spambox haruhiism.net), np, lstewart, jhb, kib, rwatson
>     Tested by:    Kamigishi Rei, lstewart
>     Reviewed by:  kib
>
>     Approved by:  re
...

Just as a follow-up here as well:

[ The following samples, etc. generally assume arch=amd64. ]

There are two different kinds of places where we find dynamic per-cpu
(dpcpu) data:

 (1) the so-called 'master copy': a linker set holding the
     compile-time defaults (BSS-initialized).
 (2) a copy for each CPU, copied to allocated memory at run time.

The problem seen was that individual members of the set were
un-aligned at run time.
Dumping the linker set (master copy), things look like this, for example:

ffffffff8168f8e9 g       *ABS*    0000000000000000 __start_set_pcpu
ffffffff8168f900 l    d  set_pcpu 0000000000000000
ffffffff8168f900 g     O set_pcpu 0000000000000068 pcpu_entry_sched_switch_stats
ffffffff8168f980 l     O set_pcpu 0000000000000800 pcpu_entry_modspace
ffffffff81690180 g     O set_pcpu 0000000000000038 pcpu_entry_epair_dpcpu
ffffffff81690200 g     O set_pcpu 0000000000000500 pcpu_entry_nws
ffffffff81690700 g       *ABS*    0000000000000000 __stop_set_pcpu

The members of the linker set (master copy) are all well aligned within
the set; for example pcpu_entry_nws: 0xffffffff81690200 % 128 = 0.

Looking at elfdump output for the kernel, this is also what we would expect:

    entry: 32
        sh_name: set_pcpu
        sh_type: SHT_PROGBITS
        sh_flags: SHF_WRITE|SHF_ALLOC
        sh_addr: 0xffffffff8168f900
        sh_offset: 21559552
        sh_size: 3584
        sh_link: 0
        sh_info: 0
        sh_addralign: 128        <<<<
        sh_entsize: 0

The problem is with __start_set_pcpu, the symbol ld adds to mark the
beginning of the section. The address of __start_set_pcpu is not well
aligned, not even pointer-aligned: 0xffffffff8168f8e9 % 8 = 1.

When now copying the 'master copy' to a dpcpu area, the aligned symbols
become un-aligned. Example:

    dpcpu area starts at 0xffff0000
    |--------+------------------------------------------

Copying the master copy from the objdump above, starting at
__start_set_pcpu, will put __start_set_pcpu at 0xffff0000 but the first
symbol, pcpu_entry_sched_switch_stats, at 0xffff0017:

    0xffff0000
    |--------+------------------------------------------
    |~~~~~~~~|------------------------------------------========
             0xffff0017

Two problems become obvious:
 (1) the symbols are now un-aligned in the per-cpu area.
 (2) due to the offset we did not copy the entire dpcpu area, so some
     symbols at the end might not have been properly initialized.

While (2) may lead to run-time problems, it usually is not a memory
corruption problem, as the dpcpu area is usually allocated in pages.
So unless the dpcpu symbols end page-aligned, there should be no
corruption.

(1), in contrast, may lead to other effects, like a lock spanning
multiple cache lines, thus no longer permitting atomic reads and being
open to races. The results are panics with truncated pointers trying to
access invalid memory regions, etc. On architectures like arm this will
fault immediately, as arm does not allow un-aligned reads.

So one solution to the problem would be to make sure we allocate enough
memory to also account for the offset to proper alignment, and then copy
the 'master copy' to a possibly unaligned address as well, making the
symbols properly aligned again:

    dpcpu area starts at 0xffff0000
    |--------+---------+------------------------------------...
    |*** unused *******|~~~~~~~~|---------------------------...
              0xffff0069        0xffff0080

In this sample __start_set_pcpu would be at 0xffff0069 and
pcpu_entry_sched_switch_stats at 0xffff0080, and thus properly aligned
again. With this, things will work.

Looking further at the problem, you may have already noticed that the
section for the 'master copy' starts at 0xffffffff8168f900 and that
__start_set_pcpu is outside of that section, at 0xffffffff8168f8e9.
Looking at a section dump from `readelf -S kernel` you would notice that
__start_set_pcpu directly follows the end of the previous section.

The reasons for this are twofold:
 (1) ld places the __start_ symbol at '.' (the location counter), which
     at that point is at the end of the old section, as the new one
     (set_pcpu) is not yet added; effectively __start_set_pcpu = ALIGN(0).
 (2) because we manually define the section (so that it is writeable),
     ld, at the point of writing the __start_ symbol, does not yet have
     any section alignment information. That is the reason for the
     ALIGN(0) in (1).

An expected behaviour would be for ld to put the *ABS* symbol at the
address where the section begins, once known, or to fix up the address.
This could arguably be a bug in ld that we should fix post-8.0.
One possible workaround would be to force the __start_ symbol and the
section to be equally aligned, and thus on the same address, using
linker scripts. The drawbacks are that we would need to touch the
fragile linker scripts for each of the sections we add, and for all
architectures individually. As the enforcement of alignment would live
in a different place than the actual set creation, adding the alignment
might easily be forgotten. The advantage would be that we could always
be sure that __start_ is on the same address where the section starts.

Another solution is to put a minimum alignment on the objects inside
the section, in a way that keeps it in a single place in the source
code. The section directive in the respective header file, which is
included by each implementation file, is the ideal place for this.
While cache-line alignment seems to be the widest alignment restriction
currently in use, one drawback, as with the ldscript solution above, is
that a symbol could enforce a wider alignment restriction onto the
section, making the __start_ symbol and the section beginning diverge
again. Example:

    0xffffffff8168f700 __start_set_pcpu
    0xffffffff8168f800 set_pcpu
    0xffffffff8168f800 pcpu_entry_sched_switch_stats

.. if we put an alignment of 1024 on pcpu_entry_sched_switch_stats.
This is unlikely to happen.

With the minimum alignment, ld, at the time of placing the __start_
symbol, already knows about the section alignment and will place it
correctly on the section beginning, doing:
__start_set_pcpu = ALIGN(CACHE_LINE_SHIFT) at ".".

Summary: the minimum alignment seems to be the least intrusive solution
and is believed to work for the moment. In addition, we document that
the dpcpu and similar sections will not support super-cache line
alignment. The long-term solution would be to fix ld to DTRT.
> Modified: stable/8/sys/net/vnet.h
> ==============================================================================
> --- stable/8/sys/net/vnet.h	Wed Aug 12 10:26:03 2009	(r196118)
> +++ stable/8/sys/net/vnet.h	Wed Aug 12 10:32:20 2009	(r196119)
> @@ -185,12 +185,14 @@ extern struct sx vnet_sxlock;
>   * Virtual network stack memory allocator, which allows global variables to
>   * be automatically instantiated for each network stack instance.
>   */
> +__asm__(
>  #if defined(__arm__)
> -__asm__(".section " VNET_SETNAME ", \"aw\", %progbits");
> +    ".section " VNET_SETNAME ", \"aw\", %progbits\n"
>  #else
> -__asm__(".section " VNET_SETNAME ", \"aw\", @progbits");
> +    ".section " VNET_SETNAME ", \"aw\", @progbits\n"
>  #endif
> -__asm__(".previous");
> +    "\t.p2align " __XSTRING(CACHE_LINE_SHIFT) "\n"
> +    "\t.previous");
>
>  #define VNET_NAME(n)		vnet_entry_##n
>  #define VNET_DECLARE(t, n)	extern t VNET_NAME(n)
>
> Modified: stable/8/sys/sys/pcpu.h
> ==============================================================================
> --- stable/8/sys/sys/pcpu.h	Wed Aug 12 10:26:03 2009	(r196118)
> +++ stable/8/sys/sys/pcpu.h	Wed Aug 12 10:32:20 2009	(r196119)
> @@ -56,12 +56,14 @@ struct thread;
>  extern uintptr_t *__start_set_pcpu;
>  extern uintptr_t *__stop_set_pcpu;
>
> +__asm__(
>  #if defined(__arm__)
> -__asm__(".section set_pcpu, \"aw\", %progbits");
> +    ".section set_pcpu, \"aw\", %progbits\n"
>  #else
> -__asm__(".section set_pcpu, \"aw\", @progbits");
> +    ".section set_pcpu, \"aw\", @progbits\n"
>  #endif
> -__asm__(".previous");
> +    "\t.p2align " __XSTRING(CACHE_LINE_SHIFT) "\n"
> +    "\t.previous");
>
>  /*
>   * Array of dynamic pcpu base offsets. Indexed by id.

-- 
Bjoern A. Zeeb
The greatest risk is not taking one.