Date:      Mon, 10 Aug 2009 13:39:16 +0000 (UTC)
From:      "Bjoern A. Zeeb" <bz@FreeBSD.org>
To:        Robert Watson <rwatson@FreeBSD.org>
Cc:        Jeff Roberson <jeff@FreeBSD.org>, freebsd-current@freebsd.org, kib@FreeBSD.org, Navdeep Parhar <np@FreeBSD.org>, Larry Rosenman <ler@lerctr.org>, lstewart@FreeBSD.org
Subject:   Re: reproducible panic in netisr
Message-ID:  <20090810133111.C93661@maildrop.int.zabbadoz.net>
In-Reply-To: <alpine.BSF.2.00.0908061508520.62916@fledge.watson.org>
References:  <20090804225806.GA54680@hub.freebsd.org> <20090805054115.O93661@maildrop.int.zabbadoz.net> <20090805063417.GA10969@doormat.home> <alpine.BSF.2.00.0908060011490.59996@fledge.watson.org> <alpine.BSF.2.00.0908060834120.21318@thebighonker.lerctr.org> <alpine.BSF.2.00.0908061508520.62916@fledge.watson.org>

On Thu, 6 Aug 2009, Robert Watson wrote:

Hi,

> On Thu, 6 Aug 2009, Larry Rosenman wrote:
>
>> On Thu, 6 Aug 2009, Robert Watson wrote:
>> 
>>> On Tue, 4 Aug 2009, Navdeep Parhar wrote:
>>> 
>>>>>> This occurs on today's HEAD + some unrelated patches.  That makes it 
>>>>>> 8.0BETA2+ code.  I haven't tried older builds.
>>>>> 
>>>>> We have finally been able to reproduce this ourselves yesterday and
>>>> 
>>>> Well, it happens every single time on all of my amd64 machines. After I'd 
>>>> already sent my email I noticed that the netisr mutex has an odd address 
>>>> (pun intended :-))
>>>> 
>>>> m=0xffffffff8144d867
>>> 
>>> Heh, indeed.  We just spotted the same result here.  In this case it's 
>>> causing a panic because it leads to a non-atomic read due to mtx_lock 
>>> spanning a cache line boundary, followed shortly by a panic because it's 
>>> not a valid thread pointer when it's dereferenced, as we get a fractional 
>>> pointer.
>> [snip]
>> 
>> Do we have an ETA for a testable patch?
>
> RSN, I'm afraid.  ...

You can fetch the following from .. as well:
http://people.freebsd.org/~bz/20090809-02-pcpu-start-align-fix.diff


I have tried to summarize things a bit, as well as possible from my
pov/knowledge:

------------------------------------------------------------------------
[ The following samples, etc. generally assume arch=amd64: ]

There are two different kinds of places where we find dynamic per-cpu
(dpcpu) data:
(1) the so-called 'master copy': a linker set which holds the
     compile-time defaults (data/BSS).
(2) a copy for each CPU, copied to allocated memory.

The problem observed is that individual members of the set were
un-aligned at run-time.

Dumping the linker set (master copy), things look like this for example:

ffffffff8168f8e9 g       *ABS*  0000000000000000 __start_set_pcpu
ffffffff8168f900 l    d  set_pcpu       0000000000000000 
ffffffff8168f900 g     O set_pcpu       0000000000000068 pcpu_entry_sched_switch_stats
ffffffff8168f980 l     O set_pcpu       0000000000000800 pcpu_entry_modspace
ffffffff81690180 g     O set_pcpu       0000000000000038 pcpu_entry_epair_dpcpu
ffffffff81690200 g     O set_pcpu       0000000000000500 pcpu_entry_nws
ffffffff81690700 g       *ABS*  0000000000000000 __stop_set_pcpu

The members of the linker set (master copy) are all well aligned within
the set: for example pcpu_entry_nws: 0xffffffff81690200 % 128 = 0

Looking at elfdump for the kernel this is also what we would expect:
entry: 32
         sh_name: set_pcpu
         sh_type: SHT_PROGBITS
         sh_flags: SHF_WRITE|SHF_ALLOC
         sh_addr: 0xffffffff8168f900
         sh_offset: 21559552
         sh_size: 3584
         sh_link: 0
         sh_info: 0
         sh_addralign: 128                 <<<<
         sh_entsize: 0

The problem is with __start_set_pcpu, the symbol ld adds to mark
the beginning of the section. The address of __start_set_pcpu is
not well-aligned, not even pointer-aligned:
0xffffffff8168f8e9 % 8 = 1.

When the 'master copy' is now copied to a dpcpu area, the previously
aligned symbols become un-aligned.

Example:

dpcpu area starts at
0xffff0000
     |--------+------------------------------------------

Copying the master copy from the objdump above, starting at
__start_set_pcpu, will put __start_set_pcpu at 0xffff0000 but
the first symbol, pcpu_entry_sched_switch_stats, at 0xffff0017.

0xffff0000
     |--------+------------------------------------------
     |~~~~~~~~|------------------------------------------========
          0xffff0017

Two problems become obvious:
(1) the symbols are now un-aligned in the per-cpu area.
(2) due to the offset we did not copy the entire dpcpu area, so
     some symbols at the end might not have been properly initialized.

While (2) may lead to run-time problems, it usually is not a problem
of memory corruption, as the dpcpu area is usually allocated in full
pages. So unless the dpcpu symbols end exactly page-aligned, there
should be no corruption.

(1), in contrast, may lead to other effects, like a lock spanning
multiple cache lines, thus no longer permitting atomic reads and
being open to races. The results are panics with truncated pointers
trying to access invalid memory regions, etc. On architectures
like arm, this will immediately fault, as arm does not allow
un-aligned reads.


So one solution to the problem would be to allocate enough memory
to also account for the offset needed for proper alignment, and then
copy the 'master copy' to a possibly unaligned start address, making
the symbols properly aligned again:

dpcpu area starts at
0xffff0000
     |--------+---------+------------------------------------...
     |*** unused *******|~~~~~~~~|---------------------------...
                0xffff0069       |
                            0xffff0080

In this sample __start_set_pcpu would be at 0xffff0069 and
pcpu_entry_sched_switch_stats at 0xffff0080, and thus properly
aligned again.  With this, things will work.



Looking further at the problem, you may have already noticed that
the section for the 'master copy' starts at 0xffffffff8168f900
and that __start_set_pcpu is outside of that section, at
0xffffffff8168f8e9.
Looking at a section dump from `readelf -S kernel` you would
notice that __start_set_pcpu directly follows the end of the
previous section.

The reasons for this are twofold:
(1) ld places the __start_ symbol at '.' (the location counter),
which at that point is still at the end of the previous section, as
the new one (set_pcpu) has not yet been added; effectively
__start_set_pcpu = ALIGN(0).
(2) because we define the section manually, so that it is
writeable, ld does not yet have any section alignment information
at the point of writing the __start_ symbol. That is the reason
for the ALIGN(0) in (1).

The expected behaviour would be for ld to put the *ABS* symbol at
the address where the section begins, once known, or to fix up the
address later. This could arguably be a bug in ld that we should
fix post-8.0.

One possible workaround would be to force the __start_ symbol
and the section to be equally aligned, and thus on the same address,
using linker scripts.  The drawbacks are that we would need to touch
the fragile linker scripts for each of the sections we add, and
for all architectures individually.  As the enforcement of alignment
would be in a different place from the actual set creation, putting
the alignment in might easily be forgotten.
The advantage would be that we could always be sure that __start_
is on the same address where the section starts.

Another solution is to put a minimum alignment on the objects inside
the section, in a way that it lives in only a single place in the
source code.  The section directive in the respective header file,
which will be included by each implementation file, is the ideal
place for this.
While cache line alignment seems to be the widest alignment
restriction currently in use, one drawback, as with the ldscript
solution above, is that a symbol could possibly enforce a wider
alignment restriction onto the section, making the __start_ symbol
and the section beginning diverge again. Example:
    0xffffffff8168f700  __start_set_pcpu
    0xffffffff8168f800  set_pcpu
    0xffffffff8168f800  pcpu_entry_sched_switch_stats
    ..
if we were to put an alignment of 1024 on pcpu_entry_sched_switch_stats.
This is unlikely to happen.

With the minimum alignment, ld, at the time of placing the
__start_ symbol, already knows about the section alignment
and will place it correctly at the section beginning, doing:
__start_set_pcpu = ALIGN(CACHE_LINE_SHIFT) at ".".


Summary: the minimum alignment seems to be the least-intrusive
solution and is believed to work for the moment. In addition, we
document that the dpcpu and similar sections will not support
super-cache line alignment.
The long-term solution would be to fix ld to DTRT.


------------------------------------------------------------------------
!
! Put minimum alignment on the dpcpu and vnet section so that ld
! when adding the __start_ symbol knows the expected section alignment
! and can place the __start_ symbol correctly.
!
! These sections will not support symbols with super-cache line alignment
! requirements.
!
! See posting <Msg-ID>, 2009-08-10, to freebsd-current for full
! details.
!
! Debugging and testing patches by:
!    Kamigishi Rei (spambox haruhiism.net), np, lstewart, jhb,
!    kib, rwatson
! Tested by:	Kamigishi Rei, lstewart
! Reviewed by:	kib
! Approved by: 
! 
Index: sys/sys/pcpu.h
===================================================================
--- sys/sys/pcpu.h	(revision 196086)
+++ sys/sys/pcpu.h	(working copy)
@@ -56,12 +56,14 @@
 extern uintptr_t *__start_set_pcpu;
 extern uintptr_t *__stop_set_pcpu;
 
+__asm__(
 #if defined(__arm__)
-__asm__(".section set_pcpu, \"aw\", %progbits");
+	".section set_pcpu, \"aw\", %progbits\n"
 #else
-__asm__(".section set_pcpu, \"aw\", @progbits");
+	".section set_pcpu, \"aw\", @progbits\n"
 #endif
-__asm__(".previous");
+	"\t.p2align " __XSTRING(CACHE_LINE_SHIFT) "\n"
+	"\t.previous");
 
 /*
  * Array of dynamic pcpu base offsets.  Indexed by id.
Index: sys/net/vnet.h
===================================================================
--- sys/net/vnet.h	(revision 196086)
+++ sys/net/vnet.h	(working copy)
@@ -185,12 +185,14 @@
  * Virtual network stack memory allocator, which allows global variables to
  * be automatically instantiated for each network stack instance.
  */
+__asm__(
 #if defined(__arm__)
-__asm__(".section " VNET_SETNAME ", \"aw\", %progbits");
+	".section " VNET_SETNAME ", \"aw\", %progbits\n"
 #else
-__asm__(".section " VNET_SETNAME ", \"aw\", @progbits");
+	".section " VNET_SETNAME ", \"aw\", @progbits\n"
 #endif
-__asm__(".previous");
+	"\t.p2align " __XSTRING(CACHE_LINE_SHIFT) "\n"
+	"\t.previous");
 
 #define	VNET_NAME(n)		vnet_entry_##n
 #define	VNET

------------------------------------------------------------------------

-- 
Bjoern A. Zeeb                      The greatest risk is not taking one.


