Date:      Thu, 10 May 2018 19:54:36 +0200
From:      Harry Schmalzbauer <freebsd@omnilan.de>
To:        Stephen Hurd <shurd@llnw.com>
Cc:        "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, Stephen Hurd <shurd@freebsd.org>, Kevin Bowling <kevin.bowling@kev009.com>
Subject:   Re: iflib-if_em tests with HEAD and lagg panic [Was: Re: svn commit: r333338 - in stable/11/sys: dev/bnxt kern net sys]
Message-ID:  <5AF4875C.5000201@omnilan.de>
In-Reply-To: <CAGK_Ob1XR_D_B=vXeHtMQwHA2yXhhWPfMtpwHKwDbGfoWgaOVw@mail.gmail.com>
References:  <201805072142.w47LgN1R041002@repo.freebsd.org> <5AF16B8B.7030703@omnilan.de> <CAK7dMtBkCvLgPVnsf%2BECcrdbKNvOShONeZ=vqvg3dJ5ZeuoP5w@mail.gmail.com> <5AF17134.7020602@omnilan.de> <CAK7dMtB3V1F=2AxtsbUznn5DO81G3Zkh9UYiN3eWkyOfV_CYmg@mail.gmail.com> <5AF1CF0F.4040909@omnilan.de> <65972f0d-2873-42ea-464c-a3db543abafb@freebsd.org> <5AF1E073.5010701@omnilan.de> <CAGK_Ob1XR_D_B=vXeHtMQwHA2yXhhWPfMtpwHKwDbGfoWgaOVw@mail.gmail.com>

Regarding Stephen Hurd's message of 08.05.2018 20:58 (localtime):
> Can you test the review here: https://reviews.freebsd.org/D15355
> 
> It looks like there are two different locks protecting the same data
> everywhere but in lagg_ioctl().  This is a rough first-pass, and there may
> be some lingering recursion and performance regressions with it.
> 
…
>> Sleeping on "e1000_delay" with the following non-sleepable locks held:
>> exclusive rm if_lagg rmlock (if_lagg rmlock) r = 0 (0xfffff80014228c08)
>> locked @ /usr/src/sys/net/if_lagg.c:1433
>> stack backtrace:
>> #0 0xffffffff80701113 at witness_debugger+0x73
>> #1 0xffffffff807024f1 at witness_warn+0x461
>> #2 0xffffffff806a42cc at _sleep+0x6c
>> #3 0xffffffff806a4b34 at pause_sbt+0x144
>> #4 0xffffffff80440e21 at e1000_write_phy_reg_mdic+0xf1
>> #5 0xffffffff804446bf at e1000_enable_phy_wakeup_reg_access_bm+0x2f
>> #6 0xffffffff80432e0a at e1000_update_mc_addr_list_pch2lan+0x3a
>> #7 0xffffffff8041408f at em_if_multi_set+0x1bf
>> #8 0xffffffff807bc02e at iflib_if_ioctl+0xfe
>> #9 0xffffffff82111a15 at lagg_ioctl+0x115
>> #10 0xffffffff807dd348 at inm_release_task+0x218
>> #11 0xffffffff806dea29 at gtaskqueue_run_locked+0x139
>> #12 0xffffffff806de7a8 at gtaskqueue_thread_loop+0x88
>> #13 0xffffffff80659d84 at fork_exit+0x84
>> #14 0xffffffff809b767e at fork_trampoline+0xe
>> Sleeping thread (tid 100017, pid 0) owns a non-sleepable lock
>> KDB: stack backtrace of thread 100017:
>> sched_switch() at sched_switch+0x945/frame 0xfffffe00750dc5d0
>> mi_switch() at mi_switch+0x18c/frame 0xfffffe00750dc600
>> sleepq_switch() at sleepq_switch+0x10d/frame 0xfffffe00750dc640
>> sleepq_timedwait() at sleepq_timedwait+0x50/frame 0xfffffe00750dc680
>> _sleep() at _sleep+0x307/frame 0xfffffe00750dc730
>> pause_sbt() at pause_sbt+0x144/frame 0xfffffe00750dc780
>> e1000_write_phy_reg_mdic() at e1000_write_phy_reg_mdic+0xf1/frame
>> 0xfffffe00750dc7c0
>> e1000_enable_phy_wakeup_reg_access_bm() at
>> e1000_enable_phy_wakeup_reg_access_bm+0x2f/frame 0xfffffe00750dc7e0
>> e1000_update_mc_addr_list_pch2lan() at
>> e1000_update_mc_addr_list_pch2lan+0x3a/frame 0xfffffe00750dc820
>> em_if_multi_set() at em_if_multi_set+0x1bf/frame 0xfffffe00750dc870
>> iflib_if_ioctl() at iflib_if_ioctl+0xfe/frame 0xfffffe00750dc8e0
>> lagg_ioctl() at lagg_ioctl+0x115/frame 0xfffffe00750dc990
>> inm_release_task() at inm_release_task+0x218/frame 0xfffffe00750dc9f0
>> gtaskqueue_run_locked() at gtaskqueue_run_locked+0x139/frame
>> 0xfffffe00750dca40
>> gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0x88/frame
>> 0xfffffe00750dca70
>> fork_exit() at fork_exit+0x84/frame 0xfffffe00750dcab0
>> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00750dcab0
>> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>> panic: sleeping thread
>> cpuid = 3
>> time = 1525794682
>> KDB: stack backtrace:
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
>> 0xfffffe008fe180e0
>> vpanic() at vpanic+0x1a3/frame 0xfffffe008fe18140
>> panic() at panic+0x43/frame 0xfffffe008fe181a0
>> propagate_priority() at propagate_priority+0x335/frame 0xfffffe008fe181e0
>> turnstile_wait() at turnstile_wait+0x38d/frame 0xfffffe008fe18230
>> __mtx_lock_sleep() at __mtx_lock_sleep+0x1e1/frame 0xfffffe008fe182b0
>> __mtx_lock_flags() at __mtx_lock_flags+0xf9/frame 0xfffffe008fe18300
>> _rm_rlock() at _rm_rlock+0x280/frame 0xfffffe008fe18330
>> _rm_rlock_debug() at _rm_rlock_debug+0x14c/frame 0xfffffe008fe18380
>> lagg_transmit() at lagg_transmit+0x38/frame 0xfffffe008fe183f0
>> ether_output_frame() at ether_output_frame+0xaa/frame 0xfffffe008fe18420
>> ether_output() at ether_output+0x68b/frame 0xfffffe008fe184c0
>> arprequest() at arprequest+0x474/frame 0xfffffe008fe185c0
>> arp_ifinit() at arp_ifinit+0x58/frame 0xfffffe008fe18600
>> ether_ioctl() at ether_ioctl+0x1d1/frame 0xfffffe008fe18630
>> lagg_ioctl() at lagg_ioctl+0x602/frame 0xfffffe008fe186e0
>> in_control() at in_control+0x8f5/frame 0xfffffe008fe18780
>> ifioctl() at ifioctl+0x19c6/frame 0xfffffe008fe18850
>> kern_ioctl() at kern_ioctl+0x2b9/frame 0xfffffe008fe188b0
>> sys_ioctl() at sys_ioctl+0x168/frame 0xfffffe008fe18980
>> amd64_syscall() at amd64_syscall+0x2cc/frame 0xfffffe008fe18ab0
>> fast_syscall_common() at fast_syscall_common+0x101/frame
>> 0xfffffe008fe18ab0
>> --- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x8004820ba, rsp =
>> 0x7fffffffe1c8, rbp = 0x7fffffffe210 ---
>> KDB: enter: panic

I can confirm that the D15355 version I tested eliminates that panic.
Also no LOR with em0+em1 as laggports.
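
For readers unfamiliar with the rule behind the quoted trace, here is a toy
model of it. This is only a sketch and not FreeBSD's WITNESS code; the class
and method names are hypothetical. It assumes what the traces in this message
show: the if_lagg rmlock is non-sleepable, the iflib ctx lock is taken via
_sx_xlock and is sleepable, and a thread must neither sleep nor take a
sleepable lock while it holds a non-sleepable one.

```python
# Toy model only: NOT FreeBSD's WITNESS implementation.
# ToyThread and its methods are hypothetical names used for illustration.
class ToyThread:
    def __init__(self, name):
        self.name = name
        self.held = []  # (lock_name, sleepable) pairs in acquisition order

    def _non_sleepable_held(self):
        return [n for n, sleepable in self.held if not sleepable]

    def acquire(self, lock_name, sleepable):
        # Taking a sleepable lock while a non-sleepable one is held is the
        # "sleepable after non-sleepable" order reversal.
        blockers = self._non_sleepable_held()
        if sleepable and blockers:
            raise RuntimeError(
                f"LOR: sleepable {lock_name} after non-sleepable {blockers[0]}")
        self.held.append((lock_name, sleepable))

    def release(self, lock_name):
        self.held = [(n, s) for n, s in self.held if n != lock_name]

    def sleep(self, wmesg):
        # Sleeping is only allowed when no non-sleepable lock is held.
        blockers = self._non_sleepable_held()
        if blockers:
            raise RuntimeError(
                f'Sleeping on "{wmesg}" with non-sleepable lock held: {blockers[0]}')


t = ToyThread("taskqueue")
t.acquire("if_lagg rmlock", sleepable=False)
try:
    t.sleep("e1000_delay")
except RuntimeError as err:
    print(err)
```

Releasing the rmlock before calling into the driver makes both checks pass in
this model. Whether D15355 takes that route or another one is not something
this sketch claims.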

From the kawela report:
> Regarding Kevin Bowling's message of 08.05.2018 11:52 (localtime):
>> On Tue, May 8, 2018 at 2:43 AM, Harry Schmalzbauer <freebsd@omnilan.de> wrote:
> …
>>> But if the simple iflib/hw-support test with kawela+hartwell helps I'm
>>> happy to do.
>>
>> At this point it would be helpful, we think e1000 is nearing pretty
>> good shape and I need to become familiar with any outstanding bugs.
>
> Here's the results for kawela (82576) which, to my surprise, still shows
> up as "igb" – I thought it would be "emX".

…
> Running simple NFS4 copies with all offloading bells and whistles
> enabled and MTU 9000 work fine (over IPv6 and LACP) at full line rate.
>
> Only one LACP LOR (no panic as with em0+em1 lagg, where I saw pages full
> of LORs):
> lock order reversal: (sleepable after non-sleepable)
>  1st 0xfffff80002bc9208 if_lagg rmlock (if_lagg rmlock) @
> /usr/src/sys/net/if_lagg.c:1433
>  2nd 0xfffff80002c04550 iflib ctx lock (iflib ctx lock) @
> /usr/src/sys/net/iflib.c:3999
> stack backtrace:
> #0 0xffffffff80701113 at witness_debugger+0x73
> #1 0xffffffff80700f94 at witness_checkorder+0xe34
> #2 0xffffffff806a26a8 at _sx_xlock+0x68
> #3 0xffffffff807bbfbc at iflib_if_ioctl+0x8c
> #4 0xffffffff8079e5f4 at if_addmulti+0x264
> #5 0xffffffff821144a8 at lagg_setmulti+0x108
> #6 0xffffffff82111a28 at lagg_ioctl+0x128
> #7 0xffffffff8079e5f4 at if_addmulti+0x264
> #8 0xffffffff807d8b7e at in_joingroup_locked+0x1ce
> #9 0xffffffff807d8982 at in_joingroup+0x42
> #10 0xffffffff807d47cb at in_control+0x93b
> #11 0xffffffff8079d656 at ifioctl+0x19c6
> #12 0xffffffff807068c9 at kern_ioctl+0x2b9
> #13 0xffffffff80706598 at sys_ioctl+0x168
> #14 0xffffffff809dab2c at amd64_syscall+0x2cc
> #15 0xffffffff809b71ad at fast_syscall_common+0x101

This LOR (igb0+igb1 as laggports) also vanished with the D15355 version
I tested.
Please excuse that I'm not familiar with Phabricator; I just used
"raw diff download" after briefly skimming the comments.
According to st_mtime, the download happened on May 9th at 08:14:02 UTC
(10:14 local time, CEST).
I don't know which timezone Phabricator shows me; most likely it respects
local time, which would mean the latest revision was part of my test, but
I'm not sure.
Shall I redo the test?
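
To make the timestamp comparison reproducible, here is a minimal sketch of
how the download time can be rendered in both UTC and local time. It assumes
CEST can be modeled as a fixed UTC+2 offset; describe_mtime is a hypothetical
helper name, and the epoch value is derived from the time quoted in this
message rather than read from the file.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST modeled as a fixed UTC+2 offset for this one date;
# a general tool would consult the system time zone database instead.
CEST = timezone(timedelta(hours=2), "CEST")


def describe_mtime(epoch_seconds):
    """Render one instant (e.g. a file's st_mtime) in UTC and in CEST."""
    utc = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    local = utc.astimezone(CEST)
    return (utc.strftime("%Y-%m-%d %H:%M:%S UTC"),
            local.strftime("%Y-%m-%d %H:%M:%S CEST"))


# The download time quoted in this message, expressed as seconds since the epoch.
downloaded = datetime(2018, 5, 9, 8, 14, 2, tzinfo=timezone.utc).timestamp()
print(describe_mtime(downloaded))
```

For an actual file, os.stat(path).st_mtime supplies the epoch value. Printing
both renderings side by side makes it easier to match against whichever
timezone the review page displays.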

-harry
