Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 23 Jul 2012 12:24:34 -0700
From:      Adrian Chadd <adrian.chadd@gmail.com>
To:        lev@freebsd.org
Cc:        wireless@freebsd.org
Subject:   Re: ath0: device timeout with 802.11n client
Message-ID:  <CAJ-Vmonq%2Bwef2mVv%2BY6P6TsYpkwKRDcNRh=467xA999pTrFUXw@mail.gmail.com>
In-Reply-To: <1468023399.20120723175744@serebryakov.spb.ru>
References:  <298155894.20120709002844@serebryakov.spb.ru> <CAJ-VmomWfbTZG1z_iEDEXWmFTWSSfDpPJZ0iFFHOio8eGwuUhg@mail.gmail.com> <1491344515.20120709013411@serebryakov.spb.ru> <CAJ-VmokFcHSx9bQtRMFFx5EGXE=P2Zj-omZ266nNfTVd-DELWw@mail.gmail.com> <1252921508.20120709132351@serebryakov.spb.ru> <CAJ-VmonDzZEBubrbTxMbj32X024orbgbRHFz7FpchrHuJigbkg@mail.gmail.com> <802688919.20120709140808@serebryakov.spb.ru> <CAJ-Vmom-s=4sVZ-Y1VJLXMouz3n48PLSdqHJ8ghH6iXyjFUaTg@mail.gmail.com> <1966637204.20120713194935@serebryakov.spb.ru> <CAJ-VmomBznMGZWyfaqibT9HXPGr1ORNaSG838aLPDeY%2B0smoew@mail.gmail.com> <5862675.20120713203432@serebryakov.spb.ru> <CAJ-Vmok%2BtQULYHy3=pqgHTfK-oEfBNV7OMugzOuAosC_Msha9g@mail.gmail.com> <CAJ-Vmon2-1d7dg%2BeDDz4Dpnq_FdW_U24=xdW5yOKKg=NeUsveQ@mail.gmail.com> <7750188.20120716011053@serebryakov.spb.ru> <CAJ-Vmom70sNoRKdVPSH3Aqxi6r=z7BDK6x_ydTEtygunYrLrVA@mail.gmail.com> <CAJ-Vmo=DOPr9PKOfcF4%2BfHXYzSbDEmrTt4iK9NP4UL=akycW5g@mail.gmail.com> <1214412854.20120722020752@serebryakov.spb.ru> <CAJ-Vmo=SuA0pqJdSAS6eKKB1HiXcUi1mxmo7BY2XyK-PZZWccQ@mail.gmail.com> <1468023399.20120723175744@serebryakov.spb.ru>

next in thread | previous in thread | raw e-mail | index | archive | help
.. and is that "bar_unsuspend" the first line to appear in your
debugging output?

Time to braindump what's going on in this aggregation/BAR path:

* the TID pause/unpause only occurs for a few things - negotiating
aggregation, tearing down aggregation, sending BAR;
* and it's refcounted, because it's entirely possible multiple things
will occur (eg sending BAR overlapping with tearing down aggregation);
* So I need to try and establish whether this is something specific to
say, just BAR handling, or whether it's overlapping with a call to
ath_addba_stop() which will call ath_tx_tid_cleanup() to "clean" up
the aggregation frames in preparation for dropping back to
non-aggregation.

Now - ath_addba_stop() is called when:

* the station re-associates for whatever reason, as the state needs to
be torn down and re-established;
* the station times out and the state is being freed.

So maybe some overlapping traffic and BAR TX is occuring, whilst the
station re-associates.

Finally, the way the cleanup path works is a bit hokey:

* the caller pauses the TID;
* it then calls ath_tx_tid_cleanup();
* ath_tx_tid_cleanup() then walks the list of frames that it thinks
are in the hardware queue (don't ask why it does this - I stole it
from Linux/Atheros code, before I just added a per-TID hardware queue
counter) and it marks the TID as "cleanup in progress".
* The TID is then only unpaused once all the hardware-queued frames
have completed (ie, when incomp is decremented and equal to 0.)
* .. that occurs in either ath_tx_comp_cleanup_aggr() and
ath_tx_comp_cleanup_unaggr().

Now, incomp is 0, so it's unlikely due to something being stuck in the
TX queue or the math calculating the queue depth being wrong.

The interesting warning there is the debugging from
ath_tx_tid_bar_unsuspend() - that shouldn't be called with both of
those flags equal to 0. There's only a few places it's called from.
There aren't many places where it could even _be_ set to 0. However,
there's no locking in the net80211 stack for handling BAR, so I have
to compensate by doing locking and workarounds in the TX / BAR path in
the driver. So it's entirely possible I've missed something there.

I think I recall one instance where aggregation was being attempted
concurrently, multiple times, but only calling the completion routine
once. Eek.

Anyway, I hope that braindump helps you or someone else in potentially
going for a dive through the code.

Oh, one last thing - I actually have a local kernel modification to
device_printf() in sys/kern/subr_bus.c (I think, just grep for it)
which prints out the current thread ID before any device output. It
makes concurrent/overlapping/racy behaviour _really_ obvious, as they
come from separate thread IDs. You may wish to hack that in to your
kernel and see if any of the debug statements are coming from
different threads.

Oh, and another last thing - you could also turn on association
debugging (wlandebug -i wlanX +assoc) to capture whether the station
is reassociating constantly. That'll help me out too.

Thanks!



Adrian

On 23 July 2012 06:57, Lev Serebryakov <lev@freebsd.org> wrote:
> Hello, Adrian.
> You wrote 22 =D0=B8=D1=8E=D0=BB=D1=8F 2012 =D0=B3., 2:10:42:
>
>>> AC> Hm. With debuggimg it works, but turning it off causes timeouts?
>>>   I'll turn it off tomorrow and will see
> AC> Thanks. If that's the case, it's quite likely a timing / race
> AC> condition bug and even your debugging logs as they are will be very
> AC> helpful.
>  Again (without debug):
>
> ath0: ath_tx_tid_bar_unsuspend: bar_tx=3D0, bar_wait=3D0: ?
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: device timeout
> ath0: ath_tx_tid_drain: node 0xc404b000: bf=3D0xc34fda60: addbaw=3D0, dob=
aw=3D0, seqno=3D3113, retry=3D0
> ath0: ath_tx_tid_drain: node 0xc404b000: bf=3D0xc34fda60: tid txq_depth=
=3D503 hwq_depth=3D0, bar_wait=3D0
> ath0: ath_tx_tid_drain: node 0xc404b000: tid 0: txq_depth=3D1, txq_aggr_d=
epth=3D0, sched=3D0, paused=3D-1, hwq_depth=3D0, incomp=3D0, baw_head=3D111=
, baw_tail=3D111 txa_start=3D3113, ni_txseqs=3D5107
> FRDS 00:0c:42:64:69:92->70:d4:f2:ad:6a:26(00:0c:42:64:69:92) data QoS [TI=
D 0] WEP [IV 29 0c 00 00 00 00 KID 0] 0M
>  8862 0000 70d4 f2ad 6a26 000c 4264 6992 000c 4264 6992 90c2 0000 0101 29=
0c 0020 0000 0000 aaaa 0300 0000 0800
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: stuck beacon; resetting (bmiss count 4)
> ath0: stuck beacon; resetting (bmiss count 4)
>
>
> --
> // Black Lion AKA Lev Serebryakov <lev@FreeBSD.org>
>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-Vmonq%2Bwef2mVv%2BY6P6TsYpkwKRDcNRh=467xA999pTrFUXw>