From owner-freebsd-wireless@FreeBSD.ORG Wed Mar 26 14:48:43 2014 Return-Path: Delivered-To: freebsd-wireless@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 8C097A05 for ; Wed, 26 Mar 2014 14:48:43 +0000 (UTC) Received: from mail-qc0-x22d.google.com (mail-qc0-x22d.google.com [IPv6:2607:f8b0:400d:c01::22d]) (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits)) (No client certificate requested) by mx1.freebsd.org (Postfix) with ESMTPS id 4F6FE8E9 for ; Wed, 26 Mar 2014 14:48:43 +0000 (UTC) Received: by mail-qc0-f173.google.com with SMTP id r5so2616539qcx.4 for ; Wed, 26 Mar 2014 07:48:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:date:message-id:subject:from:to:content-type; bh=zc3qj9KUTD5t4k7CXlMpc6HAzV62Nxo9cDUrVrR9n7A=; b=Md5ACj3ezySUMBIhtHOAm9nwnlcE2N98zytxiGbCs+MBBjNaqP/NX3odW9z9kHdmRy IHouzjOxmwHXEj/9YbXl8qBpQxEt+gazzTZer881j/zCIoQHYeLoXQ5LPlXH0j02v8Ly ETq+i8qLwM99LpEvJQgWf0I8dogeYMBzsPNuKicTDyPUgyOFtu2JNmfajGD9DxzeZXY3 K9EhIa+4XJm1IJUjWxNvJ29S4ewiVNh1QmEnNZJMommWLR2xIxrjgl1is9jzVgtH25kG kyiyAZ5Hlok+Dg1s7gadRC6qB3m8qhMf8oJS5YOv4KqAQkFbi3w14Uun333m1OalA0Je xwZg== MIME-Version: 1.0 X-Received: by 10.224.151.82 with SMTP id b18mr2737567qaw.98.1395845322533; Wed, 26 Mar 2014 07:48:42 -0700 (PDT) Sender: adrian.chadd@gmail.com Received: by 10.224.125.1 with HTTP; Wed, 26 Mar 2014 07:48:42 -0700 (PDT) Date: Wed, 26 Mar 2014 07:48:42 -0700 X-Google-Sender-Auth: ByBBHnYfovy4mc_SLLyDR0w2bEM Message-ID: Subject: [ath] debugging (another) 11n TX hang.. From: Adrian Chadd To: "freebsd-wireless@freebsd.org" Content-Type: text/plain; charset=ISO-8859-1 X-BeenThere: freebsd-wireless@freebsd.org X-Mailman-Version: 2.1.17 Precedence: list List-Id: "Discussions of 802.11 stack, tools device driver development." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Mar 2014 14:48:43 -0000 Hi, This is mostly a "I should really document this for others to follow" post. Hiren and I are seeing situations where ath(4) in hostap mode will hang. It'll hang, but only to one station (ie, the NIC and driver isn't TX hung; it's something busted in the software queue handling.) Rui changed the debugging so you don't get packet printing unless the netif "debug" flag is enabled and silenced the flood of packet printf's that go on when a client disassociates with frames in the software queue. I erm, put that debugging back, under ATH_DEBUG_RESET (sysctl dev.ath.X.debug=0x20). So, with that in mind, I turned it on, started some traffic, and waited for it to hang. What I saw was. Annoying. The log is on another laptop; I'll post it here. But the summarised output is: ath1: ath_tx_tid_drain_print: norm: : ... seqno=3764, retry=0 ath1: ath_tx_tid_drain_print: norm: : txq[1] axq_depth=0 axq_aggr_depth=0 ath1: ath_tx_tid_drain_print: norm: : tid txq_depth=129 hwq_depth=0 bar_wait=0 isfiltered=0 ath1: ath_tx_tid_drain_print: norm: : tid 0: sched=1, paused=0, incomp=0, baw_head=125, baw_tail=61, txa_start=3700, ni_txseqs=3893 There's 128 more of them, for each of the packets in the software queue. Now, the basics: * seqno = assigned sequence number for this frame * txq[1] = this TID maps to the hardware txq[1] * the axq_depth / axq_aggr_depth shows that no hardware frames are queued to the device. One thing that used to happen is that a frame would get "stuck" in the MAC (mostly because of long standing DMA setup bugs) and if axq_depth or axq_aggr_depth > 2, the software queue code won't queue anything further. * the tid txq_depth=129, so there's 129 frames in the queue * hwq_depth=0, there's no frames from this TID queued to the hardware * bar_wait=0, so the TID isn't waiting for a BAR frame to complete before the queue is unpaused * isfiltered=0, so there's no filtered traffic waiting to complete before we can continue (eg power save handling) Now the fun bits: * sched=1 - the software scheduler is actually trying to send frames * paused=0, incomp=0 - the TID isn't paused (eg wiating for BAR to complete, ADDBA negotiation, etc) * incomp=0 - ie, there's no frames that need to complete before the queue is unpaused (see above) So, ok. it all looks fine. Next, I waited until the thing again hung, and this time I logged in via a different laptop and set up software TX debugging: # sysctl dev.ath.1.debug=0x1fc000020 And this is what I got: ath1: ath_txq_sched: tid=1, paused=0 ath1: ath_tx_tid_hw_queue_aggr: tid=1 ath1: ath_tx_rate_fill_rcflags: i=0, rate=0x8b, flags=0x19, max4ms=25708 ath1: ath_tx_rate_fill_rcflags: i=1, rate=0x8a, flags=0x19, max4ms=19280 ath1: ath_tx_rate_fill_rcflags: i=2, rate=0x89, flags=0x19, max4ms=12852 ath1: ath_get_aggr_limit: max frame len= 12852 ath1: ath_tx_tid_hw_queue_aggr: ath_tx_form_aggr() status=1 .. so, the software scheduler was running, it tried queuing something out, and ath_tx_form_aggr() returned 1 which is ATH_AGGR_BAW_CLOSED. _that_ happens in this condition: /* * If the packet has a sequence number, do not * step outside of the block-ack window. */ if (! BAW_WITHIN(tap->txa_start, tap->txa_wnd, SEQNO(bf->bf_state.bfs_seqno))) { status = ATH_AGGR_BAW_CLOSED; break; } .. now, let's look at what's going on. the first frame seqno here is (say) 3764. The window is 64 frames. txa_start is 3700. So, the first frame is definitely outside the sequence window and ath_tx_form_aggr() is quite rightly saying "nope, we can't schedule this yet." Now, I have no idea right now why this is happening. It may be some off-by-one in some comparison somewhere. The check here isn't an off-by-one as the seqno in question is plainly out of sequence. It _could_ be that there was a hole in the sequence numbers being assigned; that would certainly mess things up. I'll go and review the various places where sequence numbers are assigned and whether all the locking is correct. Unfortunately (!) I made a point of using the net80211 per-TID sequence number field which in the case of net80211 didn't use to be locked at all. I may have to just ditch that and have the driver set its own sequence numbers if aggregation is used. (Maybe for debugging I can track it privately too in the ath driver and complain if they drift out of sync.) I'll file a PR with this so it's being tracked. Anyway - this wifi stuff is mystical but there's plenty of logging and debugging going on right now. What I do note however is that I really wish the MAC addresses of the nodes were logged, as it's a pain to figure out which logging entries match up with which node. As a side note - if anyone would like a smallish project - it'd be nice to introduce a DPRINTF() for the ath driver code that took a node parameter too - so I can do per-node debugging as well as global driver debugging. That way I can do things like turn on debugging only for a given node and still use the VAP for other devices. -a