Date:      Fri, 7 Sep 2018 14:48:41 +0200
From:      Emmanuel Vadot <manu@bidouilliste.com>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        arm@FreeBSD.org
Subject:   Re: Allwinner awg TX hanging issue
Message-ID:  <20180907144841.5f97ef0aed8f84e3cdba7932@bidouilliste.com>
In-Reply-To: <20180906163547.GC75530@funkthat.com>
References:  <20180906163547.GC75530@funkthat.com>

On Thu, 6 Sep 2018 09:35:47 -0700
John-Mark Gurney <jmg@funkthat.com> wrote:

> Since I upgraded to a recent -current to fix the timer issue on my
> A64-LTS board, I've been having an issue where the ethernet interface
> will freeze.  This is with:
> FreeBSD gate2.funkthat.com 12.0-ALPHA4 FreeBSD 12.0-ALPHA4 #4 r338426M: Wed Sep  5 09:55:12 PDT 2018     root@gate2.funkthat.com:/usr/src/sys/arm64/compile/GENERIC  arm64
> 
> The modified code is simply to add some dtrace probe points to debug
> this issue.  I also dropped the check for _OACTIVE from _start_locked.
> 
> It prints the flags at the beginning of _start_locked, whenever _OACTIVE
> gets set, and at the end of _txeof if progress was made.  It also prints
> the progress at the end of _txeof, if any, and the value in _intr.
> 
> I noticed that when it was hung, the OACTIVE flag was set, but this
> just means that we ran out of transmit descriptors, and was a symptom
> of the problem.
> 
> I don't have a good test to trigger this problem.  This happens somewhat
> regularly, every 4-12 hours on my router, but my test board, which is
> lightly loaded and does not run pf, doesn't have this issue.
> 
> With the added dtrace probe points, I finally hit this:
>   3  10115                        none:intr intr 40000024
>   3  10115                        none:intr intr 40000100
>   3  10115                        none:intr intr 40000100
>   3  10115                        none:intr intr 40000100
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
>   3  10115                        none:intr intr 40000024
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
>   3  10115                        none:intr intr 40000024
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
>   3  10115                        none:intr intr 40000024
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
>   3  10115                        none:intr intr 40000100
>   3  10115                        none:intr intr 40000024
>   3  10115                        none:intr intr 4000010a
>   3  10114                       none:flags flag 40
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
> [...]
>   3  10115                        none:intr intr 40000100
>   3  10114                       none:flags flag 40
>   3  10114                       none:flags flag 440
>   3  10115                        none:intr intr 40000100
>   3  10115                        none:intr intr 40000100
>   3  10115                        none:intr intr 40000100
> 
> The intr 24 line is a normal interrupt, and will run txeof to free up
> descriptors.  The intr 100 line is saying the RGMII link status
> changed; we don't enable it, so I'm not sure why we are getting these
> interrupts (it seems like the enable bit is ignored).

 It looks like 0x100 is RX_INT, not RGMII_LINK_STATUS.

>  These are
> normal, and we see these lines for a long while.  The flag 440 line is
> when we set OACTIVE, and then we see no more flag 40 lines, which
> means that _start_locked doesn't get called and that _txeof doesn't
> make forward progress.
> 
> The problem point is the intr 10a line.  Once we hit that line, we
> never get another intr 24 line.  The a is the important part of the
> interrupt status, as it is:
> 
> 0x8
> TX_TIMEOUT_INT
> When this bit is asserted, the transmitter had been excessively active.

 Linux seems to stop the queue and the DMA, free the skbuf, and restart
the DMA when this happens. We might need to do the same.
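
 To make the order of those steps concrete, here is a rough sketch of
the same sequence against a simulated TX ring. It is not awg driver code
and not the Linux implementation: struct tx_ring_sim, TX_DESC_COUNT and
tx_timeout_recover() are hypothetical names made up for illustration,
and re-enabling the queue at the end is my assumption about what should
follow the DMA restart.

```c
#include <stdbool.h>
#include <stddef.h>

#define TX_DESC_COUNT	8	/* hypothetical ring size */

/* Simulated TX state; a real driver would touch registers and mbufs. */
struct tx_ring_sim {
	bool		dma_running;	/* TX DMA engine state */
	bool		queue_stopped;	/* plays the role of OACTIVE */
	bool		in_use[TX_DESC_COUNT];	/* slot still holds a frame */
	unsigned	queued;		/* number of slots in use */
};

/*
 * Recover from a TX timeout: stop the queue, stop the DMA, drop every
 * pending frame, then restart the DMA.  Returns the number of frames
 * dropped so a caller could add it to an error counter.
 */
unsigned
tx_timeout_recover(struct tx_ring_sim *r)
{
	unsigned dropped = 0;
	size_t i;

	r->queue_stopped = true;	/* 1. no new frames may be queued */
	r->dma_running = false;		/* 2. stop the TX DMA */

	for (i = 0; i < TX_DESC_COUNT; i++) {	/* 3. free pending frames */
		if (r->in_use[i]) {
			r->in_use[i] = false;
			dropped++;
		}
	}
	r->queued = 0;

	r->dma_running = true;		/* 4. restart the TX DMA */
	r->queue_stopped = false;	/* assumption: let the stack send again */
	return (dropped);
}
```

 In a real driver, step 3 would also have to reset the descriptor ring
indexes before the DMA is restarted, which this simulation leaves out.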

> and:
> 
> 0x2
> TX_DMA_STOPPED_INT
> When this bit is asserted, the TX DMA FSM is stopped.
> 
> We do not have code in the awg driver to recover from this problem.

 Linux only increments a stat counter when this IRQ fires.
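
 For reading the trace, a small decoder may help. This is only a sketch:
the interrupt bit values are the ones quoted in this thread (0x100, 0x8,
0x2), the flag values are IFF_DRV_RUNNING and IFF_DRV_OACTIVE as defined
in FreeBSD's net/if.h (redefined locally here so the example stands
alone), and tx_stalled(), print_intr() and print_flags() are made-up
helper names. Any other bits in the status word are left undecoded.

```c
#include <stdint.h>
#include <stdio.h>

/* Interrupt status bits, using only the values quoted in this thread. */
#define RX_INT			0x100u
#define TX_TIMEOUT_INT		0x008u
#define TX_DMA_STOPPED_INT	0x002u

/* ifnet driver flags; values as in net/if.h, redefined locally. */
#define IFF_DRV_RUNNING		0x040u
#define IFF_DRV_OACTIVE		0x400u

/* Nonzero if the status word carries either TX error bit named above. */
int
tx_stalled(uint32_t intr)
{
	return ((intr & (TX_TIMEOUT_INT | TX_DMA_STOPPED_INT)) != 0);
}

/* Print the names of the known interrupt bits set in intr. */
void
print_intr(uint32_t intr)
{
	printf("intr %08x:%s%s%s\n", (unsigned)intr,
	    (intr & RX_INT) ? " RX_INT" : "",
	    (intr & TX_TIMEOUT_INT) ? " TX_TIMEOUT_INT" : "",
	    (intr & TX_DMA_STOPPED_INT) ? " TX_DMA_STOPPED_INT" : "");
}

/* Print the names of the two driver flags seen in the trace. */
void
print_flags(uint32_t flags)
{
	printf("flag %x:%s%s\n", (unsigned)flags,
	    (flags & IFF_DRV_RUNNING) ? " RUNNING" : "",
	    (flags & IFF_DRV_OACTIVE) ? " OACTIVE" : "");
}
```

 With these definitions, tx_stalled() is true for 4000010a and false for
40000100 and 40000024, and flag 440 decodes as RUNNING plus OACTIVE
while flag 40 is RUNNING alone.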

> Does anyone have any ideas?
> 
> Thanks.
> 
> -- 
>   John-Mark Gurney				Voice: +1 415 225 5579
> 
>      "All that I will do, has been done, All that I have, has not."
> _______________________________________________
> freebsd-arm@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-arm
> To unsubscribe, send any mail to "freebsd-arm-unsubscribe@freebsd.org"


-- 
Emmanuel Vadot <manu@bidouilliste.com> <manu@freebsd.org>


