FreeBSD Mail Archives

Date:      Fri, 1 Apr 2011 11:55:14 -0700
From:      Jack Vogel <jfvogel@gmail.com>
To:        Arnaud Lacombe <lacombar@gmail.com>
Cc:        freebsd-net@freebsd.org
Subject:   Re: em(4) hang [Was: Re: igb(4) won't start with "igb0: Could not setup receive structures"]
Message-ID:  <AANLkTikUKx4NW=ejJ9N9JuWE9iB96y5p%2BdgUhZ3xZS5s@mail.gmail.com>
In-Reply-To: <AANLkTi=0OkSLnz0cpv02Jrxz_piOhMT40m7xWK0NCiuH@mail.gmail.com>
References:  <AANLkTin64gGxRituE2B%2BsfVpRXt2QetdNLaV7HCf0uNE@mail.gmail.com> <AANLkTi=OjzMrjCPZ2VFDBf6URTaMoAzQqXbxWLv3d9mW@mail.gmail.com> <AANLkTikvbvr%2BY=Fh2fPVieHkTRix%2Bni61jVPct10NKfD@mail.gmail.com> <AANLkTina-MO4GuK66ZJN0hipp%2BVCa-CUxEz79rzRt-cZ@mail.gmail.com> <AANLkTi=OVSOitMvdjHexbv-fu0fA1WWOHo7gm-=MtPRf@mail.gmail.com> <AANLkTikmjmBKf9XUuSrYQz4T7xsR5ynvxHm2cjEDtFE%2B@mail.gmail.com> <AANLkTimut2BMxvhkkyREnK_izXek5tAT5jrw8tW%2BNKVY@mail.gmail.com> <AANLkTin1KKiPKEf_KquG0NrbqExDsGPU_tizam7tYV9Y@mail.gmail.com> <AANLkTi=0OkSLnz0cpv02Jrxz_piOhMT40m7xWK0NCiuH@mail.gmail.com>

Arnaud,

Please try the code change I just checked into HEAD, it should finally
resolve any hang that is due to mbufs not being refreshed. That's not to
say there may not be other reasons out there but I'm keeping my fingers
crossed that this is behind at least some of the hangs.

Jack


On Thu, Mar 31, 2011 at 6:16 PM, Jack Vogel <jfvogel@gmail.com> wrote:

> I know how I'm going to handle this, am formulating code for it, should
> have something that can be tested tomorrow, time to head out for the
> night..
>
> Essentially, rather than just looking for equality, I will calculate the
> number of unrefreshed mbufs given the check/refresh values, and then call
> refresh when anything is unrefreshed. This will happen in rxeof, but I
> will also put back the rx interrupt trigger into local timer. I'm pretty
> sure this will be bullet proof, at least for this kind of hang.
>
> Jack
>
>
> On Thu, Mar 31, 2011 at 5:28 PM, Jack Vogel <jfvogel@gmail.com> wrote:
>
>> You know what Arnaud, I've looked at the numbers again, and I suddenly saw
>> that next_to_check and next_to_refresh are NOT in a good state, exactly
>> the opposite, check is BEHIND refresh, which means the whole ring is
>> empty, the HEAD (next_to_check) is pointing at 929, but next_to_refresh
>> is at 930, RIGHT IN FRONT of it, so the whole ring is depleted!!
>>
>> What this means is that just a test of check == refresh is not going to
>> be good enough to protect against all cases, so let me think about how
>> to handle this...
>>
>> Jack
>>
>>
>>
>> On Thu, Mar 31, 2011 at 4:38 PM, Jack Vogel <jfvogel@gmail.com> wrote:
>>
>>> My validation group has some kind of hang... happens when they use a
>>> certain number of clients each running a stress test to the SUT, it's
>>> like this, no real handle on what's wrong, if I knew what was wrong it
>>> would be half way or more to fixing it :)
>>>
>>> The evidence shows you have hit the max clusters at one point, but have
>>> freed most of them back up again, there is no shortage right at this
>>> point. Your previous data showed a normal idle head/tail relationship....
>>>
>>> Just as a data point, will you please disable msix, recompile and run in
>>> MSI mode, I just want to see if that makes a difference. Search in the
>>> driver for em_enable_msix and set it FALSE.
>>>
>>> Jack
>>>
>>>
>>>
>>> On Thu, Mar 31, 2011 at 4:06 PM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> On Thu, Mar 31, 2011 at 6:28 PM, Jack Vogel <jfvogel@gmail.com> wrote:
>>>> > OK, but those are not something present in this data, that was what
>>>> > I'm asking.
>>>> >
>>>> > So, you have a hang for which we do not have a certain cause.  What
>>>> > does netstat -m show?
>>>> >
>>>> # netstat -m
>>>> 3073/74927/78000 mbufs in use (current/cache/total)
>>>> 3070/29698/32768/32768 mbuf clusters in use (current/cache/total/max)
>>>> 0/383 mbuf+clusters out of packet secondary zone in use (current/cache)
>>>> 0/12800/12800/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
>>>> 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
>>>> 6908K/129327K/136236K bytes allocated to network (current/cache/total)
>>>> 0/1080/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
>>>> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
>>>> 0/7/6656 sfbufs in use (current/peak/max)
>>>> 0 requests for sfbufs denied
>>>> 0 requests for sfbufs delayed
>>>> 0 requests for I/O initiated by sendfile
>>>> 0 calls to protocol drain routines
>>>>
>>>> Note that the mbuf allocation denials did not happen all at once. The
>>>> count has been progressively increasing in blocks of ~200 over the 5h
>>>> of uptime of the machine, until the current condition occurred.
>>>>
>>>> I have previously been trying to simulate the depletion and the hang,
>>>> but the driver recovered. I assume the condition is met in
>>>> em_local_timer() to refresh the ring, I'd still need to check that.
>>>>
>>>>  - Arnaud
>>>>
>>>> > Jack
>>>> >
>>>> >
>>>> > On Thu, Mar 31, 2011 at 3:15 PM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> On Thu, Mar 31, 2011 at 5:57 PM, Jack Vogel <jfvogel@gmail.com> wrote:
>>>> >> > So, what is the evidence that the driver is stuck here?
>>>> >> >
>>>> >> About 800 pps (mostly SYN) present on the wire but never seen on em0,
>>>> >> plus a couple of ARP replies, which still never hit em0, plus the
>>>> >> `missed_packets' count increasing by the same 800 pps in the last
>>>> >> hour. Is that enough ?
>>>> >>
>>>> >>  - Arnaud
>>>> >>
>>>> >> ps: I forgot to add that the MAC addresses on the wire are fine.
>>>> >>
>>>> >> > I see that next_to_check != next_to_refresh, which is why the
>>>> >> > local timer won't schedule anything. OH, and I also realized there
>>>> >> > is a problem with local_timer anyway, it will run rxeof, but that
>>>> >> > won't help if you can't enter the loop, so I need to add some code
>>>> >> > at the top to call em_refresh_mbufs() when in this state.
>>>> >> >
>>>> >> > On this interrupt cause that you are focused upon, although it's
>>>> >> > there in the design, I had talked with some of our most seasoned
>>>> >> > developers on both the Windows and Linux side of the house, and NO
>>>> >> > one has ever used this 'feature', because (and I'm quoting here)
>>>> >> > "there's no good use case for it". Meaning, there's always some
>>>> >> > simpler way of handling the issue.
>>>> >> >
>>>> >> > When you use MSIX you can't read causes btw, if you configured it,
>>>> >> > it would mean you'd just get into the regular RX handler, same as
>>>> >> > always, so why some special bother with this cause?
>>>> >> >
>>>> >> > On non-MSIX hardware there is just no particular reason to worry
>>>> >> > about the cause either, we can just handle the RX situation in the
>>>> >> > interrupt handler.
>>>> >> >
>>>> >> > Jack
>>>> >> >
>>>> >> >
>>>> >> > On Thu, Mar 31, 2011 at 2:09 PM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Hi Jack,
>>>> >> >>
>>>> >> >> On Thu, Mar 31, 2011 at 9:51 AM, Arnaud Lacombe <lacombar@gmail.com> wrote:
>>>> >> >> > [...]
>>>> >> >> > I'll remove part of the changes I made to keep only
>>>> >> >> > `rx_forced_refill' and the associated sysctl, re-run the tests
>>>> >> >> > and come back with correct value, hopefully in a few hours.
>>>> >> >> >
>>>> >> >> Here it is:
>>>> >> >>
>>>> >> >> # sysctl dev.em.0.%desc
>>>> >> >> dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.2.2
>>>> >> >>
>>>> >> >> # sysctl dev.em.0.mac_stats.missed_packets
>>>> >> >> dev.em.0.mac_stats.missed_packets: 917428
>>>> >> >>
>>>> >> >> # sysctl dev.em.0.debug=1
>>>> >> >> dev.em.0.debug: I-1nterface is RUNNING and INACTIVE
>>>> >> >> em0: hw tdh = 975, hw tdt = 975
>>>> >> >> em0: hw rdh = 884, hw rdt = 885
>>>> >> >> em0: Tx Queue Status = 0
>>>> >> >> em0: TX descriptors avail = 1024
>>>> >> >> em0: Tx Descriptors avail failure = 0
>>>> >> >> em0: RX discarded packets = 0
>>>> >> >> em0: RX Next to Check = 884
>>>> >> >> em0: RX Next to Refresh = 885
>>>> >> >>  -> -1
>>>> >> >>
>>>> >> >> So the taskqueue cannot be scheduled to run and the driver is stuck.
>>>> >> >>
>>>> >> >> > On Wed, Mar 30, 2011 at 2:22 PM, Jack Vogel <jfvogel@gmail.com> wrote:
>>>> >> >> >> Read the code in HEAD, em_local_timer() has a test of ALL the rx
>>>> >> >> >> queues and will schedule a task that refreshes mbufs if they are
>>>> >> >> >> empty. This has exactly the same effect as checking for some
>>>> >> >> >> interrupt cause, a cause that is not available when using MSIX
>>>> >> >> >> on 82574, but this approach works for everything.
>>>> >> >> >>
>>>> >> >> Can you please point me to a reference datasheet (or errata),
>>>> >> >> provided by Intel, about the RX Overrun interrupt not being
>>>> >> >> available with MSI-X on the 82574 ?
>>>> >> >>
>>>> >> >> Currently, I only have access to [0], which specifies the following:
>>>> >> >>
>>>> >> >> 7.4 Interrupts
>>>> >> >> 7.4.2 MSI-X Mode
>>>> >> >> [...]
>>>> >> >> The following configuration and parameters are involved:
>>>> >> >> • The IVAR.INT_Alloc[4:0] entries map two Tx queues, two Rx queues
>>>> >> >>   and other events to 5 interrupt vectors
>>>> >> >> • The ICR[24:20] bits reflect specific interrupt causes
>>>> >> >> • Five MSI-X interrupt vectors are provided (calculated based on
>>>> >> >>   four vectors for queues and one vector for other causes). The
>>>> >> >>   requested number of vectors is loaded from the MSI_X_N fields in
>>>> >> >>   the EEPROM into the PCIe MSI-X capability structure of the
>>>> >> >>   function.
>>>> >> >>
>>>> >> >> 10.2.4.1 Interrupt Cause Read Register - ICR (0x000C0; RC/WC)
>>>> >> >> [...]
>>>> >> >>
>>>> >> >> about bit 24:
>>>> >> >>
>>>> >> >> Other Interrupt. Indicates one of the following interrupts was set:
>>>> >> >> • Link Status Change.
>>>> >> >> • Receiver Overrun.
>>>> >> >> • MDIO Access Complete.
>>>> >> >> • Small Receive Packet Detected.
>>>> >> >> • Receive ACK Frame Detected.
>>>> >> >> • Manageability Event Detected.
>>>> >> >>
>>>> >> >> Thanks in advance,
>>>> >> >>  - Arnaud
>>>> >> >>
>>>> >> >> [0]: ftp://download.intel.com/design/network/datashts/82574.pdf
>>>> >> >
>>>> >> >
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>
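
The idea discussed in the thread, counting unrefreshed RX slots instead of
testing check == refresh, can be illustrated with a small sketch. This is
not the change that was committed to HEAD. The function names and the
1024-entry ring size are hypothetical, and it assumes that a fully
refreshed ring leaves next_to_refresh exactly one slot behind
next_to_check.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the em(4) driver.
 * Counts ring slots that still need a fresh mbuf, walking forward from
 * next_to_refresh up to (but not including) the slot just before
 * next_to_check, with wrap-around at num_desc.
 *
 * Assumption: when nothing is pending, next_to_refresh sits one slot
 * behind next_to_check, so the result is 0 in that state.
 */
static uint16_t
rx_unrefreshed(uint16_t next_to_check, uint16_t next_to_refresh,
    uint16_t num_desc)
{
	/* Both indices must be valid positions in the ring. */
	assert(num_desc > 1);
	assert(next_to_check < num_desc && next_to_refresh < num_desc);

	if (next_to_check > next_to_refresh)
		return ((uint16_t)(next_to_check - next_to_refresh - 1));

	/* The refresh index is at or ahead of check: count across the wrap. */
	return ((uint16_t)(num_desc + next_to_check - next_to_refresh - 1));
}

/*
 * Hypothetical predicate: a refresh is needed whenever any slot is
 * pending, rather than only when the two indices are equal.
 */
static int
rx_needs_refresh(uint16_t next_to_check, uint16_t next_to_refresh,
    uint16_t num_desc)
{
	return (rx_unrefreshed(next_to_check, next_to_refresh, num_desc) != 0);
}
```

With an assumed ring of 1024 descriptors, the values reported in the thread
(check at 929 with refresh at 930, or check at 884 with refresh at 885)
give 1022 pending slots, so the predicate fires even though the two
indices are not equal, which is the case the plain equality test missed.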

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?AANLkTikUKx4NW=ejJ9N9JuWE9iB96y5p%2BdgUhZ3xZS5s>
