Date:      Thu, 31 Aug 2017 18:04:27 +0200
From:      Ben RUBSON <ben.rubson@gmail.com>
To:        Julien Charbon <jch@freebsd.org>
Cc:        Hans Petter Selasky <hps@selasky.org>, FreeBSD Net <freebsd-net@freebsd.org>, hiren <hiren@strugglingcoder.info>, Slawa Olhovchenkov <slw@zxy.spb.ru>, FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:   Re: mlx4en, timer irq @100%... (11.0 stuck on high network load ???)
Message-ID:  <82EFBD5E-8FC2-4156-A030-AF70D97A37BA@gmail.com>
In-Reply-To: <82e661b4-1bac-ff5b-f776-8dba44cac15e@freebsd.org>
References:  <BF3A3E47-A726-49FB-B83F-666EFC5C0FF1@gmail.com> <7f14c95d-1ef8-bf82-c469-e6566c3aba66@selasky.org> <76A5EE7E-1D2E-46B4-86F1-F219C3DCE6EA@gmail.com> <e6f9df1c-8b55-8a3b-9f44-e67c26561543@selasky.org> <4C91C6E5-0725-42E7-9813-1F3ACF3DDD6E@gmail.com> <5840c25e-7472-3276-6df9-1ed4183078ad@selasky.org> <2ADA8C57-2C2D-4F97-9F0B-82D53EDDC649@gmail.com> <061cdf72-6285-8239-5380-58d9d19a1ef7@selasky.org> <92BEE83D-498F-47D5-A53C-39DCDC00A0FD@gmail.com> <5d8960d8-e1ff-8719-320f-d3ae84054714@selasky.org> <6B4A35F7-5694-4945-9575-19ADB678F9FA@gmail.com> <297a784a-3d80-b1a6-652e-a78621fe5a8b@selasky.org> <3ECCFBF1-18D9-4E33-8F39-0C366C3BB8B4@gmail.com> <c05c2b1c-b5a8-c39c-6dff-e6cc0d8642bf@selasky.org> <0a5787c5-8a53-ab09-971a-dc1cd5f3aca0@freebsd.org> <E4124973-5F01-4EAF-AAF3-F32F419678A4@gmail.com> <645f2ee3-3eaa-660e-2a64-37d53e88322f@freebsd.org> <13DE4E6D-CE83-4B5D-BF88-0EFE65111311@gmail.com> <7B084207-062A-4529-B0DC-5BFEB6517780@gmail.com> <82e661b4-1bac-ff5b-f776-8dba44cac15e@freebsd.org>

> On 28 Aug 2017, at 11:27, Julien Charbon <jch@freebsd.org> wrote:
>
> On 8/28/17 10:25 AM, Ben RUBSON wrote:
>>> On 16 Aug 2017, at 11:02, Ben RUBSON <ben.rubson@gmail.com> wrote:
>>>
>>>> On 15 Aug 2017, at 23:33, Julien Charbon <jch@freebsd.org> wrote:
>>>>
>>>> On 8/11/17 11:32 AM, Ben RUBSON wrote:
>>>>>> On 08 Aug 2017, at 13:33, Julien Charbon <jch@freebsd.org> wrote:
>>>>>>
>>>>>> On 8/8/17 10:31 AM, Hans Petter Selasky wrote:
>>>>>>>
>>>>>>> Suggested fix attached.
>>>>>>
>>>>>> I agree with your conclusion.  Just for the record, more precisely this
>>>>>> regression seems to have been introduced with:
>>>>>> (...)
>>>>>> Thus good catch, and your patch looks good.  I am going to just verify
>>>>>> the other in_pcbrele_wlocked() calls in the TCP stack.
>>>>>
>>>>> Julien, do you plan to make this fix reach 11.0-p12 ?
>>>>
>>>> I am checking if your issue is another flavor of the issue fixed by:
>>>>
>>>> https://svnweb.freebsd.org/base?view=revision&revision=307551
>>>> https://reviews.freebsd.org/D8211
>>>>
>>>> This fix is not in 11.0 but is in 11.1.  So far I have not found how an
>>>> inp in INP_TIMEWAIT state can have been INP_FREED without having its tw
>>>> set to NULL already, except through the issue fixed by r307551.
>>>>
>>>> Thus could you try to apply this patch:
>>>>
>>>> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch
>>>>
>>>> and see if you can still reproduce this issue?
>>>
>>> Thank you for your answer Julien.
>>> Unfortunately, I'm not sure at all how to reproduce the issue.
>>> I have other servers which are 100% identical to this one, same workload,
>>> same some-months uptime, but they did not trigger the bug yet.
>>>
>>> If other network stack experts (I'm not) agree with your analysis,
>>> we could then certainly go further with D8211 / r307551.
>>>
>>> One thing that perhaps might help :
>>> # netstat -an | grep TIME_WAIT$ | wc -l
>>> 468
>>>
>>> Note that due to this running bug, sendmail has a lot of difficulty sending outgoing mails.
>>> As soon as I run the above netstat command, I receive a lot of stacked mails (more than 20 this time).
>>> As if netstat was able to somehow help...
>>>
>>> Number of TIME_WAIT connections however does not decrease, but increases.
>>>
>>>> And in the spirit of r307551 fix and based on Hans patch I will also
>>>> propose to add a kernel log describing the issue instead of starting an
>>>> infinite loop when INVARIANTS is not set.
>>>
>>> Which should then never be triggered :)
>>> Good idea I think !
>>
>> What about :
>> D8211/r307551
>> + Hans' patch
>> + Julien's idea of a kernel log (sort of "We should not be here but we are")
>
> I did this change and I am testing it

Good news !
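
To make the "log instead of looping forever" idea concrete, here is a minimal
user-space sketch in C. It is only an illustration under my own assumptions,
not the actual patch and not kernel code: struct entry, struct queue and
scan() are hypothetical names, and the "freed" flag merely stands in for
INP_FREED. The point it shows is that the scan always unlinks the head entry
before deciding what to do with it, so an entry found in an unexpected state
gets reported and skipped rather than being revisited endlessly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a TIME_WAIT entry; NOT the kernel's struct tcptw. */
struct entry {
    struct entry *next;
    int freed;   /* stands in for the INP_FREED flag */
    int expired; /* nonzero once the entry's timer has run out */
};

/* Hypothetical queue of entries, oldest first. */
struct queue {
    struct entry *head;
    unsigned warnings; /* number of "should not happen" entries seen */
};

/*
 * Reap expired entries from the head of the queue.
 * An already-freed entry is unexpected here: it is logged and dropped
 * from the queue so that the loop still makes progress.
 * Returns the number of entries reaped normally.
 */
static unsigned scan(struct queue *q)
{
    unsigned reaped = 0;

    assert(q != NULL);
    for (;;) {
        struct entry *e = q->head;

        /* Stop at the end of the queue or at the first live, unexpired entry. */
        if (e == NULL || (!e->freed && !e->expired))
            break;

        /* Always unlink first: this is what guarantees termination. */
        q->head = e->next;
        e->next = NULL;

        if (e->freed) {
            /* "We should not be here but we are": log it, do not spin. */
            fprintf(stderr, "scan: entry %p was already freed, skipping\n",
                (void *)e);
            q->warnings++;
            continue;
        }
        reaped++;
    }
    return reaped;
}
```

A caller would simply run scan() periodically and could watch the warnings
counter (or the log) to learn that the unexpected state was reached.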

> on your side did you try this patch applied on 11.0?
>
> https://github.com/freebsd/freebsd/commit/acb5bfda99b753d9ead3529d04f20087c5f7d0a0.patch

Yes, patch applied and running correctly,
however hard to say whether or not it solves this issue,
as there is no easy way to reproduce it.

>> And backporting all this to 11.0 (and so to 11.1 too) ?
>>
>> As this bug can impact every FreeBSD machine / server,
>> leading to an unavailable / unreachable system (this is how mine ended),
>> sounds like it could inevitably be a good thing, for production stability purposes.
>
> The main fix for your issue is (I believe):
>
> Fix a double-free when an inp transitions to INP_TIMEWAIT state
> after having been dropped.
> https://svnweb.freebsd.org/base?view=revision&revision=307551
>
> This fix has been MFC-ed to both stable/11 and stable/10, is already
> included in 11.1, and will be in 10.4.  To push it into the 11.0 release directly,
> I guess you have to promote this change to an Errata (never did that
> myself):
>
> https://www.freebsd.org/security/notices.html
> https://www.freebsd.org/security/security.html#reporting

Mail sent to FreeBSD Security Team !

Many thanks, let's stay tuned !

Ben



