Date:      Mon, 11 May 2015 05:37:07 -0300
From:      Christopher Forgeron <csforgeron@gmail.com>
To:        Mark Schouten <mark@tuxis.nl>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: [Bug 199174] em tx and rx hang
Message-ID:  <CAB2_NwBKZhwWe-cfb0GRiEX1iUbzte9=jhYpvV-ALwURJHcJUg@mail.gmail.com>
In-Reply-To: <1107864458-32391@kerio.tuxis.nl>
References:  <bug-199174-2472-LonL56obUY@https.bugs.freebsd.org/bugzilla/> <1107864458-32391@kerio.tuxis.nl>

I'd go a step further and say it's the _exact_ same problem.

If you're using anything other than 4k clusters on a heavily loaded system,
you'll probably have issues.

It's not just the MTU. Case in point: I set my MTU to 4000, but since my
iSCSI block size is 8k, I noticed that I still had plenty of 9k jumbo
clusters in use, and I still crashed within half a day to a day and a half
of uptime. Usually it's 'ix0 is flapping', or perhaps a kernel panic, or
just dead ix interfaces that won't transmit.
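
(You can watch the cluster mix with netstat -m; the jumbo cluster lines
are the ones to watch. The counters below are illustrative, not from my
box:)

$ netstat -m | grep 'jumbo clusters'
8192/310/8502 4k (page size) jumbo clusters in use (current/cache/total)
2037/75/2112 9k jumbo clusters in use (current/cache/total)
0/0/0 16k jumbo clusters in use (current/cache/total)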

I patched my ixgbe.c to use only 4k clusters, and now I can run an MTU of
9000 again without issue.
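
For the curious, a sketch of the kind of change - against the
receive-buffer-size selection in ixgbe.c, from memory, so the exact
context will vary by source tree:

    /*
     * Sketch: cap the RX buffer size at one page so the driver never
     * allocates 9k/16k jumbo clusters; larger frames are then carried
     * in chains of page-sized clusters instead.
     */
    if (adapter->max_frame_size <= MCLBYTES)
            adapter->rx_mbuf_sz = MCLBYTES;
    else
            adapter->rx_mbuf_sz = MJUMPAGESIZE; /* was MJUM9BYTES/MJUM16BYTES */

The point is that page-sized clusters never require multi-page physically
contiguous allocations, so the allocator can't stall hunting for
contiguous 9k chunks once memory fragments under load.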

I want to take the time to dig up more of my info on this to present to the
list, but I've lost a lot of time tracking this down... still cleaning up
as we speak.

The worst part of the jumbo clusters bug is that it's very specific to a
particular load. My systems were fine until I took on a new Exchange 2013
load that started popping all the FreeBSD SANs - and these were
load-tested production machines that had been in service for months without
issues.

In one of these threads, Garrett Wollman lays out his ideas for a fix. I
second the idea of a large ring buffer being created at boot for the
network cards to use, and like him I regretfully have no time to spare to
help... well, perhaps I can find some time for this, but I can only help,
not lead.
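
To make that concrete, here's a hypothetical sketch of how I read the
proposal: a fixed pool of receive buffers carved out of contiguous memory
at boot and recycled through a free list, so RX refills never touch the
general cluster zones. All names and sizes here are made up, and locking
is omitted for brevity - it's an illustration, not a patch:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>

    #define RXPOOL_BUFS  4096  /* buffers reserved at boot (illustrative) */
    #define RXPOOL_BUFSZ 4096  /* one page per buffer */

    static caddr_t rxpool_base;               /* contiguous backing store */
    static caddr_t rxpool_free[RXPOOL_BUFS];  /* simple free stack */
    static int     rxpool_top;

    /* Called once at boot, before physical memory fragments. */
    static void
    rxpool_init(void)
    {
            int i;

            rxpool_base = contigmalloc(RXPOOL_BUFS * RXPOOL_BUFSZ, M_DEVBUF,
                M_WAITOK, 0UL, ~0UL, PAGE_SIZE, 0);
            for (i = 0; i < RXPOOL_BUFS; i++)
                    rxpool_free[rxpool_top++] = rxpool_base + i * RXPOOL_BUFSZ;
    }

    /* RX refill path: hand out a buffer, never allocate. */
    static caddr_t
    rxpool_get(void)
    {
            return (rxpool_top > 0 ? rxpool_free[--rxpool_top] : NULL);
    }

    /* mbuf free callback returns the buffer to the pool. */
    static void
    rxpool_put(caddr_t buf)
    {
            rxpool_free[rxpool_top++] = buf;
    }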

Here's one of the last machines popping on me tonight before I could get to
it with a patched kernel. This is an unusual error; the 'ix0 flapping'
message is far more common.

May 11 04:04:06 aa_fast_b kernel: panic: solaris assert: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c, line: 830
May 11 04:04:06 aa_fast_b kernel: cpuid = 1
May 11 04:04:06 aa_fast_b kernel: KDB: stack backtrace:
May 11 04:04:06 aa_fast_b kernel: #0 0xffffffff80962fd0 at kdb_backtrace+0x60
May 11 04:04:06 aa_fast_b kernel: #1 0xffffffff809280f5 at panic+0x155
May 11 04:04:06 aa_fast_b kernel: #2 0xffffffff81bbe1fd at assfail+0x1d
May 11 04:04:06 aa_fast_b kernel: #3 0xffffffff81983388 at dmu_write+0x98
May 11 04:04:06 aa_fast_b kernel: #4 0xffffffff819c8ec5 at space_map_write+0x3c5
May 11 04:04:06 aa_fast_b kernel: #5 0xffffffff819afb30 at metaslab_sync+0x4e0
May 11 04:04:06 aa_fast_b kernel: #6 0xffffffff819cf69b at vdev_sync+0xcb
May 11 04:04:06 aa_fast_b kernel: #7 0xffffffff819c0fdb at spa_sync+0x5db
May 11 04:04:06 aa_fast_b kernel: #8 0xffffffff819ca3f6 at txg_sync_thread+0x3a6
May 11 04:04:06 aa_fast_b kernel: #9 0xffffffff808f8b3a at fork_exit+0x9a
May 11 04:04:06 aa_fast_b kernel: #10 0xffffffff80d0ac8e at fork_trampoline+0xe
May 11 04:04:06 aa_fast_b kernel: Uptime: 1d12h7m45s
May 11 04:04:06 aa_fast_b kernel: (da1:iscsi7:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da3:iscsi5:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da4:iscsi11:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da7:iscsi4:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da8:iscsi6:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da9:iscsi10:0:0:0): Synchronize cache failed
May 11 04:04:06 aa_fast_b kernel: (da10:iscsi1:0:0:0): Synchronize cache failed

It's lots of fun... it really is. I'm glad I have a lot of redundancy and
backups.

On Mon, May 11, 2015 at 5:13 AM, Mark Schouten <mark@tuxis.nl> wrote:

> Please note that these issues look very much like the issues I had, before
> I switched from an MTU of 9000 to 1500 ...
>
>
> Kind regards,
>
> --
> Kerio Operator in the cloud? https://www.kerioindecloud.nl/
> Mark Schouten  | Tuxis Internet Engineering
> KvK: 61527076 | http://www.tuxis.nl/
> T: 0318 200208 | info@tuxis.nl
>
>
>
>  From:    <bugzilla-noreply@freebsd.org>
>  To:      <freebsd-net@FreeBSD.org>
>  Sent:    8-5-2015 19:42
>  Subject: [Bug 199174] em tx and rx hang
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199174
>
> --- Comment #15 from Sean Bruno <sbruno@FreeBSD.org> ---
> (In reply to david.keller from comment #14)
> Nothing fancy here.
>
> Server runs "iperf -p 8000 -s"  (8-core AMD box)
> Client under test runs this forever:
>
> #!/bin/sh
>
> FILE=test.out
>
> if [ -f ${FILE} ]; then
>     rm $FILE;
> fi
>
> while [ 1 ]; do
>     date;
>     iperf -p 8000 -c 192.168.100.1 -t 600 -P ${1} >> $FILE;
> done
>
> --
> You are receiving this mail because:
> You are the assignee for the bug.
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
>
>


