Date:      Sat, 16 Sep 2017 19:41:18 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alexander Leidinger <Alexander@leidinger.net>
Cc:        Bruce Evans <brde@optusnet.com.au>, Scott Long <scottl@samsco.org>,  Sean Bruno <sbruno@freebsd.org>, Stephen Hurd <shurd@freebsd.org>,  Cy Schubert <Cy.Schubert@komquats.com>,  Ngie Cooper <yaneurabeya@gmail.com>, src-committers@freebsd.org,  svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r323516 - in head/sys: dev/bnxt dev/e1000 kern net sys
Message-ID:  <20170916192800.E14782@besplex.bde.org>
In-Reply-To: <20170916110159.Horde.lN4uQjj9fb7hJ2309eQrexb@webmail.leidinger.net>
References:  <201709130711.v8D7BlTS003204@slippy.cwsent.com> <48654d1f-4cc7-da05-7a73-ef538b431560@freebsd.org> <1EBD0641-002D-409C-B18E-AAB5FCDECEBA@samsco.org> <20170916124826.P1107@besplex.bde.org> <20170916110159.Horde.lN4uQjj9fb7hJ2309eQrexb@webmail.leidinger.net>

On Sat, 16 Sep 2017, Alexander Leidinger wrote:

> Quoting Bruce Evans <brde@optusnet.com.au> (from Sat, 16 Sep 2017 13:46:37 
> +1000 (EST)):
>
>> It gives lesser breakage here:
>> - with an old PCI em, an error that occurs every few makeworlds over nfs now
>>   hangs the hardware.  It used to be recovered from after about 10 seconds.
>>   This only happened once.  I then applied my old fix which ignores the
>>   error better so as to recover from it immediately.  This seems to work as
>>   before.
>
> As I also have an em device which switches into non-working state: what's the 
> patch you have for this? I would like to see if your change also helps my 
> device to get back into working shape again.

X Index: em_txrx.c
X ===================================================================
X --- em_txrx.c	(revision 323636)
X +++ em_txrx.c	(working copy)
X @@ -640,9 +640,20 @@
X 
X  		/* Make sure bad packets are discarded */
X  		if (errors & E1000_RXD_ERR_FRAME_ERR_MASK) {
X +#if 0
X  			adapter->dropped_pkts++;
X -			/* XXX fixup if common */
X  			return (EBADMSG);
X +#else
X +			/*
X +			 * XXX the above error handling is worse than none.
X +			 * First it drops 'i' packets before the current
X +			 * one and doesn't count them.  Then it returns an
X +			 * error.  iflib can't really handle this error.
X +			 * It just resets, and this usually drops many more
X +			 * packets (without counting them) and much time.
X +			 */
X +			printf("lem: frame error: ignored\n");
X +#endif
X  		}
X 
X  		ri->iri_frags[i].irf_flid = 0;

This is for old em.  nfs doesn't seem to notice the dropped packet(s) after
this.

I think the comment "fixup if common" means "this error should actually
be handled if it occurs enough to matter".

I removed the increment of the dropped packet count because with this change
no packets are dropped directly here.  I think the error applies only to this
packet.  Returning in the old code might drop more than 1 packet, although
debugging code seems to show no more than 1 packet at a time having an error.
I think returning drops good packets after the bad one and also leaves the
state inconsistent, so that it takes almost a reset to recover.
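
To illustrate the difference between the two policies, here is a rough toy
model in C.  It is not the driver or iflib code: the toy_* names, the
descriptor layout and the "reset discards the rest of the ring" behaviour are
all assumptions made up for the sketch, based only on the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified rx descriptor; not the real e1000 layout. */
struct toy_rxd {
	int frame_err;		/* nonzero if a frame error was flagged */
	int eop;		/* nonzero on the last descriptor of a packet */
};

struct toy_stats {
	int delivered;		/* packets handed up the stack */
	int counted_drops;	/* drops that show up in a counter */
	int resets;		/* simulated resets done by the caller */
};

enum toy_policy { TOY_RETURN_ERROR, TOY_IGNORE_ERROR };

/*
 * Gather one packet starting at ring[*idx].  The ring must end with a
 * descriptor that has eop set.  Returns 0, or EBADMSG under the old policy.
 */
static int
toy_pkt_get(const struct toy_rxd *ring, size_t n, size_t *idx,
    enum toy_policy policy, struct toy_stats *st)
{
	int eop;

	do {
		assert(*idx < n);
		const struct toy_rxd *d = &ring[*idx];

		if (d->frame_err && policy == TOY_RETURN_ERROR) {
			/* Old behaviour: count one drop and bail out. */
			st->counted_drops++;
			return (EBADMSG);
		}
		/* New behaviour: fall through and keep the descriptor. */
		eop = d->eop;
		(*idx)++;
	} while (!eop);

	st->delivered++;
	return (0);
}

/*
 * Assumed caller behaviour: any error triggers a reset, and whatever is
 * still in the ring is lost without being counted.
 */
static void
toy_rx_poll(const struct toy_rxd *ring, size_t n, enum toy_policy policy,
    struct toy_stats *st)
{
	size_t idx = 0;

	while (idx < n) {
		if (toy_pkt_get(ring, n, &idx, policy, st) != 0) {
			st->resets++;
			return;
		}
	}
}
```

With a ring of 5 single-descriptor packets where only the second one has a
frame error, the old policy in this model delivers 1 packet, counts 1 drop and
silently loses the 3 good packets behind it.  Ignoring the error delivers all
5 and leaves it to the upper layers to reject the bad one.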

X @@ -703,8 +714,12 @@
X 
X  		/* Make sure bad packets are discarded */
X  		if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
X +#if 0
X  			adapter->dropped_pkts++;
X  			return EBADMSG;
X +#else
X +			printf("em: frame error: ignored\n");
X +#endif
X  		}
X 
X  		ri->iri_frags[i].irf_flid = 0;

This is for newer em.  I haven't noticed any problems with that (except it
has 27 usec higher latency).

Bruce


