Date:      Thu, 8 Apr 2004 22:36:18 +0300
From:      Ruslan Ermilov <ru@FreeBSD.org>
To:        Bruce Evans <bde@zeta.org.au>
Cc:        net@FreeBSD.org
Subject:   Re: sk ethernet driver: watchdog timeout
Message-ID:  <20040408193618.GA1919@ip.net.ua>
In-Reply-To: <20040407235838.K11719@gamplex.bde.org>
References:  <20240000.1079394807@palle.girgensohn.se> <wpy8q04buf.fsf@heho.snv.jussieu.fr> <3810000.1081299464@palle.girgensohn.se> <20040407235838.K11719@gamplex.bde.org>


On Thu, Apr 08, 2004 at 12:17:06AM +1000, Bruce Evans wrote:
[...]
> The following patch reduces the problem on A7V8X-E a little.  It limits
> the tx queue to 1 packet and fixes handling of the timeout on txeof.
> The first part probably makes the second part a no-op.  Without this,
> my A7V8X-E hangs on even light nfs activity (e.g., copying a 1MB file
> to nfs).  With it, it takes heavier nfs activity to hang (makeworld
> never completes, and a flood ping always hangs).
>
> I first suspected an interrupt-related bug, but the bug seems to be
> more hardware-specific.  Examination of the output queues shows that
> the tx sometimes just stops before processing all packets.  Resetting
> in sk_watchdog() doesn't always fix the problem, and the timeout usually
> stops firing after a couple of unsuccessful resets, giving a completely
> hung device.  But the problem may be related to interrupt timing, since
> it is much smaller under RELENG_4.  RELENG_4 hangs about as often
> without this hack as -current does with it.
>
> nv0 hangs similarly.  fxp0 just works.
>
> %%%
> Index: if_sk.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/pci/if_sk.c,v
> retrieving revision 1.78
> diff -u -2 -r1.78 if_sk.c
> --- if_sk.c	31 Mar 2004 12:35:51 -0000	1.78
> +++ if_sk.c	1 Apr 2004 07:33:58 -0000
> @@ -1830,4 +1830,9 @@
>  	SK_IF_LOCK(sc_if);
>
> +	if (sc_if->sk_cdata.sk_tx_cnt > 0) {
> +		SK_IF_UNLOCK(sc_if);
> +		return;
> +	}
> +
>  	idx = sc_if->sk_cdata.sk_tx_prod;
>
> @@ -1853,4 +1858,5 @@
>  		 */
>  		BPF_MTAP(ifp, m_head);
> +		break;
>  	}
>
> @@ -2000,5 +2031,4 @@
>  		sc_if->sk_cdata.sk_tx_cnt--;
>  		SK_INC(idx, SK_TX_RING_CNT);
> -		ifp->if_timer = 0;
>  	}
>
> @@ -2007,4 +2037,6 @@
>  	if (cur_tx != NULL)
>  		ifp->if_flags &= ~IFF_OACTIVE;
> +
> +	ifp->if_timer = (sc_if->sk_cdata.sk_tx_cnt == 0) ? 0 : 5;
>
>  	return;
> %%%
>
Always recharging the timer to 5 whenever some TX work is still
pending is a bug.  With DEVICE_POLLING (yes, I plan to add
polling(4) support to sk(4) too), sk_txeof() will be called
periodically, so if the card gets stuck, if_timer will never
count down to zero and sk_watchdog() will never be called.
Without DEVICE_POLLING, recharging it to 5 once if_timer has
reached 0 is still pointless: if if_timer is 0 while we are in
sk_txeof(), we were called from sk_watchdog(), which will reinit
the card and empty both the RX and TX lists.  Leaving if_timer
at 5 _after_ the watchdog cleanup, with _no_ TX activity at all,
may cause a second (false) watchdog.  My version of the TX fixes
(which also fixes the resetting of IFF_OACTIVE):

%%%
Index: if_sk.c
===================================================================
RCS file: /home/ncvs/src/sys/pci/if_sk.c,v
retrieving revision 1.78
diff -u -p -r1.78 if_sk.c
--- if_sk.c	31 Mar 2004 12:35:51 -0000	1.78
+++ if_sk.c	8 Apr 2004 19:10:50 -0000
@@ -1998,14 +1998,14 @@ sk_txeof(sc_if)
 			sc_if->sk_cdata.sk_tx_chain[idx].sk_mbuf = NULL;
 		}
 		sc_if->sk_cdata.sk_tx_cnt--;
+		ifp->if_flags &= ~IFF_OACTIVE;
 		SK_INC(idx, SK_TX_RING_CNT);
-		ifp->if_timer = 0;
 	}
 
 	sc_if->sk_cdata.sk_tx_cons = idx;
 
-	if (cur_tx != NULL)
-		ifp->if_flags &= ~IFF_OACTIVE;
+	if (sc_if->sk_cdata.sk_tx_cnt == 0)
+		ifp->if_timer = 0;
 
 	return;
 }
%%%
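
As a rough illustration (not part of either patch), the toy model
below compares the two if_timer policies on a card that is stuck with
one TX descriptor outstanding.  Everything in it is hypothetical:
struct toy_if, toy_txeof(), toy_slowtimo() and toy_run() are made-up
names, and the one-second tick is only assumed to behave like the
kernel's decrement-and-fire handling of if_timer.

```c
/*
 * Toy userland model of the if_timer watchdog logic.  All names here
 * are hypothetical; this is not the real struct ifnet or sk(4) code.
 */
struct toy_if {
	int if_timer;	/* seconds until the watchdog fires; 0 = disarmed */
	int tx_cnt;	/* TX descriptors still owned by the (stuck) card */
	int watchdogs;	/* how many times the watchdog has fired */
};

enum toy_policy {
	RECHARGE_WHILE_BUSY,	/* timer = (tx_cnt == 0) ? 0 : 5 */
	CLEAR_WHEN_IDLE		/* only disarm when tx_cnt == 0 */
};

/* Tail of a txeof routine; the card is stuck, so tx_cnt never drops. */
static void
toy_txeof(struct toy_if *ifp, enum toy_policy p)
{
	if (p == RECHARGE_WHILE_BUSY)
		ifp->if_timer = (ifp->tx_cnt == 0) ? 0 : 5;
	else if (ifp->tx_cnt == 0)
		ifp->if_timer = 0;
}

/* One-second tick: count an armed timer down and fire when it hits 0. */
static void
toy_slowtimo(struct toy_if *ifp)
{
	if (ifp->if_timer == 0 || --ifp->if_timer != 0)
		return;
	ifp->watchdogs++;
}

/*
 * Run a stuck card for `seconds`, calling toy_txeof() `polls_per_sec`
 * times per second (0 models interrupt mode with no TX interrupts).
 * Returns the number of watchdog events seen.
 */
static int
toy_run(enum toy_policy p, int seconds, int polls_per_sec)
{
	struct toy_if ifp = { .if_timer = 5, .tx_cnt = 1, .watchdogs = 0 };

	for (int s = 0; s < seconds; s++) {
		for (int i = 0; i < polls_per_sec; i++)
			toy_txeof(&ifp, p);
		toy_slowtimo(&ifp);
	}
	return (ifp.watchdogs);
}
```

In this model, toy_run(RECHARGE_WHILE_BUSY, 10, 100) sees no watchdog
at all, because every poll pushes the timer back to 5, while
toy_run(CLEAR_WHEN_IDLE, 10, 100) fires once, on the fifth tick.  The
model does not attempt to reproduce the reinit done by the watchdog.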

We have been running the 3COM 3C940 card on 4.9 (and, as of today,
on 4.10-BETA) under a heavy TX load without any problems.


Cheers,
-- 
Ruslan Ermilov
ru@FreeBSD.org
FreeBSD committer



