Date:      Sun, 24 Feb 2019 01:23:11 +0100
From:      Andreas Kempe <kempe@lysator.liu.se>
To:        freebsd-net@freebsd.org
Subject:   Infiniband: Mellanox MT26418 in ethernet mode causes crash on shutdown
Message-ID:  <8763252f-d433-5e1e-9e3b-628e0545c8eb@lysator.liu.se>

Hello,

When running a Mellanox MT26418 in Ethernet mode, the kernel crashes
with the following stack trace on system shutdown:

> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x0
> fault code      = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff80e3f5f4
> stack pointer           = 0x28:0xfffffe064abec6e0
> frame pointer           = 0x28:0xfffffe064abec700
> code segment        = base 0x0, limit 0xfffff, type 0x1b
>             = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags    = interrupt enabled, resume, IOPL = 0
> current process     = 1 (init)
> trap number     = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0xffffffff80b4c5b7 at kdb_backtrace+0x67
> #1 0xffffffff80b05b57 at vpanic+0x177
> #2 0xffffffff80b059d3 at panic+0x43
> #3 0xffffffff8106efdf at trap_fatal+0x35f
> #4 0xffffffff8106f039 at trap_pfault+0x49
> #5 0xffffffff8106e807 at trap+0x2c7
> #6 0xffffffff8104f03c at calltrap+0x8
> #7 0xffffffff80e3fae2 at mlx4_en_stop_port+0x3d2
> #8 0xffffffff80e40ff6 at mlx4_en_destroy_netdev+0x1e6
> #9 0xffffffff80e3e47d at mlx4_en_remove+0xcd
> #10 0xffffffff80e1ab01 at mlx4_remove_device+0xb1
> #11 0xffffffff80e1b0b8 at mlx4_unregister_device+0x98
> #12 0xffffffff80e1c5c5 at mlx4_unload_one+0x85
> #13 0xffffffff80e23543 at mlx4_shutdown+0x83
> #14 0xffffffff80d6b6e9 at linux_pci_shutdown+0x39
> #15 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #16 0xffffffff80b4004a at bus_generic_shutdown+0x5a
> #17 0xffffffff80b4004a at bus_generic_shutdown+0x5a

I've traced the issue to the following lines of code in
sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c in mlx4_en_destroy_netdev():
>     /* Unregister device - this will close the port if it was up */
>     if (priv->registered) {
>         mutex_lock(&mdev->state_lock);
>         ether_ifdetach(dev);
>         mutex_unlock(&mdev->state_lock);
>     }
>
>     mutex_lock(&mdev->state_lock);
>     mlx4_en_stop_port(dev);
>     mutex_unlock(&mdev->state_lock);
> 

The issue is that mlx4_en_stop_port() calls mlx4_en_put_qp(), which
tries to fetch the MAC address of the device. The call chain is:
mlx4_en_destroy_netdev->mlx4_en_stop_port->mlx4_en_put_qp

The sequence above causes the page fault because the MAC address was
already freed in if_detach_internal() during the preceding call to
ether_ifdetach(), via the following call chain:
mlx4_en_destroy_netdev->ether_ifdetach->if_detach->if_detach_internal

I've written a small workaround that works on our test machine, although
I suspect it could cause issues, as we're destroying the port before we
destroy the interface. Please see the attached patch for the workaround.
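
To make the ordering problem concrete, here is a minimal userland sketch
in C. It is only a model under stated assumptions: struct model_ifnet and
the model_* functions are hypothetical stand-ins, not the real ifnet or
mlx4 code, and the stop step returns -1 where the driver would fault on
the freed address instead of actually dereferencing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the interface; NOT the real struct ifnet. */
struct model_ifnet {
	unsigned char *lladdr;	/* link-level (MAC) address, heap allocated */
	int port_up;
};

/* Models the effect of ether_ifdetach(): the address storage goes away. */
static void
model_ifdetach(struct model_ifnet *ifp)
{
	assert(ifp != NULL);
	free(ifp->lladdr);
	ifp->lladdr = NULL;
}

/*
 * Models mlx4_en_stop_port() -> mlx4_en_put_qp(), which needs the MAC.
 * Returns 0 on success, or -1 where the real code would read through a
 * NULL pointer (assumption: this is the 0x0 fault address in the trap).
 */
static int
model_stop_port(struct model_ifnet *ifp)
{
	unsigned char mac[6];

	assert(ifp != NULL);
	if (ifp->lladdr == NULL)
		return (-1);
	memcpy(mac, ifp->lladdr, sizeof(mac));
	(void)mac;
	ifp->port_up = 0;
	return (0);
}

/*
 * Runs the teardown in one of two orders and returns the result of the
 * stop step: stop_first == 0 is the current order (detach, then stop),
 * stop_first != 0 is the workaround order (stop, then detach).
 */
static int
model_destroy(int stop_first)
{
	struct model_ifnet ifp;
	int error;

	ifp.lladdr = calloc(6, 1);
	if (ifp.lladdr == NULL)
		return (-2);
	ifp.port_up = 1;

	if (stop_first) {
		error = model_stop_port(&ifp);
		model_ifdetach(&ifp);
	} else {
		model_ifdetach(&ifp);
		error = model_stop_port(&ifp);
	}
	return (error);
}
```

Calling model_destroy(0) follows the current order and reports the
failure, while model_destroy(1) follows the workaround order and
succeeds. The model says nothing about whether stopping the port first
is safe for the rest of the detach path.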

Cordially,
Andreas Kempe
Lysator ACS

Attachment: mlx_destroy_work_around.patch

--- sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c.old	2019-02-24 01:01:54.759307000 +0100
+++ sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c	2019-02-24 01:04:07.872558000 +0100
@@ -1764,16 +1764,19 @@
 	if (priv->vlan_detach != NULL)
 		EVENTHANDLER_DEREGISTER(vlan_unconfig, priv->vlan_detach);
 
+	/* Bring the interface down before destroying the port. */
+	if_down(dev);
+
+	mutex_lock(&mdev->state_lock);
+	mlx4_en_stop_port(dev);
+	mutex_unlock(&mdev->state_lock);
+
 	/* Unregister device - this will close the port if it was up */
 	if (priv->registered) {
 		mutex_lock(&mdev->state_lock);
 		ether_ifdetach(dev);
 		mutex_unlock(&mdev->state_lock);
 	}
-
-	mutex_lock(&mdev->state_lock);
-	mlx4_en_stop_port(dev);
-	mutex_unlock(&mdev->state_lock);
 
 	if (priv->allocated)
 		mlx4_free_hwq_res(mdev->dev, &priv->res, MLX4_EN_PAGE_SIZE);


