From owner-freebsd-bugs Fri Sep 13 3:40:25 2002
Delivered-To: freebsd-bugs@hub.freebsd.org
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 2262B37B400
	for ; Fri, 13 Sep 2002 03:40:04 -0700 (PDT)
Received: from freefall.freebsd.org (freefall.FreeBSD.org [216.136.204.21])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 33A0843E75
	for ; Fri, 13 Sep 2002 03:40:03 -0700 (PDT)
	(envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1])
	by freefall.freebsd.org (8.12.4/8.12.4) with ESMTP id g8DAe3JU007454
	for ; Fri, 13 Sep 2002 03:40:03 -0700 (PDT)
	(envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.12.4/8.12.4/Submit) id g8DAe3oh007453;
	Fri, 13 Sep 2002 03:40:03 -0700 (PDT)
Received: from mx1.FreeBSD.org (mx1.FreeBSD.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3D4D837B400
	for ; Fri, 13 Sep 2002 03:31:47 -0700 (PDT)
Received: from blueyonder.co.uk (pcow057o.blueyonder.co.uk [195.188.53.94])
	by mx1.FreeBSD.org (Postfix) with ESMTP id BEBE143E4A
	for ; Fri, 13 Sep 2002 03:31:40 -0700 (PDT)
	(envelope-from dominic@indigo-ic.co.uk)
Received: from pcow057o.blueyonder.co.uk ([127.0.0.1])
	by blueyonder.co.uk with Microsoft SMTPSVC(5.5.1877.757.75);
	Fri, 13 Sep 2002 11:31:39 +0100
Received: from the-mayor.dom (unverified [62.31.234.90])
	by pcow057o.blueyonder.co.uk (Content Technologies SMTPRS 4.2.9) with ESMTP id
	for ; Fri, 13 Sep 2002 11:31:39 +0100
Received: from the-mayor.dom (localhost [127.0.0.1])
	by the-mayor.dom (8.12.3/8.12.3) with ESMTP id g8DAVcAK006141
	for ; Fri, 13 Sep 2002 11:31:38 +0100 (BST)
	(envelope-from dominic@indigo-ic.co.uk)
Received: (from root@localhost)
	by the-mayor.dom (8.12.3/8.12.3/Submit) id g8DAV0QI006119;
	Fri, 13 Sep 2002 11:31:00 +0100 (BST)
Message-Id: <200209131031.g8DAV0QI006119@the-mayor.dom>
Date: Fri, 13 Sep 2002 11:31:00 +0100 (BST)
From: Dominic Froud
Reply-To: Dominic Froud
To: FreeBSD-gnats-submit@FreeBSD.org
X-Send-Pr-Version: 3.113
Subject: kern/42727: [PATCH] Wrong MTU in need-frag ICMP using IPSEC tunnels w/out GIF
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
List-ID:
List-Archive: (Web Archive)
List-Help: (List Instructions)
List-Subscribe:
List-Unsubscribe:
X-Loop: FreeBSD.org

>Number:         42727
>Category:       kern
>Synopsis:       [PATCH] Wrong MTU in need-frag ICMP using IPSEC tunnels w/out GIF
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Sep 13 03:40:02 PDT 2002
>Closed-Date:
>Last-Modified:
>Originator:     Dominic Froud
>Release:        FreeBSD 4.6-RELEASE i386
>Organization:
>Environment:
System: FreeBSD the-mayor.dom 4.6-RELEASE FreeBSD 4.6-RELEASE #17: Wed Sep 11 17:13:53 BST 2002 root@the-mayor.dom:/usr/src/sys/compile/SERVER i386

Kernel options: INET INET6 IPSEC IPSEC_ESP IPSEC_DEBUG MROUTING IPFIREWALL IPFIREWALL_FORWARD IPDIVERT RANDOM_IP_ID ICMP_BANDLIM

Server has two Macronix 98715AEC-C 10/100BaseTX cards at dc0 and dc1.

net.inet.ipsec.dfbit=1
>Description:
I bridged my LAN (subnet 10.0.1.0/24) with a friend's LAN (10.0.0.0/24) using IPSEC tunnels without GIF devices. I use FreeBSD 4.6 and he uses Red Hat Linux 7.x.

My friend couldn't pull any packets from machines on my LAN that required MTU reduction to prevent fragmentation, e.g. SMB TCP packets.

Upon further inspection, my FreeBSD server was telling the machine on my LAN that fragmentation was needed, but was suggesting an incorrect MTU of 1500 instead of one that took the IPSEC tunnel headers into account. This caused the machine on my LAN to simply retry the same over-sized packet again and again, so the requesting machine on his LAN eventually timed out with a short read.

[The short-read timeout is a common symptom of other MTU issues too, but this specific issue can be accurately diagnosed.]
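As a rough illustration of the arithmetic involved: the MTU advertised in the need-frag ICMP should be the outgoing link MTU less the tunnel encapsulation overhead. The sketch below is a hypothetical userland helper (reduced_mtu() is not a kernel function). The 57-byte overhead is only inferred from the difference between the 1500 and 1443 values that appear in the tcpdump traces in this report; the real figure depends on the ESP algorithms in use.

```c
#include <assert.h>

/*
 * Hypothetical helper, not kernel code: the path MTU to advertise in an
 * ICMP "fragmentation needed" message is the outgoing link MTU minus
 * the bytes the IPsec tunnel adds to each packet.
 */
static int
reduced_mtu(int link_mtu, int ipsec_overhead)
{
	assert(link_mtu > 0);

	/* Ignore a nonsensical overhead rather than advertise a bogus MTU. */
	if (ipsec_overhead < 0 || ipsec_overhead >= link_mtu)
		return (link_mtu);

	return (link_mtu - ipsec_overhead);
}
```

Under those assumptions, reduced_mtu(1500, 57) gives 1443, whereas skipping the subtraction leaves the 1500 that the sending host can never satisfy.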
There is code in netinet/ip_input.c:ip_forward() that should deal with this, but it never gets the chance to do the calculation because an earlier IPSEC function call returns an error.

In ip_forward(), before ip_output() is called, a rough copy of the top mbuf at 'm' is made and pointed to by 'mcopy'. Only the IP header and up to 8 bytes of payload are copied, but the length stored in the packet header (m_pkthdr) remains unchanged and still reflects the original packet length.

If ip_forward()'s call to ip_output() fails with EMSGSIZE and the packet would have traversed an IPSEC tunnel, then ipsec_setspidx() in netinet6/ipsec.c is (eventually) called. It sanity-checks the mbuf it is passed and fails with an error like:

"ipsec_setspidx: total of m_len(28) != pkthdr.len(1500), ignored."

The 28 is the truncated length of mcopy (IP header + max 8 bytes) and the 1500 is the size of the original packet. Hence the rest of the reduced-MTU calculation is skipped at this point and an unchanged MTU is used to construct the ICMP frag-needed packet.
>How-To-Repeat:
Bridge two subnets using IPSEC tunnels without the GIF device.
If you bridge the encapsulating machines themselves as well, you should end up with 8 policies like the following:

81.5.133.243[any] 10.0.1.0/24[any] any
	in ipsec
	esp/tunnel/81.5.133.243-62.31.234.90/require
	spid=1 seq=7 pid=235
	refcnt=1
81.5.133.243[any] 62.31.234.90[any] any
	in ipsec
	esp/tunnel/81.5.133.243-62.31.234.90/require
	spid=3 seq=6 pid=235
	refcnt=1
10.0.0.0/24[any] 10.0.1.0/24[any] any
	in ipsec
	esp/tunnel/81.5.133.243-62.31.234.90/require
	spid=5 seq=5 pid=235
	refcnt=1
10.0.0.0/24[any] 62.31.234.90[any] any
	in ipsec
	esp/tunnel/81.5.133.243-62.31.234.90/require
	spid=7 seq=4 pid=235
	refcnt=1
10.0.1.0/24[any] 81.5.133.243[any] any
	out ipsec
	esp/tunnel/62.31.234.90-81.5.133.243/require
	spid=2 seq=3 pid=235
	refcnt=1
62.31.234.90[any] 81.5.133.243[any] any
	out ipsec
	esp/tunnel/62.31.234.90-81.5.133.243/require
	spid=4 seq=2 pid=235
	refcnt=1
10.0.1.0/24[any] 10.0.0.0/24[any] any
	out ipsec
	esp/tunnel/62.31.234.90-81.5.133.243/require
	spid=6 seq=1 pid=235
	refcnt=1
62.31.234.90[any] 10.0.0.0/24[any] any
	out ipsec
	esp/tunnel/62.31.234.90-81.5.133.243/require
	spid=8 seq=0 pid=235
	refcnt=1

I am 62.31.234.90 with protected subnet 10.0.1.0/24. Peer is 81.5.133.243 with protected subnet 10.0.0.0/24. I also have net.inet.ipsec.dfbit set to 1 via /etc/sysctl.conf.

I logged into the peer's server and used smbclient to request a file from 10.0.1.20 (a win98se machine).

Just before each test, make sure all your SAD entries are 'mature' and relatively fresh (i.e. not about to die on you during your test) using "setkey -D | egrep '(diff|state)'".

Use tcpdump to log data packets from, and icmp packets to, your protected host (in my case this was 10.0.1.20).

Increase IPSEC logging using "sysctl net.key.debug=0x45". To turn these messages off, use "sysctl net.key.debug=0".

Now try to transfer a file from your target host that is bigger than your MTU (i.e. more than 1500 bytes; say, 16 Kbytes).
tcpdump will produce output like:

11:44:03.378193 10.0.1.20.139 > 81.5.133.243.43396: tcp 1460 (DF) (ttl 128, id 26226, len 1500)
11:44:03.387030 10.0.1.2 > 10.0.1.20: icmp: 81.5.133.243 unreachable - need to frag (mtu 1500) (DF) (ttl 64, id 48070, len 56)
11:44:04.778191 10.0.1.20.139 > 81.5.133.243.43396: tcp 1460 (DF) (ttl 128, id 26226, len 1500)
11:44:04.787022 10.0.1.2 > 10.0.1.20: icmp: 81.5.133.243 unreachable - need to frag (mtu 1500) (DF) (ttl 64, id 48070, len 56)
(pattern repeats)

Your console should show lines like:

Sep 10 11:44:03 the-mayor /kernel: ipsec_setspidx: total of m_len(28) != pkthdr.len(1500), ignored.

The requesting host on the remote LAN will time out.
>Fix:
Simply update the packet length in mcopy->m_pkthdr.len to reflect the truncated nature of mcopy. This can be done at line 1799 in netinet/ip_input.c rev 1.130.2.35 for just the EMSGSIZE IPSEC case, or at line 1703 if this is of more general use within ip_forward() and the functions it calls.

I've tried the following diff at both line 1703 and line 1799, and both cure the problem as expected. On my machine, I've left the code in at line 1799 because I don't know whether other code using mcopy makes use of the original packet length.

--- patch begins here ---
--- ip_input.c	Wed Sep 11 17:55:09 2002
+++ ip_input.c-patched	Wed Sep 11 18:23:47 2002
@@ -1796,6 +1796,13 @@
 			int ipsechdr;
 			struct route *ro;
 
+			/* Pretend original packet was only this long
+			 * as IPSEC functions like ipsec_setspidx(),
+			 * called by ipsec4_getpolicybyaddr() below,
+			 * expect a sane mbuf chain.
+			 */
+			mcopy->m_pkthdr.len = mcopy->m_len;
+
 			sp = ipsec4_getpolicybyaddr(mcopy,
 			                            IPSEC_DIR_OUTBOUND,
 			                            IP_FORWARDING,
--- patch ends here ---

tcpdump with patched kernel looks like:

17:17:43.108193 10.0.1.20.139 > 81.5.133.243.43396: tcp 1460 (DF) (ttl 128, id 26226, len 1500)
17:17:43.108779 10.0.1.20.139 > 81.5.133.243.43396: tcp 652 (DF) (ttl 128, id 26482, len 692)
17:17:43.114394 10.0.1.2 > 10.0.1.20: icmp: 81.5.133.243 unreachable - need to frag (mtu 1443) (DF) (ttl 64, id 39851, len 56)
17:17:43.115869 10.0.1.20.139 > 81.5.133.243.43396: tcp 1403 (DF) (ttl 128, id 26738, len 1443)
>Release-Note:
>Audit-Trail:
>Unformatted:

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-bugs" in the body of the message
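
For readers who want to see the length mismatch from the Description, and the effect of the one-line fix, outside the kernel, here is a minimal userland model. It is only a sketch: struct toy_mbuf and the toy_* functions are invented for illustration and are not the real struct mbuf or any kernel API. The only behaviour taken from the report is that the check compares the sum of m_len over the chain with the packet header length, and that the fix sets the packet header length to the first buffer's m_len.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a kernel mbuf; NOT the real struct mbuf. */
struct toy_mbuf {
	struct toy_mbuf	*m_next;	/* next buffer in the chain, or NULL */
	int		 m_len;		/* bytes of data held in this buffer */
	int		 pkthdr_len;	/* total packet length; first buffer only */
};

/* Sum m_len over the whole chain. */
static int
toy_chain_len(const struct toy_mbuf *m)
{
	int total = 0;

	for (; m != NULL; m = m->m_next)
		total += m->m_len;
	return (total);
}

/* Returns 1 if the chain is consistent, 0 if the check would reject it. */
static int
toy_chain_is_sane(const struct toy_mbuf *m)
{
	assert(m != NULL);
	return (toy_chain_len(m) == m->pkthdr_len);
}

/* Model of the patch: make the header length match the truncated copy. */
static void
toy_fix_truncated_copy(struct toy_mbuf *mcopy)
{
	assert(mcopy != NULL);
	mcopy->pkthdr_len = mcopy->m_len;
}
```

With a single buffer holding 28 bytes but a header length of 1500, toy_chain_is_sane() returns 0, mirroring the "m_len(28) != pkthdr.len(1500)" console message; after toy_fix_truncated_copy() it returns 1. Note that the model, like the patch, assumes the truncated copy is a single buffer.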