From owner-freebsd-net@FreeBSD.ORG Sat Dec 14 05:04:58 2013
Date: Sat, 14 Dec 2013 00:04:57 -0500
Subject: buf_ring in HEAD is racy
From: Ryan Stone
To: freebsd-net

I am seeing spurious output packet drops that appear to be due to
insufficient memory barriers in buf_ring.  I believe that this is the
scenario that I am seeing:

1) The buf_ring is empty: br_prod_head = br_cons_head = 0.

2) Thread 1 attempts to enqueue an mbuf on the buf_ring.  It fetches
   br_prod_head (0) into a local variable called prod_head.

3) Thread 2 enqueues an mbuf on the buf_ring.  The sequence of events
   is essentially:

   Thread 2 claims an index in the ring and atomically sets
     br_prod_head (say, to 1)
   Thread 2 sets br_ring[1] = mbuf
   Thread 2 does a full memory barrier
   Thread 2 updates br_prod_tail to 1

4) Thread 2 dequeues the packet from the buf_ring using the
   single-consumer interface.  The sequence of events is essentially:

   Thread 2 checks whether the queue is empty (br_cons_head ==
     br_prod_tail); this is false
   Thread 2 sets br_cons_head to 1
   Thread 2 grabs the mbuf from br_ring[1]
   Thread 2 sets br_cons_tail to 1

5) Thread 1, still attempting to enqueue an mbuf on the ring, now
   fetches br_cons_tail (1) into a local variable called cons_tail.
   It sees cons_tail == 1 but prod_head == 0, concludes that the ring
   is full, and drops the packet (incrementing br_drops non-atomically,
   I might add).

I can reproduce several drops per minute by configuring the ixgbe
driver to use only one queue and then sending traffic from 8
concurrent iperf processes.  (You will need this hacky patch to even
see the drops with netstat, though:
http://people.freebsd.org/~rstone/patches/ixgbe_br_drops.diff)

I am investigating fixing buf_ring by using acquire/release semantics
rather than load/store barriers; the sketches below show the enqueue
path as it stands and the direction I have in mind.
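For reference, here is a condensed paraphrase of the
buf_ring_enqueue() fast path in sys/buf_ring.h (debug code and the
critical section elided; this is from memory for discussion, not the
verbatim source), annotated with where the scenario above goes wrong:

    static __inline int
    buf_ring_enqueue(struct buf_ring *br, void *buf)
    {
            uint32_t prod_head, prod_next, cons_tail;

            do {
                    /* Step 2: Thread 1 reads prod_head == 0 here. */
                    prod_head = br->br_prod_head;
                    /*
                     * Nothing orders these two reads, and nothing
                     * stops all of steps 3 and 4 from completing
                     * before the next read, so Thread 1 can observe
                     * cons_tail == 1 against a stale prod_head.
                     */
                    cons_tail = br->br_cons_tail;

                    prod_next = (prod_head + 1) & br->br_prod_mask;

                    if (prod_next == cons_tail) {
                            /*
                             * Step 5: 1 == 1, so we declare the ring
                             * full and drop, without ever validating
                             * prod_head against br_prod_head.
                             */
                            br->br_drops++;   /* not atomic */
                            return (ENOBUFS);
                    }
            } while (!atomic_cmpset_int(&br->br_prod_head, prod_head,
                prod_next));

            /* Step 3: fill the slot, barrier, then publish the tail. */
            br->br_ring[prod_head] = buf;
            mb();
            while (br->br_prod_tail != prod_head)
                    cpu_spinwait();
            br->br_prod_tail = prod_next;
            return (0);
    }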
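To give a flavor of the acquire/release direction (an untested sketch
using atomic(9) primitives, not a patch), the hand-off of br_prod_tail
between producer and consumer would become an explicit release/acquire
pair:

    /*
     * Producer side: fill the slot, then publish it with release
     * semantics, so the br_ring[] store is visible before the new
     * tail is.
     */
    br->br_ring[prod_head] = buf;
    while (br->br_prod_tail != prod_head)
            cpu_spinwait();
    atomic_store_rel_32(&br->br_prod_tail, prod_next);

    /*
     * Consumer side: load the tail with acquire semantics, pairing
     * with the release above, before looking at br_ring[].
     */
    prod_tail = atomic_load_acq_32(&br->br_prod_tail);
    if (cons_head == prod_tail)
            return (NULL);                  /* ring is empty */
    buf = br->br_ring[cons_head];

The loads of br_prod_head and br_cons_tail in the enqueue path would
need the same treatment, which is where the barrier count starts to
add up.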
However, I note that this will apparently be the second attempt to fix
buf_ring, and I'm seriously questioning whether it is worth the effort
compared to the simplicity of just using a mutex (sketched below).
I'm not convinced that a correct lockless implementation will even be
a performance win, given the number of memory barriers that will
apparently be necessary.
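For comparison, the locked version fits in a dozen lines and is hard
to get wrong.  A hypothetical sketch (br_mtx is not a field that
struct buf_ring actually has; it is assumed here for illustration):

    static __inline int
    buf_ring_enqueue_locked(struct buf_ring *br, void *buf)
    {
            uint32_t next;

            mtx_lock(&br->br_mtx);          /* hypothetical field */
            next = (br->br_prod_head + 1) & br->br_prod_mask;
            if (next == br->br_cons_tail) {
                    br->br_drops++;         /* now serialized, too */
                    mtx_unlock(&br->br_mtx);
                    return (ENOBUFS);
            }
            br->br_ring[br->br_prod_head] = buf;
            br->br_prod_head = next;
            br->br_prod_tail = next;        /* head == tail under the lock */
            mtx_unlock(&br->br_mtx);
            return (0);
    }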