From owner-freebsd-current@FreeBSD.ORG  Wed Aug 20 21:19:23 2014
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 46996A59;
 Wed, 20 Aug 2014 21:19:23 +0000 (UTC)
Received: from mail-pa0-x22e.google.com (mail-pa0-x22e.google.com
 [IPv6:2607:f8b0:400e:c03::22e])
 (using TLSv1 with cipher ECDHE-RSA-RC4-SHA (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 11AC73776;
 Wed, 20 Aug 2014 21:19:23 +0000 (UTC)
Received: by mail-pa0-f46.google.com with SMTP id lj1so12898170pab.5
 for <multiple recipients>; Wed, 20 Aug 2014 14:19:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:message-id:date:from:user-agent:mime-version:to:subject
 :references:in-reply-to:content-type:content-transfer-encoding;
 bh=wC3I/mNwfLeWhtE91XDSY9x3swf3q0V2anpuPnmDWHM=;
 b=lFrm3J9i9EARyasiC+gsrR20gCmEMn5sogrZ6fNK5KGB2E12nQtfmuQhzQpR+so0fj
 KzmKWfis5/7S/6nzq00uTEZsBx4e2viSjoltXGTjpx01aidtRvpkGigz/kiOwei+djqx
 SDywElYLdRbXJLzcbViQ/zd19q5io5pJp3xlDYBSydhaAnHbLJdS7GtUk2JJuM0rD1eR
 4ZftptVXdo15s1nWYSJhmca/gKuAWGPqlECPfea6Ba94QDO77wwGuF6yW6jtHhL2/p0s
 1YZKGwZg9Jzhy+lIUa/194WG50P8G+34+rGT3HsYoimLEkbAqsK9acMr6lS5VxE9j2HN
 ZYCg==
X-Received: by 10.70.128.17 with SMTP id nk17mr6082686pdb.89.1408569562649;
 Wed, 20 Aug 2014 14:19:22 -0700 (PDT)
Received: from [10.192.166.0] (stargate.chelsio.com. [67.207.112.58])
 by mx.google.com with ESMTPSA id a5sm35523341pdp.38.2014.08.20.14.19.21
 for <multiple recipients>
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Wed, 20 Aug 2014 14:19:21 -0700 (PDT)
Sender: Navdeep Parhar <nparhar@gmail.com>
Message-ID: <53F510D8.4060801@FreeBSD.org>
Date: Wed, 20 Aug 2014 14:19:20 -0700
From: Navdeep Parhar <np@FreeBSD.org>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64;
 rv:31.0) Gecko/20100101 Thunderbird/31.0
MIME-Version: 1.0
To: Hans Petter Selasky <hps@selasky.org>, freebsd-net@freebsd.org, 
 FreeBSD Current <freebsd-current@freebsd.org>
Subject: Re: [RFC] Add support for hardware transmit rate limiting queues
 [WAS: Add support for changing the flow ID of TCP connections]
References: <53BC2E73.6090700@selasky.org> <53BC43AE.3040409@FreeBSD.org>
 <53BD5385.4090208@selasky.org> <20140709163146.GA21731@ox>
 <53F44F91.2060006@selasky.org> <53F4EC8A.9090804@FreeBSD.org>
 <53F4F626.9030806@selasky.org>
In-Reply-To: <53F4F626.9030806@selasky.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 20 Aug 2014 21:19:23 -0000

On 08/20/14 12:25, Hans Petter Selasky wrote:
> On 08/20/14 20:44, Navdeep Parhar wrote:
>> On 08/20/14 00:34, Hans Petter Selasky wrote:
>>> Hi,
>>>
>>> A month has passed since the last e-mail on this topic, and in the
>>> meanwhile some new patches have been created and tested:
>>>
>>> Basically the approach has been changed a little bit:
>>>
>>> - The creation of hardware transmit rings has been made independent of
>>> the TCP stack. This allows firewall applications to forward traffic into
>>> hardware transmit rings aswell, and not only native TCP applications.
>>> This should be one more reason to get the feature into the kernel.
>>>
>>> - A hardware transmit ring basically can have two modes: FIXED-RATE or
>>> AUTOMATIC-RATE. In the fixed rate mode all traffic is sent at a fixed
>>> bytes per second rate. In the automatic mode you can configure a time
>>> after which the TX queue must be empty. The hardware driver uses this to
>>> configure the actual rate. In automatic mode you can also set an upper
>>> and lower transmit rate limit.
>>>
>>> - The MBUF has got a new field in the packet header: "txringid"
>>>
>>> - IOCTLs for TCP v4 and v6 sockets has been updated to allow setting of
>>> the "txringid" field in the mbuf.
>>>
>>> The current patch [see attachment] should be much simpler and less
>>> intrusive than the previous one.
>>>
>>> Any comments ?
>>
>>
>> Here are some thoughts.  The first two bullets cover relatively
>> minor issues, the rest are more important.
>>
>> - All of the mbuf pkthdr fields today have the same meaning no matter
>>    what the context.  It is not clear what txringid's global meaning is.
>>    Is it even possible for driver foo to interpret it the same way as
>>    driver bar?  What if the number of rings are different, or if the ring
>>    at the particular index for foo is setup differently than the ring at
>>    that same index for bar?  You are attempting to influence the driver's
>>    txq selection and traditionally the mbuf's flowid has been used for
>>    this purpose.  Have you considered allowing the user to set the flowid
>>    directly?  And mark it as such via a new rsstype so the kernel will
>>    leave it alone.
>
> Hi,
>
> At work so to speak, we have tried to make a simple approach that will
> not break existing code, without trying to optimise the possibilities
> and reduce memory footprint.
>
>>
>> - uint32_t -> m_flowid_t is plain gratuitous.  Now we need to include
>>    mbuf.h in more places just to get this definition.  What's the
>>    advantage of this?  style(9) isn't too fond of typedefs either.  Also,
>>    drivers *do* need to know the width of the flowid.  At least lagg(4)
>>    looks at the high bits of the flowid (see flowid_shift in lagg).  How
>>    high it can go depends on the width of the flowid.
>
> The flowid should be typedef'ed. Else how can you know its type passing
> flowid along function arguments and so on?

It's just a simple 32 bit unsigned int and all drivers know exactly what
it is.  I don't think we need type checking for trivial stuff like this.
We trust code to do the right thing and that's the correct tradeoff
here, in my opinion.  Or else we'd end up with errno_t, fd_t, etc. and
programming in C would not be fun anymore.  Here's a hyperbolic example:

errno_t socket(domain_t domain, socktype_t type, protocol_t protocol);

(oops, it returns an int -1 or 0 so errno_t is not strictly correct, but
you get my point).

>
>>
>> - Interfaces can come and go, routes can change, and so the relationship
>>    between an inpcb and txringid is not stable at all.  What happens when
>>    the outbound route for an inpcb changes?
>
> This is managed separately by a daemon or such. The problem about using
> the "inpcb" approach which you are suggesting, is that you limit the
> rate control feature to traffic which is bound by sockets. Can your way
> of doing rate control be useful to non-socket based firewall
> applications, for example?
>
> You also assume a 1:1 mapping between "inpcb" and the flowID, right.
> What about M:N mappings, where multiple streams should share the same
> flowID, because it makes more sense?

You're right that an inpcb based scheme won't work for non-socket based
firewall.

inpcb represents an endpoint, almost always with an associated socket,
and it mostly has a 1:1 relation with an n-tuple (SO_LISTEN and UDP
sockets with no default destination are notable exceptions).  If you're
talking of non-socket based firewalls, then where is the inpcb coming
from?  Firewalls typically keep their own state for the n-tuples that
they are interested in.  It almost seems like you need a n-tuple ->
rate_limit mapping scheme instead of inpcb -> rate_limit.

Regards,
Navdeep

>
>>
>> - The in_ratectlreq structure that you propose is inadequate in its
>>    current form.  For example, cxgbe's hardware can do rate limiting on a
>>    per-ring as well as per-connection basis, and it allows for pps,
>>    bandwidth, or min-max limits.  I think this is the critical piece that
>>    we NIC maintainers must agree on before any code hits the core kernel:
>>    how to express a rate-limit policy in a standard way and allow for
>>    hardware assistance opportunistically.  ipfw(4)'s dummynet is probably
>>    interested in this part too, so it's great that Luigi is paying
>>    attention to this thread.
>
> My "in_ratectlreq" is a work in progress.
>
>>
>> - The RATECTL ioctls deal with in_ratectlreq so we need to standardize
>>    the ratectlreq structure before these ioctls can be considered generic
>>    ifnet ioctls.  This is the reason cxgbetool (and not ifconfig) has a
>>    private ioctl to frob cxgbe's per-queue rate-limiters.  I did not want
>>    to add ifnet ioctls that in reality were cxgbe only.  Ditto for i2c
>>    ioctls.  Now we have multiple drivers with i2c and melifaro@ is doing
>>    the right thing by promoting these private ioctls to a standard ifnet
>>    ioctl.  Have you considered a private mlxtool as a stop gap measure?
>
> It might end that we need to create our own tool for this, having vendor
> specific IOCTLs, if we cannot agree how to do this in a general way.
>
>>
>> To summarize my take on all of this: we need a standard ratectlreq
>> structure,
>
> Agree.
>
>> a standard way to associate an inpcb with one,
>
> Maybe.
>
>> and a standard
>> way to pass on this info to if_transmit.
>
> Agree.
>
>> After all this is in place we
>> could even have a dummynet-ish software layer that implements rate
>> limiters when the underlying hardware offers no assistance.
>
> Right.
>
> --HPS