Date:      Thu, 28 Sep 2006 23:52:09 +0200
From:      Daniel Hartmeier <daniel@benzedrine.cx>
To:        Rolf Grossmann <grossman@progtech.net>
Cc:        freebsd-pf@freebsd.org
Subject:   Re: BAD state/State failure with large number of requests
Message-ID:  <20060928215208.GC25341@insomnia.benzedrine.cx>
In-Reply-To: <200609282130.k8SLUmU8089296@progtech.net>
References:  <200609282130.k8SLUmU8089296@progtech.net>

On Thu, Sep 28, 2006 at 11:30:48PM +0200, Rolf Grossmann wrote:

> Sep 28 23:56:56 balancer kernel: pf: BAD state: TCP 10.1.1.2:8080 10.25.0.41:8080 10.25.0.100:52209 [lo=2341692840 high=2341759447 win=33304 modulator=0 wscale=1] [lo=2919421554 high=2919488162 win=33304 modulator=0 wscale=1] 9:9 S seq=2345137961 ack=2919421554 len=0 ackskew=0 pkts=6:5 dir=in,fwd
> Sep 28 23:56:56 balancer kernel: pf: State failure on: 1       | 5

This means there is an existing state entry from an old (and already
closed) connection, and the client is re-using its source port 52209 for
a new connection attempt (it's a SYN packet that triggered the log
message).
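
As a rough illustration of how to read that log line, here is a minimal
Python sketch that pulls out the state pair and the TCP flags. The
regular expression and the function name parse_bad_state are my own
assumptions about the layout, based only on the one line quoted above;
they are not taken from pf itself.

```python
import re

# Log line copied from the quoted message above.
LINE = (
    "pf: BAD state: TCP 10.1.1.2:8080 10.25.0.41:8080 10.25.0.100:52209 "
    "[lo=2341692840 high=2341759447 win=33304 modulator=0 wscale=1] "
    "[lo=2919421554 high=2919488162 win=33304 modulator=0 wscale=1] "
    "9:9 S seq=2345137961 ack=2919421554 len=0 ackskew=0 pkts=6:5 dir=in,fwd"
)

# Assumed layout: "<state>:<state> <flag letters> seq=<n>" directly
# follows the second bracketed block.
PATTERN = re.compile(r"\]\s+(\d+):(\d+)\s+([A-Z]+)\s+seq=(\d+)")


def parse_bad_state(line):
    """Return the state pair, flags and sequence number, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "src_state": int(m.group(1)),
        "dst_state": int(m.group(2)),
        "flags": m.group(3),
        "seq": int(m.group(4)),
    }


info = parse_bad_state(LINE)
# A flags field of exactly "S" is a bare SYN, i.e. a new connection attempt.
is_bare_syn = info is not None and info["flags"] == "S"
print(info, is_bare_syn)
```

A flags field of just "S" on a packet that matches an existing state is
the pattern described above.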

The client is not honouring the 2MSL quiet period, the time it should
wait before re-using the same source port to connect to the same
destination address/port, as required by the TCP RFCs.

The reason for that is quite likely that it has run out of random high
source ports. The range used should be about 49152-65535 (sysctl
net.inet.ip.portrange.*), and 10,000 connections is getting close. In
this case, the client stack can either make the application fail in
connect(2), or re-use source ports and violate the RFCs.
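
To put numbers on this, a back-of-the-envelope sketch in Python. The
60 second quiet period is an assumption for illustration only (the real
2MSL value depends on the client's stack), and the port range is the
approximate one mentioned above.

```python
# Approximate ephemeral port range from the text above.
FIRST_PORT = 49152
LAST_PORT = 65535

# ASSUMPTION: a 2MSL quiet period of 60 seconds; check the client stack.
QUIET_SECONDS = 60


def max_sustained_rate(ports, quiet_seconds):
    """Highest connection rate (per second) to ONE destination address/port
    that never re-uses a source port within the quiet period."""
    return ports / quiet_seconds


ports = LAST_PORT - FIRST_PORT + 1
rate = max_sustained_rate(ports, QUIET_SECONDS)

print(ports)           # 16384
print(round(rate, 1))  # 273.1
```

So a single client opening 10,000 connections to the same server port
within a short time uses up well over half of that range, and beyond
roughly the rate printed above it has to either fail or re-use a port
early.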

Not sure if this is a realistic test, i.e. whether you see the very same
problem in production (with 'BAD state' messages for SYN packets). It
would only occur if one client is establishing connections to the same
server port at high concurrency and/or rate. If not, I'd say the test is
simply flawed, and you need multiple clients to simulate realistically.

pf keeps state entries around for a while after a connection has been
closed (to catch packets related to the old connection that might arrive
late). The timeout is tcp.closed, 90s by default. You can make pf purge
such state entries sooner by lowering this timeout.
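
The effect of that timeout can be shown with a toy model in Python. This
is not how pf is implemented (real pf also tracks sequence windows and
more); the class StateTable and its methods are hypothetical and only
model "a closed entry lingers until the timeout expires".

```python
CLOSED_TIMEOUT = 90  # seconds, the tcp.closed default mentioned above


class StateTable:
    """Toy model: remembers closed connections until a timeout expires."""

    def __init__(self, closed_timeout=CLOSED_TIMEOUT):
        self.closed_timeout = closed_timeout
        self.closed_at = {}  # connection key -> time it was closed

    def close(self, key, now):
        self.closed_at[key] = now

    def purge(self, now):
        # Drop every entry whose timeout has elapsed.
        expired = [k for k, t in self.closed_at.items()
                   if now - t >= self.closed_timeout]
        for k in expired:
            del self.closed_at[k]

    def syn_hits_stale_state(self, key, now):
        """True if a new SYN for `key` would still match an old entry."""
        self.purge(now)
        return key in self.closed_at


# Client address/port and server address/port from the quoted log line.
key = ("10.25.0.100", 52209, "10.25.0.41", 8080)

default_table = StateTable()
default_table.close(key, now=0)
early = default_table.syn_hits_stale_state(key, now=5)
late = default_table.syn_hits_stale_state(key, now=95)

short_table = StateTable(closed_timeout=10)
short_table.close(key, now=0)
after_lowering = short_table.syn_hits_stale_state(key, now=15)

print(early, late, after_lowering)
```

In this model a port re-used 5 seconds after the close still collides
with the old entry, while with a 10 second timeout the entry is already
gone by the 15 second mark.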

This most likely has nothing to do with rdr and load-balancing. The
difference between enabling and disabling your rdr rule is basically
that of filtering statefully vs. statelessly. Your 'pass all' rule does
not create state, while the rdr will automatically create state.

Daniel
