Date:      Thu, 14 Jan 2010 14:46:35 -0800
From:      Erik Klavon <erikk@berkeley.edu>
To:        freebsd-net@freebsd.org
Subject:   netgraph mkpeer and connect failures with ng_ipfw and ng_nat
Message-ID:  <20100114224635.GA27139@malcolm.berkeley.edu>

Hi

I have several dual processor, single core amd64 machines running
8.0p1. These machines use netgraph to implement one to one NAT with
one ng_nat(4) node for each network client behind them. ipfw(8) rules
direct traffic to netgraph nodes as needed based on table entries
using an ng_ipfw(4) node. As clients join and leave the network,
scripts dynamically create and delete NAT sessions by making calls to
ipfw(8), ngctl(8) and arp(8) to create and delete ng_nat(4) nodes,
ipfw(8) table and published ARP entries. This scheme has worked well
so far in all places we're using it except one.

In the problematic case, immediately following boot up, creation and
deletion of sessions work consistently. After a couple of hours, long
enough for roughly 100 session creations and deletions, creation and
deletion of sessions will occasionally fail during calls to
ngctl(8). In all other cases, creation and deletion of sessions work
consistently. The rate of session change appears to be greater in the
problematic case, though the total number of active sessions when
session creation and deletion start to fail is below the total number
of sessions we see in non-problematic cases. I have swapped out the
machine in the problematic case with another machine with an identical
configuration. This did not result in a change in behavior.

Here are the calls to ngctl(8) we use to create a one to one NAT
session, with the first two octets in the globally routable IPv4
addresses replaced with an "x".

ngctl mkpeer ipfw: nat 202182094 out
ngctl name ipfw:202182094 WirelessNAT2182094
ngctl connect ipfw: WirelessNAT2182094: 102182094 in
ngctl msg WirelessNAT2182094: setaliasaddr x.x.182.94
ngctl msg WirelessNAT2182094: redirectaddr \
'{'"local_addr=10.10.115.242" "alias_addr=x.x.182.94" \
'description="Static NAT"' '}'

For the identifiers above, we represent the globally routable IPv4
address by an integer such as 2182094. The first digit of this integer
represents the first two octets in the address (all addresses used
have the same first two octets). The remaining digits are the third
and fourth octets, padded with leading zeros as needed. We generate the
hook names by prepending a two digit int representing the direction
(20 out, 10 in) to the integer representing the globally routable IPv4
address. This guarantees unique identifiers for all hooks and ng_nat(4)
node names. We use multiple checks in the script that calls ngctl(8) to
ensure that each globally routable address is used in exactly one
session. We use a lock around the above calls to prevent two session
creation attempts from interleaving their ngctl(8) calls.
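
To make the naming scheme concrete, here is a minimal sketch of how the
identifiers could be derived. The function name nat_ids and the way the
leading digit is passed in are assumptions for illustration, not the
actual script. Octets are expected in plain decimal without leading
zeros.

```shell
# Sketch only: derive the integer id, hook names, and node name for a session.
# $1: single digit standing for the first two octets (site-specific mapping,
#     assumed here to be supplied by the caller)
# $2: third octet, $3: fourth octet (decimal, no leading zeros)
nat_ids() {
    # Zero-pad the third and fourth octets to three digits each.
    id=$(printf '%s%03d%03d' "$1" "$2" "$3")
    echo "id=$id"
    echo "out_hook=20$id"       # 20 = out direction
    echo "in_hook=10$id"        # 10 = in direction
    echo "node=WirelessNAT$id"
}

nat_ids 2 182 94
```

For x.x.182.94 this yields the id 2182094, the hooks 202182094 and
102182094, and the node name WirelessNAT2182094 used in the calls above.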

Here is the call to ngctl(8) we use to delete a one to one NAT
session. The lock around this call is the same used for creating a
session.

ngctl shutdown ipfw:202182094
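
The locking around session creation and deletion can be sketched roughly
as follows. This is an illustration under assumptions, not the actual
script: the lock path, the function name with_session_lock, and the
choice of a mkdir-based lock are all hypothetical.

```shell
# Sketch only: serialize ngctl call sequences with a directory lock.
# LOCKDIR is a hypothetical path; any directory name on a local filesystem works.
LOCKDIR=${LOCKDIR:-/tmp/nat-session.lock}

with_session_lock() {
    # mkdir either creates the directory or fails, so only one caller
    # can hold the lock at a time. Others wait and retry.
    until mkdir "$LOCKDIR" 2>/dev/null; do
        sleep 1
    done
    lock_rc=0
    "$@" || lock_rc=$?          # run the wrapped command, remember its status
    rmdir "$LOCKDIR"            # always release the lock
    return "$lock_rc"
}

# Example (not executed here):
#   with_session_lock ngctl shutdown ipfw:202182094
```

A session creation would wrap the whole sequence of ngctl(8) calls in
one function and pass that function to with_session_lock, so that no
other creation or deletion can run between the mkpeer and the connect.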

The first error leading to a session creation or deletion failure
after booting I've observed is in the mkpeer call. Here is an example
from yesterday, after booting the machine at roughly 0500.

Jan 13 08:44:42 nac[11721]: ngctl: send msg: File exists
Jan 13 08:44:42 nac[11722]: warning: Unsuccessful execution: /usr/sbin/ngctl mkpeer ipfw: nat 202182094 out

This mkpeer call is the first time since boot that the globally
routable IP address x.x.182.94 was used and the first attempt to
create a hook named 202182094. Output from vmstat(8) -z captured every
minute shows no incrementing of the allocation failure
counters from well before 0844 to 0848. Reducing the number of
netgraph threads (from 2 to 1) via the sysctl net.graph.threads has
not had an effect on this problem. I cannot think of any obvious
reasons why this mkpeer call might have failed. 102 mkpeer calls prior
to it did not.

Even when a mkpeer call succeeds, the subsequent connect call in the
above session creation sequence sometimes fails, also with a "File
exists" error. The next failure after the above example from yesterday
is an example of this problem.

Jan 13 08:51:45 nac[51209]: ngctl: send msg: File exists
Jan 13 08:51:45 nac[51212]: warning: Unsuccessful execution: /usr/sbin/ngctl connect ipfw: WirelessNAT2175205: 102175205 in

Also in this case, this is the first time since boot that the globally
routable IP address x.x.175.205 was used and the first attempt to
create a hook named 102175205. Output from vmstat(8) -z captured every
minute shows no incrementing of the allocation failure
counters from 0849 to 0900. There is an increment of the allocation
failure counter for the 128 Bucket in vmstat(8) -z between 0848 and 0849;
I assume that this is too far away in time from the two above events
to be related. Also, the "File exists" error results from a check
I don't think is related to memory allocation (see below).

This morning prior to a reboot I modified the ngctl(8) calls to include
the -d flag and obtained the following additional information.

Jan 14 08:35:08 nac[3234]: ngctl: sendto(ipfw:): File exists
Jan 14 08:35:08 nac[3234]: ngctl: send msg: File exists
Jan 14 08:35:08 nac[3235]: warning: Unsuccessful execution: /usr/sbin/ngctl -d mkpeer ipfw: nat 202183148 out

Jan 14 08:47:50 nac[94270]: ngctl: sendto(ipfw:): File exists
Jan 14 08:47:50 nac[94270]: ngctl: send msg: File exists
Jan 14 08:47:50 nac[94271]: warning: Unsuccessful execution: /usr/sbin/ngctl -d connect ipfw: WirelessNAT2175111: 102175111 in

This points to ng_ipfw(4) as the source of these errors. I took a very
quick look at the code to come up with the following interpretations;
please let me know if you see any mistakes in the following. It looks
to me like the error string "File exists" is displayed by ngctl(8) when
the EEXIST constant is returned from ng_ipfw(4) when processing the
command issued via ngctl(8). There is one place in ng_ipfw(4) where EEXIST
is returned. This section of code appears to be used only when the
single ng_ipfw(4) node is first created, so I don't think it should be the
source of the problem in this case. This leaves the parts of ng_base
used to handle generic operations. EEXIST is returned by ng_base.c in
the following functions.

ng_newtype (called when a new netgraph type is installed)
ng_add_hook (add an unconnected hook to a node)
ng_con_nodes (connect this node with another node)
ng_con_part2 (make the connection in the opposite direction of ng_con_nodes)

ng_newtype probably isn't the source of the problem in this case,
since we're not working with a new type. The other three functions are
possible candidates for each of the above two error examples as in
both cases hooks are created and connected. Here are the conditional
statements that check for an EEXIST condition. 

ng_add_hook	if (ng_findhook(node, name) != NULL) {
ng_con_nodes	if (ng_findhook(node2, name2) != NULL) {
ng_con_part2	if (ng_findhook(node, NG_HOOK_NAME(hook)) != NULL) {

In each of these cases, it appears that the code is checking to see if
a hook with the argument name exists before creating a hook with that
name. Given the conditions in the above two example errors that the
hook names are unique and no hooks with these same names were
previously created, I can't explain why the above calls to ng_findhook
are returning references to hooks.
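
One way to observe this from userland is to ask, just before mkpeer,
whether the hook name already resolves to a node. The following is a
sketch under assumptions: the function names are hypothetical, and it
relies on "ngctl show" failing when the path through a hook cannot be
resolved.

```shell
# Sketch only: report a hook on the ipfw node that resolves before we create it.
# NGCTL can be overridden, for example to point at a wrapper for testing.
NGCTL=${NGCTL:-/usr/sbin/ngctl}

hook_resolves() {
    # $1: hook name on the ipfw node, e.g. 202182094
    $NGCTL show "ipfw:$1" >/dev/null 2>&1
}

report_stale_hook() {
    if hook_resolves "$1"; then
        echo "hook $1 already resolves to a node before mkpeer:"
        $NGCTL show "ipfw:$1"   # capture which node it points to
        return 1
    fi
    return 0
}
```

Calling report_stale_hook with the out and in hook names at the start of
session creation would log which node an unexpected hook refers to at
the moment of failure.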

Here are the hooks for one ng_nat(4) node of interest. At the time I
obtained this information, this node was working correctly. Everything
in this output looks correct.

sudo ngctl show ipfw:202182138
  Name: WirelessNAT2182138 Type: nat             ID: 000062ee   Num hooks: 2
  Local hook      Peer name       Peer type    Peer ID         Peer hook      
  ----------      ---------       ---------    -------         ---------      
  in              ipfw            ipfw         00005a83        102182138      
  out             ipfw            ipfw         00005a83        202182138      

sudo ngctl msg ipfw:102182138 listredirects
Rec'd response "listredirects" (10) from "[62ee]:":
Args:   { total_count=1 redirects=[ { id=1 local_addr=10.10.118.43 alias_addr=136.152.182.138 proto=259 description="Static NAT" } ] }

Running show on ipfw:102174202, I see that this hook is pointing to
the above ng_nat(4) node WirelessNAT2182138.

sudo ngctl show ipfw:102174202
  Name: WirelessNAT2182138 Type: nat             ID: 000062ee   Num hooks: 2
  Local hook      Peer name       Peer type    Peer ID         Peer hook      
  ----------      ---------       ---------    -------         ---------      
  in              ipfw            ipfw         00005a83        102182138      
  out             ipfw            ipfw         00005a83        202182138      

But WirelessNAT2182138 has no record of a hook named 102174202. Somehow, two
hooks used to refer to what should be two different NAT sessions are
pointing to the same ng_nat(4) object (i.e. one session). If I run
shutdown on ipfw:102174202, WirelessNAT2182138 goes away. Given the
above calls to ngctl(8), I don't know what is causing these two separate
hooks to refer to the same ng_nat(4) object.
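
A check for this kind of mismatch could be automated along the following
lines. This is a sketch under assumptions: it expects rows in the column
layout that "ngctl show" prints above (local hook first, peer name
second), fed on standard input from a show of the ipfw node, and the
function name check_ipfw_hooks is hypothetical.

```shell
# Sketch only: flag ipfw hooks whose id differs from the id in the peer name.
check_ipfw_hooks() {
    awk '
        # Hook rows: field 1 is the numeric hook name (e.g. 102182138),
        # field 2 is the peer node name (e.g. WirelessNAT2182138).
        $1 ~ /^[0-9]+$/ && $2 ~ /^WirelessNAT[0-9]+$/ {
            hook_id = substr($1, 3)     # drop the two digit direction prefix
            peer_id = substr($2, 12)    # drop the "WirelessNAT" prefix
            if (hook_id != peer_id) {
                printf "MISMATCH hook=%s peer=%s\n", $1, $2
                bad = 1
            }
        }
        END { exit bad + 0 }            # non-zero exit if any mismatch was seen
    '
}

# Example (not executed here):
#   ngctl show ipfw: | check_ipfw_hooks
```

Run periodically, this would show how soon after boot the first
mismatched hook appears and which session it belongs to.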

I welcome any suggestions for what other information I should collect
to help debug this problem, ideas for fixing it, or corrections to my
above statements. Thanks in advance,

Erik


