From: Vadim Goncharov
To: freebsd-ipfw@freebsd.org
Cc: freebsd-hackers@freebsd.org
Reply-To: vadim_nuclight@mail.ru
Date: Sun, 23 Mar 2008 18:51:43 +0000 (UTC)
Organization: Nuclear Lightning @ Tomsk, TPU AVTF Hostel
Subject: [HEADS UP!] IPFW Ideas: possible SoC 2008 candidate

Hi!

[Sorry if it is too late for SoC, but I was unexpectedly busy for the last
3 days and couldn't finish this text earlier.]

This is a proposal of ideas for improving ipfw, including architectural
changes.
Some of them are independent of each other and could be implemented in
STABLE without breaking the ABI. Whether or not all of them become a SoC
2008 candidate, they should eventually be implemented in FreeBSD. The only
question is what should be corrected, so please discuss it :)

This text also includes slightly changed and/or generalized ideas from:
http://lists.freebsd.org/pipermail/freebsd-ipfw/2007-April/002931.html

All syntax examples are only meant to give the idea; the actual syntax
should be discussed.

1. Major changes (ABI breakage is necessary).

1.1. Reorganizing dynamic rules.

Description:

The current ipfw dynamic rules are not suitable for several advanced
tricks. For example, it is not possible to use the saved information about
the current state of a connection elsewhere in the firewall rules, nor is
it possible to change that state from the firewall.

Wanted features:

* Ability to create/delete a dynamic rule in any state via some API or ABI
  from all parts of the system: userland, ipfw rules, other kernel modules.
  This can be useful for:

  a) Creating a dynamic rule in the middle of a connection, not only at
     setup:

       ipfw add pipe 1 ip from any to any tagged 412 keep-state-middle

     This allows changing the handling of a connection after some event,
     e.g. L7 filtering by ng_bpf + ng_tag has discovered, by analyzing
     packet payload, that a connection belongs to some class, and from now
     on the connection should be handled directly by dynamic rules and
     never sent to expensive L7 processing again. Currently you can use
     plain "keep-state" for this, but ipfw will not see the SYNs and the
     rule will be subject to sysctl net.inet.ip.fw.dyn_rst_lifetime, so by
     default it expires after 1 second, which is undesirable in many cases.

  b) Ability to save/load dynamic rules in userland via files, e.g., to
     continue after a reboot.

  c) Ability to exchange rule state with another machine running ipfw,
     e.g., two firewalls in a CARP failover.

  d) Creation of a rule with a specified state and parameters before the
     actual connection is established. E.g.
     imagine a by-default-closed firewall with a netgraph(4) module
     analyzing the FTP control connection and giving commands to ipfw to
     open dynamic "holes" for data connections, thus eliminating the
     current practice of opening ports in the entire range 1024-65535
     (insecure, yes). One can think about providing direct exchange between
     libalias(3)'s alias_link and ipfw dynamic rules, but that's a subject
     for further research.

* Additional fields in dynamic rules to keep arbitrary info for a specific
  connection, and opcodes for loading and storing those values from other
  parts of the firewall or elsewhere. This will allow implementing pf(4)'s
  "scrub" maximum TTL enforcement on a connection, but not only that:
  generic data storage allows any future extensions.

* Ability to change a dynamic rule's parent rule "on the fly" (just
  changing the pointer to the static rule whose ACTION_PTR to jump to,
  yes). This will allow the aforementioned distinguishing of a connection's
  packets before/after L7 processing in the case where packets are always
  classified into flows before any processing takes place; the
  "keep-state-middle" example assumed that the main firewall is stateless
  and only L7-matched packets become dynamic. It also allows reassigning
  the action of a dynamic rule:

    ...
    ipfw add 100 check-state
    ...
    ipfw add 200 skipto 500 ip from any to any keep-state
    ...
    ipfw add 500 netgraph 41 ip from any to any
    ipfw add 600 change-parent 800 ip from any to any tagged 412
    ipfw add 700 allow ip from any to any
    ipfw add 800 pipe 1 ip from any to any

* More types in the dynamic rules system, so that it offers not only
  "keep-state" and "limit" but is extensible to something more. E.g.,
  current "limit" rules just drop packets when the limit is reached, but
  the user may want an option to process them with another rule afterwards.
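
To make the "change-parent" and "additional fields" ideas above a bit more
concrete, here is a minimal C sketch. It is only an illustration under
stated assumptions: struct static_rule, struct dyn_rule, dyn_change_parent()
and dyn_action() are hypothetical names invented for this example and are
not the real ipfw kernel structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a static rule (not struct ip_fw). */
struct static_rule {
    uint16_t    rulenum;  /* rule number, as in "ipfw add 800 ..." */
    const char *action;   /* action text, e.g. "pipe 1" */
};

/* Hypothetical dynamic rule: a parent pointer plus a generic data slot. */
struct dyn_rule {
    const struct static_rule *parent; /* rule whose action runs on a match */
    uint32_t                  extra;  /* arbitrary per-connection value */
};

/*
 * Sketch of "change-parent": repoint the dynamic rule at another static
 * rule, so later packets of the flow get that rule's action.
 * Returns the previous parent.
 */
static const struct static_rule *
dyn_change_parent(struct dyn_rule *d, const struct static_rule *new_parent)
{
    assert(d != NULL && new_parent != NULL);
    const struct static_rule *old = d->parent;
    d->parent = new_parent;
    return old;
}

/* The action applied to a matching packet is simply the parent's action. */
static const char *
dyn_action(const struct dyn_rule *d)
{
    assert(d != NULL && d->parent != NULL);
    return d->parent->action;
}
```

In terms of the ruleset above, a flow created by rule 200 would start with
rule 200 as its parent, and rule 600 would call something like
dyn_change_parent() to move it to rule 800 once the packet carries tag 412.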

Possible implementation:

* For arbitrary info: add a union of one uint32_t, two uint16_t's or four
  uint8_t's to each dynamic rule, plus operations to load/store those
  values (or maybe a uint64_t and two uint32_t's and so on?..). Also add
  one void* to allow storing more data if one needs to.

* Make a special netgraph node (or extend ng_ipfw) which will broadcast
  every change in dynamic rules to all its hooks (how many changes to
  bundle into one mbuf should be customizable). Every input containing
  structs of the same format will result in the addition or deletion of
  dynamic rules in ipfw. Working through a netgraph node provides a
  flexible and extensible way to manipulate dynamic rules:

  - you can connect protocol trackers to it, which will insert rules for
    secondary connections (e.g. FTP);
  - you can connect a userland tool to it, which will log all dynamic rule
    changes or load/save rules to/from a file;
  - you can connect an ng_ksocket(4) node to it, with UDP to broadcast to
    someone or TCP to connect to another machine with the same setup, to
    provide CARP failover.

  Note that the node should not do delivery/retransmission checks as
  pfsync(4) does, because this is a task for someone else (to keep
  modularity), but two such nodes on different machines connected to each
  other should provide automatic rule synchronization without additional
  actions after the initial setup.

1.2. Userland (and other subsystems) interaction, modularity, rulesets.

Description:

Currently /sbin/ipfw (ipfw2) uses a custom-made parser and communicates
with the kernel via setsockopt() calls. It is sometimes hard to extend with
new features due to complex code. Using a socket instead of a /dev entry
means you always need to be root (uid 0) both to read the firewall
configuration and to change it. The in-kernel protocol is also sometimes
hard to extend, while some additional whole-ruleset features would be
useful.

Wanted features:

* The parser's code (sbin/ipfw/ipfw2.c) should be reviewed and possibly
  rewritten using lex(1)/yacc(1).
  The syntax is complicated, however, and it may not be possible to
  implement all of it exactly. This should be further investigated.

* It may be desirable to give some other user the ability to at least read
  the config, and maybe to write it, as /dev/bpf* permissions allow for
  tcpdump(1).

* A device entry could also improve modularity: currently, to add a new
  IP_FW_* socket option you have to modify netinet/raw_ip.c, which means
  you can't just recompile /sbin/ipfw and the ipfw kernel module.

* The same applies to other ipfw-related facilities: dummynet, divert, NAT.
  It would be good to make them configurable by some means other than
  tweaking raw_ip.c. It could be useful to separate dummynet and divert
  into their own facilities, to be able to use them without ipfw, e.g.,
  from netgraph(4). Related to this is a problem with IPSEC interaction:
  if you use it with divert(4) on output, then on return from divert the
  packets will be IPSEC'ed again, because in ip_output() IPSEC is called
  before pfil(9). It could be useful to add an option for the user (in
  addition to the existing behaviour, to not break POLA) to call IPSEC
  processing from a specified place in the ruleset, just like all other
  actions:

    ipfw add ipsec ip from any to any out

* As a patch for using rule counters is currently being discussed on ipfw@,
  it would be useful to add the ability to set rule counters to arbitrary
  values rather than providing only the "zero" action. This is closely
  related to an option to restore ipfw's static ruleset without losing
  counter values. Currently you can save "ipfw list" to a file, run
  "awk '{print "add " $0}'" on it and then load it again (e.g. after a
  reboot). It must be possible to do the same with "ipfw show".
  Syntax example for providing counters with "ipfw add"; all cases are
  distinguishable (the current syntax allows only the first two):

    ipfw add allow ip from any to any                # select next rule number
    ipfw add 100 allow ip from any to any            # exact rule number specified
    ipfw add 1234 76845 allow ip from any to any     # counters without rulenum
    ipfw add 100 1234 76845 allow ip from any to any # rulenum and counters

* Static ruleset loading and saving is closely related to ruleset
  precompilation and atomic commits. Imagine a ruleset with thousands of
  rules: if a packet arrives in the middle of a ruleset update, strange
  effects can occur. Of course, you can achieve the same result with sets,
  by loading the new set disabled and atomically swapping the sets later,
  but that is not always convenient. Precompilation of the whole ruleset
  followed by atomically installing it ("transaction commit") requires an
  implementation which will also allow saving and loading a precompiled
  ruleset in binary form. That is good for routers, where a 20K-rule script
  can take several minutes to process.

* Precompiled binary rules can also be used to install the same rules from
  other kernel subsystems and from other machines (CARP again). Thus, a
  generic binary rule format/protocol (not only for /dev) might be
  invented. Moreover, the compiled ruleset format may differ from the
  current linked list, which has two disadvantages: the initial "skipto"
  (and the planned "call/return", see next section) has to search the
  list, and rules in disabled sets are still traversed. A precompiled,
  opcodes-only form allows quick jumps and easy cross-rule optimizations
  (and even the possibility of compiling ipfw opcodes to machine code,
  like BPF_JITTER for bpf(4), for more speed). It has the disadvantages of
  keeping rule counters separately and of a not-so-transparent need for
  the user to recompile every time, so it should be further investigated.

* About several rulesets for different interfaces (or hacks like a
  per-interface setting of the rule number to jump to): I think that this
  is unnecessary and unfriendly to the user. Having one ruleset is simpler,
  and you usually need common checks on packets. So "commit" of precompiled
  rules, "call/return" actions (see next section) and stack virtualization
  via "vimage" should serve all practical purposes.

Possible implementation:

The general view is clear from the feature descriptions. One can also think
about a netgraph(4) node for this (again) and/or something like shared
memory pages between kernel and userland, to avoid allocating kernel memory
twice for big rulesets.

2. Independent (minor) changes, which are possible without ABI breakage.

2.1. call/return rule actions.

Description of feature:

A "skipto" rule is known as a useful tool to optimize packet flow through
the ruleset, and it can also assign several actions to a dynamic rule
(because a dynamic rule on match simply jumps to the action part of its
parent rule). But it can only jump forward, not backwards, for the same
reason as bpf(4) assembler instructions: to prevent infinite loops in
packet flow, which would hang the machine's network operations. This can be
addressed by introducing a pair of instructions, call and return, which
remember the position to return to in a stack of some kind. Because a
return always goes to the rule following the calling one (by number, as
with divert/skipto), infinite loops are guaranteed not to occur; even if
one rule is called many times, processing simply proceeds to the next rule
after a stack overflow. Thus the call/return pair allows organizing some
kind of subroutines, with the trick that specifying an actual rule number
lets you jump into the middle of a subroutine, as in assembly language:

    ipfw add 100 call 600 ip from any to any in recv $internal
    ipfw add 100 call 700 ip from any to any in recv $external
    ...
    ipfw add 500 allow ip from any to any
    ipfw add 600 deny ip from any to any not antispoof
    ipfw add 700 deny tcp from any to any 135,445
    ipfw add 900 return // for both those calls

It should be noted again that calls and returns work by rule numbers, so in
the following example the return from the first "call 700" will pass
control to rule 301, not to the second rule 300.

    ipfw add 300 call 700 ...
    ipfw add 300 call 800 ...
    ipfw add 301 count ip ...

Allowing "tablearg" in "call" would be very useful. The parser should allow
both versions of "return": with conditions (the usual rule body) and
without them (like "check-state").

Possible implementation:

Relatively easy. On the first "call" for an mbuf, allocate an mbuf tag
holding a stack of uint16_t rule numbers and a stack top pointer. The only
things to take care of are divert etc. calls and distinguishing the input
and output passes (the firewall can be called several times for each), so
stack underflow and overflow should be carefully analyzed. Maybe there
should be two tag types, one for input and one for output.

It is difficult, however, to get this performing well, because of the
linked-list nature of the ruleset and the inability to cache a pointer to
the jump destination, as is currently done with "skipto", because there can
be several return locations (and even tablearg). One possible solution is
to keep a cache of, say, 256 entry points into the list (indexed by
rulenum / 256) to shorten the search from that point (effectively
equivalent to a hash on rulenum). Another is to have compiled rulesets
where the offset to jump to is easily calculated (see previous section).

2.2. Tables and tableargs.

Tables are a very powerful way both to increase processing speed and to
conveniently reduce the rule maintenance cost for the user, especially with
tableargs. Tables, however, are currently limited to IPv4 addresses/masks
as keys and uint32_t's as values.
Table keys should be extended to other data types: IPv6 addresses and
interface name strings:

    ipfw add allow ip from any to table6(1) in recv stringtable(2)

or

    ipfw add call tablearg ip from any to any via stringtable(3)

The latter will be very handy for routers with e.g. 2000 VLAN or ng*
interfaces, with a separate client and rules for each.

Tableargs should also be expanded to 16 bytes, to be able to store an IPv6
address or a uint64_t, e.g. for checking against rule counters. It is
questionable whether tableargs could also be short strings (shorter than
16 bytes) like interface names.

Due to the implementation difficulty of distinguishing whether an action
parameter is a valid value or a tablearg (you usually have only one invalid
value out of 65536, which gets assigned as the tablearg indicator), I
suggest adding operations like "settablearg" which will set the tablearg
without an actual table being used, e.g., from arbitrary info saved in
dynamic rules (see section 1.1) or even from the packet header. So values
for a "computed goto" or something like registers would still be used via
tablearg (just generalizing the definition of a table), or at least this
should be so at the opcode level. The user could be presented with some
other keyword, but I don't see any point in hiding these details.

The number of tables of all types should be configurable via sysctl or at
least a loader tunable, rather than the current hardcoded number (128).

2.3. Time limit counter.

An opcode for a token bucket and/or leaky bucket should be introduced. It
will have one counter changed by a timer and another changed by actual
packets. We currently have the O_LOG opcode, which looks similar to this,
but O_LOG has nothing to do with a timer. The proposed opcode must be
useful at least for limiting the number of connections per second, but any
other possible use is appreciated, from the simplest shaping without
dummynet to more exotic ones like counter "price" coefficients, allowing an
in-kernel billing to be built solely on ipfw counters.
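
As a rough illustration of the two-sided counter described above (one side
driven by a timer, the other by packets), here is a minimal token bucket
sketch in C. All names (struct token_bucket, tb_tick(), tb_consume()) are
hypothetical and chosen for this example only; this is not an existing ipfw
opcode or kernel API, and locking is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token bucket state; assumes tokens <= burst at all times. */
struct token_bucket {
    uint32_t tokens; /* tokens currently available */
    uint32_t burst;  /* bucket capacity */
    uint32_t rate;   /* tokens added per timer tick */
};

/* Timer side: called once per tick, refills up to the capacity. */
static void
tb_tick(struct token_bucket *tb)
{
    assert(tb != NULL);
    uint32_t room = tb->burst - tb->tokens;
    tb->tokens += (tb->rate < room) ? tb->rate : room;
}

/*
 * Packet side: try to take "cost" tokens (a per-packet "price").
 * Returns 1 if the packet fits within the limit, 0 if the limit is hit.
 */
static int
tb_consume(struct token_bucket *tb, uint32_t cost)
{
    assert(tb != NULL);
    if (tb->tokens < cost)
        return 0;
    tb->tokens -= cost;
    return 1;
}
```

For a connections-per-second limit, a rule matching TCP setup packets would
call something like tb_consume(&tb, 1) per packet, with rate and burst set
to the allowed number of new connections per tick.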

It is questionable where the counter values should be stored, given locking
optimization: directly in the opcode, as with O_LOG, or in a separately
addressable space like tables.

2.4. Action rules and parameters.

Change ACTION_PTR handling in the kernel, and its preparation in the
compiler, to allow actions and their parameters to be placed in any order
(except for opcodes where order is required, e.g. prob). This would easily
allow placing several opcodes of the same type in the action part, e.g.:

    ipfw add count tag 1 tag 2 tag 3 ip from any to any

and using actions and their parameters interchangeably, like having a rule
without an actual action opcode (only a parameter instead), e.g. using
"tag" or "altq" as an action too (equivalent to "count").

2.5. Just to mention: modip, counter limits, fragments.

These patches are already being discussed on ipfw@, but are included here
just so they are not forgotten. They are a "modip" action, allowing
modification of the IP header (DSCP, ToS, TTL), with corresponding match
rule options, and a rule option to match when rule counters are less than a
specified number of packets or bytes (possibly taken from a dynamic rule's
counters), maybe a tablearg. This is also related to the ability to control
rule counters mentioned in section 1.2.

Adding a few keywords to O_FRAG for finer fragment matching (not only
non-first fragments), e.g. for sending to a specialized netgraph(4)
reassembly module, is also desirable.

That's all for today. Any comments, additions, corrections are welcome!

-- 
WBR, Vadim Goncharov. ICQ#166852181
mailto:vadim_nuclight@mail.ru
[Moderator of RU.ANTI-ECOLOGY][FreeBSD][http://antigreen.org][LJ:/nuclight]