From owner-freebsd-ipfw@FreeBSD.ORG Tue Nov 13 19:30:44 2012
Message-ID: <50A29F57.6090701@yandex-team.ru>
Date: Tue, 13 Nov 2012 23:28:23 +0400
From: "Alexander V. Chernikov"
MIME-Version: 1.0
To: freebsd-ipfw@freebsd.org
Subject: [CFT] ipfw SMP-ready dynamic states
Content-Type: multipart/mixed; boundary="------------000406010204050104020709"
Cc: "freebsd-net@freebsd.org", Luigi Rizzo

This is a multi-part message in MIME format.
--------------000406010204050104020709
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Hello list!

Currently most ipfw operations on dynamic states (keep-state,
check-state, limit) are serialized via IPFW_DYN_LOCK(), which is a
per-vnet mutex. As a result, performance is limited to the same
~650kpps as in routing (in several cases).

The patch changes the following:
* the global lock is replaced by per-bucket mutexes
* state expiration is done from the ipfw_tick callout (renamed
  ipfw_dyn_tick in the patch) every 1s; no expiration is done on the
  forwarding path
* the hash table is resized automatically, and a resize no longer
  causes all states to be lost

The only (architectural) problem I see is the unlocked V_dyn_count
increments. We can do one of the following:
1) lock increments/decrements via a separate mutex
2) do nothing
3) take a combined approach: generally, we don't need the value to be
   _exact_, so we count the total number of states in every ipfw_tick
   run and set V_dyn_count to the new value. New states still
   increment V_dyn_count unlocked.

Performance (synthetic traffic):
* ipfw with a single "allow ip from any to any" rule: 2.4 Mpps
* ipfw with a single "keep-state ip from any to any" rule: 2.2 Mpps

More tests are needed (large numbers of states, different types of
traffic, etc.); maybe I can run some next week.

You need to run a recent -current or merge r242631 and r242834 before
applying this patch.
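
To make option 3 above more concrete, here is a minimal userspace
sketch of per-bucket locking combined with an approximate counter that
the periodic tick recounts. It is an illustration only, not code from
the patch: every name in it (struct state, struct bucket, state_add,
table_tick, state_count) is hypothetical, a toy C11 atomic_flag
spinlock stands in for the kernel mutex, and the counter is a relaxed
atomic because a plain unlocked increment would be a data race in
portable C.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* All names here are hypothetical stand-ins, not identifiers from the patch. */
#define NBUCKETS 256			/* power of 2, so hashing can mask */

struct state {
	uint32_t key;			/* stands in for the flow id hash */
	long expire;			/* absolute expiry time, seconds */
	struct state *next;
};

struct bucket {
	atomic_flag lock;		/* toy spinlock instead of a kernel mutex */
	struct state *head;
};

static struct bucket table[NBUCKETS];
static atomic_uint state_count;		/* approximate, like V_dyn_count */

static void bucket_lock(struct bucket *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;			/* spin until the holder clears the flag */
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static void table_init(void)
{
	assert((NBUCKETS & (NBUCKETS - 1)) == 0);
	for (int i = 0; i < NBUCKETS; i++) {
		atomic_flag_clear(&table[i].lock);
		table[i].head = NULL;
	}
	atomic_store(&state_count, 0);
}

/* Forwarding path: lock one bucket only, never expire anything here. */
static int state_add(uint32_t key, long expire)
{
	struct bucket *b = &table[key & (NBUCKETS - 1)];
	struct state *s = calloc(1, sizeof(*s));

	if (s == NULL)
		return -1;
	s->key = key;
	s->expire = expire;
	bucket_lock(b);
	s->next = b->head;
	b->head = s;
	bucket_unlock(b);
	/* Counter is bumped outside any bucket lock; the tick corrects drift. */
	atomic_fetch_add_explicit(&state_count, 1, memory_order_relaxed);
	return 0;
}

/* Periodic tick: free expired states, recount, overwrite the counter. */
static unsigned table_tick(long now)
{
	unsigned total = 0;

	for (int i = 0; i < NBUCKETS; i++) {
		struct bucket *b = &table[i];
		struct state **pp;

		bucket_lock(b);
		pp = &b->head;
		while (*pp != NULL) {
			struct state *s = *pp;

			if (s->expire <= now) {
				*pp = s->next;	/* unlink, then free */
				free(s);
			} else {
				total++;
				pp = &s->next;
			}
		}
		bucket_unlock(b);
	}
	atomic_store_explicit(&state_count, total, memory_order_relaxed);
	return total;
}
```

Because buckets are locked one at a time, the value stored by
table_tick() can already be slightly stale when it is written, which
is the same trade-off described for V_dyn_count. The sketch frees
expired states while holding the bucket lock for brevity; the patch
instead queues them and frees them after the locks are dropped.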
--------------000406010204050104020709 Content-Type: text/plain; charset=UTF-8; name="ipfw_keepstate.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="ipfw_keepstate.diff" Index: sys/netpfil/ipfw/ip_fw_sockopt.c =================================================================== --- sys/netpfil/ipfw/ip_fw_sockopt.c (revision 242524) +++ sys/netpfil/ipfw/ip_fw_sockopt.c (working copy) @@ -382,7 +382,7 @@ del_entry(struct ip_fw_chain *chain, uint32_t arg) continue; l = RULESIZE(rule); chain->static_len -= l; - ipfw_remove_dyn_children(rule); + ipfw_expire_dyn_rules(chain, rule, RESVD_SET); rule->x_next = chain->reap; chain->reap = rule; } @@ -925,7 +925,7 @@ ipfw_getrules(struct ip_fw_chain *chain, void *buf dst->timestamp += boot_seconds; bp += l; } - ipfw_get_dynamic(&bp, ep); /* protected by the dynamic lock */ + ipfw_get_dynamic(chain, &bp, ep); /* protected by the dynamic lock */ return (bp - (char *)buf); } Index: sys/netpfil/ipfw/ip_fw_private.h =================================================================== --- sys/netpfil/ipfw/ip_fw_private.h (revision 242632) +++ sys/netpfil/ipfw/ip_fw_private.h (working copy) @@ -175,7 +175,9 @@ enum { /* result for matching dynamic rules */ * and only to release the result of lookup_dyn_rule(). * Eventually we may implement it with a callback on the function. 
*/ -void ipfw_dyn_unlock(void); +struct ip_fw_chain; +void ipfw_expire_dyn_rules(struct ip_fw_chain *, struct ip_fw *, int); +void ipfw_dyn_unlock(ipfw_dyn_rule *q); struct tcphdr; struct mbuf *ipfw_send_pkt(struct mbuf *, struct ipfw_flow_id *, @@ -185,11 +187,11 @@ int ipfw_install_state(struct ip_fw *rule, ipfw_in ipfw_dyn_rule *ipfw_lookup_dyn_rule(struct ipfw_flow_id *pkt, int *match_direction, struct tcphdr *tcp); void ipfw_remove_dyn_children(struct ip_fw *rule); -void ipfw_get_dynamic(char **bp, const char *ep); +void ipfw_get_dynamic(struct ip_fw_chain *chain, char **bp, const char *ep); void ipfw_dyn_attach(void); /* uma_zcreate .... */ void ipfw_dyn_detach(void); /* uma_zdestroy ... */ -void ipfw_dyn_init(void); /* per-vnet initialization */ +void ipfw_dyn_init(struct ip_fw_chain *); /* per-vnet initialization */ void ipfw_dyn_uninit(int); /* per-vnet deinitialization */ int ipfw_dyn_len(void); @@ -259,6 +261,10 @@ struct sockopt; /* used by tcp_var.h */ #define IPFW_WLOCK(p) rw_wlock(&(p)->rwmtx) #define IPFW_WUNLOCK(p) rw_wunlock(&(p)->rwmtx) +#define IPFW_UH_LOCK_ASSERT(_chain) rw_assert(&(_chain)->uh_lock, RA_LOCKED) +#define IPFW_UH_RLOCK_ASSERT(_chain) rw_assert(&(_chain)->uh_lock, RA_RLOCKED) +#define IPFW_UH_WLOCK_ASSERT(_chain) rw_assert(&(_chain)->uh_lock, RA_WLOCKED) + #define IPFW_UH_RLOCK(p) rw_rlock(&(p)->uh_lock) #define IPFW_UH_RUNLOCK(p) rw_runlock(&(p)->uh_lock) #define IPFW_UH_WLOCK(p) rw_wlock(&(p)->uh_lock) Index: sys/netpfil/ipfw/ip_fw_dynamic.c =================================================================== --- sys/netpfil/ipfw/ip_fw_dynamic.c (revision 242834) +++ sys/netpfil/ipfw/ip_fw_dynamic.c (working copy) @@ -111,38 +111,33 @@ __FBSDID("$FreeBSD$"); * passes through the firewall. XXX check the latter!!! 
*/ +struct ipfw_dyn_bucket { + struct mtx mtx; /* Bucket protecting lock */ + ipfw_dyn_rule *head; /* Pointer to first rule */ +}; + /* * Static variables followed by global ones */ -static VNET_DEFINE(ipfw_dyn_rule **, ipfw_dyn_v); -static VNET_DEFINE(u_int32_t, dyn_buckets); +static VNET_DEFINE(struct ipfw_dyn_bucket *, ipfw_dyn_v); +static VNET_DEFINE(u_int32_t, dyn_buckets_max); static VNET_DEFINE(u_int32_t, curr_dyn_buckets); static VNET_DEFINE(struct callout, ipfw_timeout); #define V_ipfw_dyn_v VNET(ipfw_dyn_v) -#define V_dyn_buckets VNET(dyn_buckets) +#define V_dyn_buckets_max VNET(dyn_buckets_max) #define V_curr_dyn_buckets VNET(curr_dyn_buckets) #define V_ipfw_timeout VNET(ipfw_timeout) static uma_zone_t ipfw_dyn_rule_zone; -#ifndef __FreeBSD__ -DEFINE_SPINLOCK(ipfw_dyn_mtx); -#else -static struct mtx ipfw_dyn_mtx; /* mutex guarding dynamic rules */ -#endif -#define IPFW_DYN_LOCK_INIT() \ - mtx_init(&ipfw_dyn_mtx, "IPFW dynamic rules", NULL, MTX_DEF) -#define IPFW_DYN_LOCK_DESTROY() mtx_destroy(&ipfw_dyn_mtx) -#define IPFW_DYN_LOCK() mtx_lock(&ipfw_dyn_mtx) -#define IPFW_DYN_UNLOCK() mtx_unlock(&ipfw_dyn_mtx) -#define IPFW_DYN_LOCK_ASSERT() mtx_assert(&ipfw_dyn_mtx, MA_OWNED) +#define IPFW_BUCK_LOCK_INIT(b) \ + mtx_init(&(b)->mtx, "IPFW dynamic bucket", NULL, MTX_DEF) +#define IPFW_BUCK_LOCK_DESTROY(b) \ + mtx_destroy(&(b)->mtx) +#define IPFW_BUCK_LOCK(i) mtx_lock(&V_ipfw_dyn_v[(i)].mtx) +#define IPFW_BUCK_UNLOCK(i) mtx_unlock(&V_ipfw_dyn_v[(i)].mtx) +#define IPFW_BUCK_ASSERT(i) mtx_assert(&V_ipfw_dyn_v[(i)].mtx, MA_OWNED) -void -ipfw_dyn_unlock(void) -{ - IPFW_DYN_UNLOCK(); -} - /* * Timeouts for various events in handing dynamic rules. 
*/ @@ -171,10 +166,12 @@ static VNET_DEFINE(u_int32_t, dyn_short_lifetime); static VNET_DEFINE(u_int32_t, dyn_keepalive_interval); static VNET_DEFINE(u_int32_t, dyn_keepalive_period); static VNET_DEFINE(u_int32_t, dyn_keepalive); +static VNET_DEFINE(time_t, dyn_keepalive_last); #define V_dyn_keepalive_interval VNET(dyn_keepalive_interval) #define V_dyn_keepalive_period VNET(dyn_keepalive_period) #define V_dyn_keepalive VNET(dyn_keepalive) +#define V_dyn_keepalive_last VNET(dyn_keepalive_last) static VNET_DEFINE(u_int32_t, dyn_count); /* # of dynamic rules */ static VNET_DEFINE(u_int32_t, dyn_max); /* max # of dynamic rules */ @@ -182,14 +179,17 @@ static VNET_DEFINE(u_int32_t, dyn_max); /* max # #define V_dyn_count VNET(dyn_count) #define V_dyn_max VNET(dyn_max) +static void ipfw_dyn_tick(void *vnetx); +static void check_dyn_rules(struct ip_fw_chain *, struct ip_fw *, + int, int, int); #ifdef SYSCTL_NODE SYSBEGIN(f2) SYSCTL_DECL(_net_inet_ip_fw); SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, dyn_buckets, - CTLFLAG_RW, &VNET_NAME(dyn_buckets), 0, - "Number of dyn. buckets"); + CTLFLAG_RW, &VNET_NAME(dyn_buckets_max), 0, + "Max number of dyn. buckets"); SYSCTL_VNET_UINT(_net_inet_ip_fw, OID_AUTO, curr_dyn_buckets, CTLFLAG_RD, &VNET_NAME(curr_dyn_buckets), 0, "Current Number of dyn. buckets"); @@ -244,7 +244,7 @@ hash_packet6(struct ipfw_flow_id *id) * and we want to find both in the same bucket. */ static __inline int -hash_packet(struct ipfw_flow_id *id) +hash_packet(struct ipfw_flow_id *id, int buckets) { u_int32_t i; @@ -254,7 +254,7 @@ static __inline int else #endif /* INET6 */ i = (id->dst_ip) ^ (id->src_ip) ^ (id->dst_port) ^ (id->src_port); - i &= (V_curr_dyn_buckets - 1); + i &= (buckets - 1); return i; } @@ -292,118 +292,13 @@ print_dyn_rule_flags(struct ipfw_flow_id *id, int #define print_dyn_rule(id, dtype, prefix, postfix) \ print_dyn_rule_flags(id, dtype, LOG_DEBUG, prefix, postfix) -/** - * unlink a dynamic rule from a chain. 
prev is a pointer to - * the previous one, q is a pointer to the rule to delete, - * head is a pointer to the head of the queue. - * Modifies q and potentially also head. - */ -#define UNLINK_DYN_RULE(prev, head, q) { \ - ipfw_dyn_rule *old_q = q; \ - \ - /* remove a refcount to the parent */ \ - if (q->dyn_type == O_LIMIT) \ - q->parent->count--; \ - V_dyn_count--; \ - DEB(print_dyn_rule(&q->id, q->dyn_type, "unlink entry", "left");) \ - if (prev != NULL) \ - prev->next = q = q->next; \ - else \ - head = q = q->next; \ - uma_zfree(ipfw_dyn_rule_zone, old_q); } - #define TIME_LEQ(a,b) ((int)((a)-(b)) <= 0) -/** - * Remove dynamic rules pointing to "rule", or all of them if rule == NULL. - * - * If keep_me == NULL, rules are deleted even if not expired, - * otherwise only expired rules are removed. - * - * The value of the second parameter is also used to point to identify - * a rule we absolutely do not want to remove (e.g. because we are - * holding a reference to it -- this is the case with O_LIMIT_PARENT - * rules). The pointer is only used for comparison, so any non-null - * value will do. - */ -static void -remove_dyn_rule(struct ip_fw *rule, ipfw_dyn_rule *keep_me) -{ - static u_int32_t last_remove = 0; - -#define FORCE (keep_me == NULL) - - ipfw_dyn_rule *prev, *q; - int i, pass = 0, max_pass = 0; - - IPFW_DYN_LOCK_ASSERT(); - - if (V_ipfw_dyn_v == NULL || V_dyn_count == 0) - return; - /* do not expire more than once per second, it is useless */ - if (!FORCE && last_remove == time_uptime) - return; - last_remove = time_uptime; - - /* - * because O_LIMIT refer to parent rules, during the first pass only - * remove child and mark any pending LIMIT_PARENT, and remove - * them in a second pass. - */ -next_pass: - for (i = 0 ; i < V_curr_dyn_buckets ; i++) { - for (prev=NULL, q = V_ipfw_dyn_v[i] ; q ; ) { - /* - * Logic can become complex here, so we split tests. 
- */ - if (q == keep_me) - goto next; - if (rule != NULL && rule != q->rule) - goto next; /* not the one we are looking for */ - if (q->dyn_type == O_LIMIT_PARENT) { - /* - * handle parent in the second pass, - * record we need one. - */ - max_pass = 1; - if (pass == 0) - goto next; - if (FORCE && q->count != 0 ) { - /* XXX should not happen! */ - printf("ipfw: OUCH! cannot remove rule," - " count %d\n", q->count); - } - } else { - if (!FORCE && - !TIME_LEQ( q->expire, time_uptime )) - goto next; - } - if (q->dyn_type != O_LIMIT_PARENT || !q->count) { - UNLINK_DYN_RULE(prev, V_ipfw_dyn_v[i], q); - continue; - } -next: - prev=q; - q=q->next; - } - } - if (pass++ < max_pass) - goto next_pass; -} - -void -ipfw_remove_dyn_children(struct ip_fw *rule) -{ - IPFW_DYN_LOCK(); - remove_dyn_rule(rule, NULL /* force removal */); - IPFW_DYN_UNLOCK(); -} - /* - * Lookup a dynamic rule, locked version. + * Lookup a dynamic rule */ static ipfw_dyn_rule * -lookup_dyn_rule_locked(struct ipfw_flow_id *pkt, int *match_direction, +lookup_dyn_rule_locked(struct ipfw_flow_id *pkt, int i, int *match_direction, struct tcphdr *tcp) { /* @@ -414,23 +309,17 @@ static ipfw_dyn_rule * #define MATCH_FORWARD 1 #define MATCH_NONE 2 #define MATCH_UNKNOWN 3 - int i, dir = MATCH_NONE; + int dir = MATCH_NONE; ipfw_dyn_rule *prev, *q = NULL; - IPFW_DYN_LOCK_ASSERT(); + IPFW_BUCK_ASSERT(i); - if (V_ipfw_dyn_v == NULL) - goto done; /* not found */ - i = hash_packet(pkt); - for (prev = NULL, q = V_ipfw_dyn_v[i]; q != NULL;) { + for (prev = NULL, q = V_ipfw_dyn_v[i].head; q; prev = q, q = q->next) { if (q->dyn_type == O_LIMIT_PARENT && q->count) - goto next; - if (TIME_LEQ(q->expire, time_uptime)) { /* expire entry */ - UNLINK_DYN_RULE(prev, V_ipfw_dyn_v[i], q); continue; - } + if (pkt->proto != q->id.proto || q->dyn_type == O_LIMIT_PARENT) - goto next; + continue; if (IS_IP6_FLOW_ID(pkt)) { if (IN6_ARE_ADDR_EQUAL(&pkt->src_ip6, &q->id.src_ip6) && @@ -463,17 +352,14 @@ static ipfw_dyn_rule * break; } } 
-next: - prev = q; - q = q->next; } if (q == NULL) goto done; /* q = NULL, not found */ if (prev != NULL) { /* found and not in front */ prev->next = q->next; - q->next = V_ipfw_dyn_v[i]; - V_ipfw_dyn_v[i] = q; + q->next = V_ipfw_dyn_v[i].head; + V_ipfw_dyn_v[i].head = q; } if (pkt->proto == IPPROTO_TCP) { /* update state according to flags */ uint32_t ack; @@ -556,44 +442,123 @@ ipfw_lookup_dyn_rule(struct ipfw_flow_id *pkt, int struct tcphdr *tcp) { ipfw_dyn_rule *q; + int i; - IPFW_DYN_LOCK(); - q = lookup_dyn_rule_locked(pkt, match_direction, tcp); + i = hash_packet(pkt, V_curr_dyn_buckets); + + IPFW_BUCK_LOCK(i); + q = lookup_dyn_rule_locked(pkt, i, match_direction, tcp); if (q == NULL) - IPFW_DYN_UNLOCK(); + IPFW_BUCK_UNLOCK(i); /* NB: return table locked when q is not NULL */ return q; } -static void -realloc_dynamic_table(void) +/* + * Unlock bucket mtx + * @p - pointer to dynamic rule + */ +void +ipfw_dyn_unlock(ipfw_dyn_rule *q) { - IPFW_DYN_LOCK_ASSERT(); + IPFW_BUCK_UNLOCK(q->bucket); +} +static int +resize_dynamic_table(struct ip_fw_chain *chain, int nbuckets) +{ + int i, k, nbuckets_old; + ipfw_dyn_rule *q; + struct ipfw_dyn_bucket *dyn_v, *dyn_v_old; + + /* Check if given number is power of 2 and less than 64k */ + if (nbuckets > 65536) + return 1; + + if ((nbuckets & (nbuckets - 1)) != 0) + return -1; + + CTR3(KTR_NET, "%s: resize dynamic hash: %d -> %d", __func__, + V_curr_dyn_buckets, nbuckets); + + /* Allocate and initialize new hash */ + dyn_v = malloc(nbuckets * sizeof(ipfw_dyn_rule), M_IPFW, + M_WAITOK | M_ZERO); + + for (i = 0 ; i < nbuckets; i++) + IPFW_BUCK_LOCK_INIT(&dyn_v[i]); + /* - * Try reallocation, make sure we have a power of 2 and do - * not allow more than 64k entries. In case of overflow, - * default to 1024. 
+ * Call upper half lock, as get_map() do to ease + * read-only access to dynamic rules hash from sysctl */ + IPFW_UH_WLOCK(chain); - if (V_dyn_buckets > 65536) - V_dyn_buckets = 1024; - if ((V_dyn_buckets & (V_dyn_buckets-1)) != 0) { /* not a power of 2 */ - V_dyn_buckets = V_curr_dyn_buckets; /* reset */ - return; + /* Acquire chain write lock to permit hash access + * for main traffic path without additional locks + */ + IPFW_WLOCK(chain); + + /* Save old values */ + nbuckets_old = V_curr_dyn_buckets; + dyn_v_old = V_ipfw_dyn_v; + + /* Skip relinking if array is not set up */ + if (V_ipfw_dyn_v == NULL) + V_curr_dyn_buckets = 0; + + /* Re-link all dynamic states */ + for (i = 0 ; i < V_curr_dyn_buckets ; i++) { + while (V_ipfw_dyn_v[i].head != NULL) { + /* Remove from current chain */ + q = V_ipfw_dyn_v[i].head; + V_ipfw_dyn_v[i].head = q->next; + + /* Get new hash value */ + k = hash_packet(&q->id, nbuckets); + q->bucket = k; + /* Add to the new head */ + q->next = dyn_v[k].head; + dyn_v[k].head = q; + } } - V_curr_dyn_buckets = V_dyn_buckets; - if (V_ipfw_dyn_v != NULL) - free(V_ipfw_dyn_v, M_IPFW); - for (;;) { - V_ipfw_dyn_v = malloc(V_curr_dyn_buckets * sizeof(ipfw_dyn_rule *), - M_IPFW, M_NOWAIT | M_ZERO); - if (V_ipfw_dyn_v != NULL || V_curr_dyn_buckets <= 2) - break; - V_curr_dyn_buckets /= 2; + + /* Update current pointers/buckets values */ + V_curr_dyn_buckets = nbuckets; + V_ipfw_dyn_v = dyn_v; + + IPFW_WUNLOCK(chain); + + IPFW_UH_WUNLOCK(chain); + + /* Start periodic callout on initial creation */ + if (dyn_v_old == NULL) { + callout_reset_on(&V_ipfw_timeout, hz, ipfw_dyn_tick, curvnet, 0); + return (0); } + + /* Destroy all mutexes */ + for (i = 0 ; i < nbuckets_old ; i++) + IPFW_BUCK_LOCK_DESTROY(&dyn_v_old[i]); + + /* Free old hash */ + free(dyn_v_old, M_IPFW); + + return 0; } +#if 0 +void +ipfw_prepare_dynamic(struct ip_fw_chain *chain) +{ + + if (V_ipfw_dyn_v != NULL) + return; + + resize_dynamic_table(chain, V_curr_dyn_buckets); +} +#endif + 
/** * Install state of type 'type' for a dynamic session. * The hash table contains two type of rules: @@ -605,33 +570,26 @@ ipfw_lookup_dyn_rule(struct ipfw_flow_id *pkt, int * - "parent" rules for the above (O_LIMIT_PARENT). */ static ipfw_dyn_rule * -add_dyn_rule(struct ipfw_flow_id *id, u_int8_t dyn_type, struct ip_fw *rule) +add_dyn_rule(struct ipfw_flow_id *id, int i, u_int8_t dyn_type, struct ip_fw *rule) { ipfw_dyn_rule *r; - int i; - IPFW_DYN_LOCK_ASSERT(); + IPFW_BUCK_ASSERT(i); - if (V_ipfw_dyn_v == NULL || - (V_dyn_count == 0 && V_dyn_buckets != V_curr_dyn_buckets)) { - realloc_dynamic_table(); - if (V_ipfw_dyn_v == NULL) - return NULL; /* failed ! */ - } - i = hash_packet(id); - r = uma_zalloc(ipfw_dyn_rule_zone, M_NOWAIT | M_ZERO); if (r == NULL) { printf ("ipfw: sorry cannot allocate state\n"); return NULL; } - /* increase refcount on parent, and set pointer */ + /* + * refcount on parent is already incremented, so + * it is safe to use parent unlocked. + */ if (dyn_type == O_LIMIT) { ipfw_dyn_rule *parent = (ipfw_dyn_rule *)rule; if ( parent->dyn_type != O_LIMIT_PARENT) panic("invalid parent"); - parent->count++; r->parent = parent; rule = parent->rule; } @@ -644,8 +602,8 @@ static ipfw_dyn_rule * r->count = 0; r->bucket = i; - r->next = V_ipfw_dyn_v[i]; - V_ipfw_dyn_v[i] = r; + r->next = V_ipfw_dyn_v[i].head; + V_ipfw_dyn_v[i].head = r; V_dyn_count++; DEB(print_dyn_rule(id, dyn_type, "add dyn entry", "total");) return r; @@ -656,40 +614,40 @@ static ipfw_dyn_rule * * If the lookup fails, then install one. 
*/ static ipfw_dyn_rule * -lookup_dyn_parent(struct ipfw_flow_id *pkt, struct ip_fw *rule) +lookup_dyn_parent(struct ipfw_flow_id *pkt, int *pindex, struct ip_fw *rule) { ipfw_dyn_rule *q; - int i; + int i, is_v6; - IPFW_DYN_LOCK_ASSERT(); + is_v6 = IS_IP6_FLOW_ID(pkt); + i = hash_packet( pkt, V_curr_dyn_buckets ); + *pindex = i; + IPFW_BUCK_LOCK(i); + for (q = V_ipfw_dyn_v[i].head ; q != NULL ; q=q->next) + if (q->dyn_type == O_LIMIT_PARENT && + rule== q->rule && + pkt->proto == q->id.proto && + pkt->src_port == q->id.src_port && + pkt->dst_port == q->id.dst_port && + ( + (is_v6 && + IN6_ARE_ADDR_EQUAL(&(pkt->src_ip6), + &(q->id.src_ip6)) && + IN6_ARE_ADDR_EQUAL(&(pkt->dst_ip6), + &(q->id.dst_ip6))) || + (!is_v6 && + pkt->src_ip == q->id.src_ip && + pkt->dst_ip == q->id.dst_ip) + ) + ) { + q->expire = time_uptime + V_dyn_short_lifetime; + DEB(print_dyn_rule(pkt, q->dyn_type, + "lookup_dyn_parent found", "");) + return q; + } - if (V_ipfw_dyn_v) { - int is_v6 = IS_IP6_FLOW_ID(pkt); - i = hash_packet( pkt ); - for (q = V_ipfw_dyn_v[i] ; q != NULL ; q=q->next) - if (q->dyn_type == O_LIMIT_PARENT && - rule== q->rule && - pkt->proto == q->id.proto && - pkt->src_port == q->id.src_port && - pkt->dst_port == q->id.dst_port && - ( - (is_v6 && - IN6_ARE_ADDR_EQUAL(&(pkt->src_ip6), - &(q->id.src_ip6)) && - IN6_ARE_ADDR_EQUAL(&(pkt->dst_ip6), - &(q->id.dst_ip6))) || - (!is_v6 && - pkt->src_ip == q->id.src_ip && - pkt->dst_ip == q->id.dst_ip) - ) - ) { - q->expire = time_uptime + V_dyn_short_lifetime; - DEB(print_dyn_rule(pkt, q->dyn_type, - "lookup_dyn_parent found", "");) - return q; - } - } - return add_dyn_rule(pkt, O_LIMIT_PARENT, rule); + /* Add virtual limiting rule */ + return add_dyn_rule(pkt, i, O_LIMIT_PARENT, rule); } /** @@ -704,12 +662,15 @@ ipfw_install_state(struct ip_fw *rule, ipfw_insn_l { static int last_log; ipfw_dyn_rule *q; + int i; DEB(print_dyn_rule(&args->f_id, cmd->o.opcode, "install_state", "");) + + i = hash_packet(&args->f_id, V_curr_dyn_buckets); 
- IPFW_DYN_LOCK(); + IPFW_BUCK_LOCK(i); - q = lookup_dyn_rule_locked(&args->f_id, NULL, NULL); + q = lookup_dyn_rule_locked(&args->f_id, i, NULL, NULL); if (q != NULL) { /* should never occur */ DEB( @@ -718,26 +679,22 @@ ipfw_install_state(struct ip_fw *rule, ipfw_insn_l printf("ipfw: %s: entry already present, done\n", __func__); }) - IPFW_DYN_UNLOCK(); + IPFW_BUCK_UNLOCK(i); return (0); } - if (V_dyn_count >= V_dyn_max) - /* Run out of slots, try to remove any expired rule. */ - remove_dyn_rule(NULL, (ipfw_dyn_rule *)1); - if (V_dyn_count >= V_dyn_max) { if (last_log != time_uptime) { last_log = time_uptime; printf("ipfw: %s: Too many dynamic rules\n", __func__); } - IPFW_DYN_UNLOCK(); + IPFW_BUCK_UNLOCK(i); return (1); /* cannot install, notify caller */ } switch (cmd->o.opcode) { case O_KEEP_STATE: /* bidir rule */ - add_dyn_rule(&args->f_id, O_KEEP_STATE, rule); + add_dyn_rule(&args->f_id, i, O_KEEP_STATE, rule); break; case O_LIMIT: { /* limit number of sessions */ @@ -745,6 +702,7 @@ ipfw_install_state(struct ip_fw *rule, ipfw_insn_l ipfw_dyn_rule *parent; uint32_t conn_limit; uint16_t limit_mask = cmd->limit_mask; + int pindex; conn_limit = (cmd->conn_limit == IP_FW_TABLEARG) ? tablearg : cmd->conn_limit; @@ -778,46 +736,54 @@ ipfw_install_state(struct ip_fw *rule, ipfw_insn_l id.src_port = args->f_id.src_port; if (limit_mask & DYN_DST_PORT) id.dst_port = args->f_id.dst_port; - if ((parent = lookup_dyn_parent(&id, rule)) == NULL) { + + /* + * We have to release lock for previous bucket to + * avoid possible deadlock + */ + IPFW_BUCK_UNLOCK(i); + + if ((parent = lookup_dyn_parent(&id, &pindex, rule)) == NULL) { printf("ipfw: %s: add parent failed\n", __func__); - IPFW_DYN_UNLOCK(); + IPFW_BUCK_UNLOCK(pindex); return (1); } if (parent->count >= conn_limit) { - /* See if we can remove some expired rule. 
*/ - remove_dyn_rule(rule, parent); - if (parent->count >= conn_limit) { - if (V_fw_verbose && last_log != time_uptime) { - last_log = time_uptime; - char sbuf[24]; - last_log = time_uptime; - snprintf(sbuf, sizeof(sbuf), - "%d drop session", - parent->rule->rulenum); - print_dyn_rule_flags(&args->f_id, - cmd->o.opcode, - LOG_SECURITY | LOG_DEBUG, - sbuf, "too many entries"); - } - IPFW_DYN_UNLOCK(); - return (1); + if (V_fw_verbose && last_log != time_uptime) { + last_log = time_uptime; + char sbuf[24]; + last_log = time_uptime; + snprintf(sbuf, sizeof(sbuf), + "%d drop session", + parent->rule->rulenum); + print_dyn_rule_flags(&args->f_id, + cmd->o.opcode, + LOG_SECURITY | LOG_DEBUG, + sbuf, "too many entries"); } + IPFW_BUCK_UNLOCK(pindex); + return (1); } - add_dyn_rule(&args->f_id, O_LIMIT, (struct ip_fw *)parent); + /* Increment counter on parent */ + parent->count++; + IPFW_BUCK_UNLOCK(pindex); + + IPFW_BUCK_LOCK(i); + add_dyn_rule(&args->f_id, i, O_LIMIT, (struct ip_fw *)parent); break; } default: printf("ipfw: %s: unknown dynamic rule type %u\n", __func__, cmd->o.opcode); - IPFW_DYN_UNLOCK(); + IPFW_BUCK_UNLOCK(i); return (1); } /* XXX just set lifetime */ - lookup_dyn_rule_locked(&args->f_id, NULL, NULL); + lookup_dyn_rule_locked(&args->f_id, i, NULL, NULL); - IPFW_DYN_UNLOCK(); + IPFW_BUCK_UNLOCK(i); return (0); } @@ -996,24 +962,87 @@ ipfw_dyn_send_ka(struct mbuf **mtailp, ipfw_dyn_ru } /* - * This procedure is only used to handle keepalives. It is invoked - * every dyn_keepalive_period + * This procedure is used to perform various maintance + * on dynamic hash list. Currently it is called every second. 
*/ static void -ipfw_tick(void * vnetx) +ipfw_dyn_tick(void * vnetx) { - struct mbuf *m0, *m, *mnext, **mtailp; - struct ip *h; - int i; - ipfw_dyn_rule *q; + struct ip_fw_chain *chain; + int check_ka = 0; #ifdef VIMAGE struct vnet *vp = vnetx; #endif CURVNET_SET(vp); - if (V_dyn_keepalive == 0 || V_ipfw_dyn_v == NULL || V_dyn_count == 0) - goto done; + chain = &V_layer3_chain; + + /* Run keepalive checks every keepalive_interval iff ka is enabled */ + if ((V_dyn_keepalive_last + V_dyn_keepalive_interval <= time_uptime) && + (V_dyn_keepalive != 0)) { + V_dyn_keepalive_last = time_uptime; + check_ka = 1; + } + + check_dyn_rules(chain, NULL, RESVD_SET, check_ka, 1); + + callout_reset_on(&V_ipfw_timeout, hz, ipfw_dyn_tick, vnetx, 0); + + CURVNET_RESTORE(); +} + + +/* + * Walk through all dynamic states doing generic maintenance: + * 1) free expired states + * 2) free all states based on deleted rule / set + * 3) send keepalives for states if needed + * + * @chain - pointer to current ipfw rules chain + * @rule - delete all states originated by given rule if != NULL + * @set - delete all states originated by any rule in set @set if != RESVD_SET + * @check_ka - perform checking/sending keepalives + * @timer - indicate call from timer routine. + * + * Timer routine must call this function unlocked to permit + * sending keepalives/resizing table. + * + * Others have to call this function with IPFW_UH_WLOCK held. 
+ * + * Write lock is needed to ensure that unused parent rules + * are not freed by other instance (see stage 2, 3) + */ +static void +check_dyn_rules(struct ip_fw_chain *chain, struct ip_fw *rule, + int set, int check_ka, int timer) +{ + struct mbuf *m0, *m, *mnext, **mtailp; + struct ip *h; + int i, new_buckets = 0, max_buckets; + int expired = 0, expired_limits = 0, parents = 0, total = 0; + ipfw_dyn_rule *q, *q_prev, *q_next; + ipfw_dyn_rule *exp_head, **exptailp; + ipfw_dyn_rule *exp_lhead, **expltailp; + + KASSERT(V_ipfw_dyn_v != NULL, ("%s: dynamic table not allocated", + __func__)); + + /* Avoid possible LOR */ + KASSERT(!check_ka || timer, ("%s: keepalive check with lock held", + __func__)); + + if (V_dyn_count == 0) + return; + + /* Expired states */ + exp_head = NULL; + exptailp = &exp_head; + + /* Expired limit states */ + exp_lhead = NULL; + expltailp = &exp_lhead; + /* * We make a chain of packets to go out here -- not deferring * until after we drop the IPFW dynamic rule lock would result @@ -1022,27 +1051,202 @@ static void */ m0 = NULL; mtailp = &m0; - IPFW_DYN_LOCK(); + + /* Protect from hash resizing */ + if (timer != 0) + IPFW_UH_WLOCK(chain); + else { + IPFW_UH_WLOCK_ASSERT(chain); + } + +#define NEXT_RULE() { q_prev = q; q = q->next ; continue; } + + /* Stage 1: perform requested deletion */ for (i = 0 ; i < V_curr_dyn_buckets ; i++) { - for (q = V_ipfw_dyn_v[i] ; q ; q = q->next ) { - if (q->dyn_type == O_LIMIT_PARENT) - continue; - if (TIME_LEQ(q->expire, time_uptime)) - continue; /* too late, rule expired */ + IPFW_BUCK_LOCK(i); + for (q = V_ipfw_dyn_v[i].head, q_prev = q ; q ; ) { + /* account every rule */ + total++; - if (q->id.proto != IPPROTO_TCP) + /* Skip parent rules at all */ + if (q->dyn_type == O_LIMIT_PARENT) { + parents++; + NEXT_RULE(); + } + + /* + * Remove rules which are: + * 1) expired + * 2) created by given rule + * 3) created by any rule in given set + */ + if ((TIME_LEQ(q->expire, time_uptime)) || + ((rule != NULL) && 
(q->rule == rule)) || + ((set != RESVD_SET) && (q->rule->set == set))) { + /* Unlink q from current list */ + if (q == V_ipfw_dyn_v[i].head) + V_ipfw_dyn_v[i].head = q->next; + else + q_prev->next = q->next; + q->next = NULL; + + /* queue q to expire list */ + if (q->dyn_type != O_LIMIT) { + *exptailp = q; + exptailp = &(*exptailp)->next; + DEB(print_dyn_rule(&q->id, q->dyn_type, + "unlink entry", "left"); + ) + } else { + /* Separate list for limit rules */ + *expltailp = q; + expltailp = &(*expltailp)->next; + expired_limits++; + DEB(print_dyn_rule(&q->id, q->dyn_type, + "unlink limit entry", "left"); + ) + } + + q = q_prev->next; + expired++; continue; - if ( (q->state & BOTH_SYN) != BOTH_SYN) - continue; - if (TIME_LEQ(time_uptime + V_dyn_keepalive_interval, - q->expire)) - continue; /* too early */ + } - mtailp = ipfw_dyn_send_ka(mtailp, q); + /* + * Check if we need to send keepalive: + * we need to ensure if is time to do KA, + * this is established TCP session, and + * expire time is within keepalive interval + */ + if ((check_ka != 0) && (q->id.proto == IPPROTO_TCP) && + ((q->state & BOTH_SYN) == BOTH_SYN) && + (TIME_LEQ(q->expire, time_uptime + + V_dyn_keepalive_interval))) + mtailp = ipfw_dyn_send_ka(mtailp, q); + + NEXT_RULE(); } + IPFW_BUCK_UNLOCK(i); } - IPFW_DYN_UNLOCK(); + /* Stage 2: decrement counters from O_LIMIT parents */ + if (expired_limits != 0) { + /* + * XXX: Note that deleting set with more than one + * heavily-used LIMIT rules can result in overwhelming + * locking due to lack of per-hash value sorting + * + * We should probably think about: + * 1) pre-allocating hash of size, say, + * MAX(16, V_curr_dyn_buckets / 1024) + * 2) checking if expired_limits is large enough + * 3) If yes, init hash (or its part), re-link + * current list and start decrementing procedure in + * each bucket separately + */ + + /* + * Small optimization: do not unlock bucket until + * we see the next item resides in different bucket + */ + if (exp_lhead != NULL) 
{ + i = exp_lhead->parent->bucket; + IPFW_BUCK_LOCK(i); + } + for (q = exp_lhead; q != NULL; q = q->next) { + if (i != q->parent->bucket) { + IPFW_BUCK_UNLOCK(i); + i = q->parent->bucket; + IPFW_BUCK_LOCK(i); + } + + /* Decrease parent refcount */ + q->parent->count--; + } + if (exp_lhead != NULL) + IPFW_BUCK_UNLOCK(i); + } + + /* + * We protectet ourselves from unused parent deletion by + * holding UH write lock. + */ + + /* Stage 3: remove unused parent rules */ + if ((parents != 0) && (expired != 0)) { + for (i = 0 ; i < V_curr_dyn_buckets ; i++) { + IPFW_BUCK_LOCK(i); + for (q = V_ipfw_dyn_v[i].head, q_prev = q ; q ; ) { + if (q->dyn_type != O_LIMIT_PARENT) + NEXT_RULE(); + + if (q->count != 0) + NEXT_RULE(); + + /* Parent rule without consumers */ + *exptailp = q; + exptailp = &(*exptailp)->next; + + DEB(print_dyn_rule(&q->id, q->dyn_type, + "unlink parent entry", "left"); + ) + + expired++; + + q = q->next; + } + IPFW_BUCK_UNLOCK(i); + } + } + +#undef NEXT_RULE + + /* + * Update total rules count. + * This can be slightly incorrect since we lock/unlock + * every bucket lock sequentally. + * + * However, this is good and regularly updated estimation + * for the total rules count. + */ + V_dyn_count = total - expired; + + /* + * Check if we need to resize hash: + * if current number of states exceeds number of buckes in hash, + * grow hash size to the minimum power of 2 which is bigger than + * current states count. Limit hash size by 64k. + */ + max_buckets = (V_dyn_buckets_max > 65536) ? 
65536 : V_dyn_buckets_max; + + if (V_dyn_count > V_curr_dyn_buckets * 2) { + new_buckets = V_curr_dyn_buckets; + while (new_buckets < V_dyn_count) { + new_buckets *= 2; + + if (new_buckets >= max_buckets) + break; + } + } + + if (timer != 0) + IPFW_UH_WUNLOCK(chain); + + /* Finally delete old states ad limits if any */ + for (q = exp_head; q != NULL; q = q_next) { + q_next = q->next; + uma_zfree(ipfw_dyn_rule_zone, q); + } + + for (q = exp_lhead; q != NULL; q = q_next) { + q_next = q->next; + uma_zfree(ipfw_dyn_rule_zone, q); + } + + /* The rest code should be called from timer routine only */ + if (timer == 0) + return; + /* Send keepalive packets if any */ for (m = m0; m != NULL; m = mnext) { mnext = m->m_nextpkt; @@ -1055,34 +1259,48 @@ static void ip6_output(m, NULL, NULL, 0, NULL, NULL, NULL); #endif } -done: - callout_reset_on(&V_ipfw_timeout, V_dyn_keepalive_period * hz, - ipfw_tick, vnetx, 0); - CURVNET_RESTORE(); + + /* Run table resize without holding any locks */ + if (new_buckets != 0) + resize_dynamic_table(chain, new_buckets); } +/* + * Deletes all dynamic rules originated by given rule or all rules in + * given set. Specify RESVD_SET to indicate set should not be used. + * @chain - pointer to current ipfw rules chain + * @rule - delete all states originated by given rule if != NULL + * @set - delete all states originated by any rule in set @set if != RESVD_SET + * + * Function has to be called with IPFW_UH_WLOCK held. 
+ */
 void
+ipfw_expire_dyn_rules(struct ip_fw_chain *chain, struct ip_fw *rule, int set)
+{
+
+	check_dyn_rules(chain, rule, set, 0, 0);
+}
+
+void
 ipfw_dyn_attach(void)
 {
 	ipfw_dyn_rule_zone = uma_zcreate("IPFW dynamic rule",
 	    sizeof(ipfw_dyn_rule), NULL, NULL, NULL, NULL,
 	    UMA_ALIGN_PTR, 0);
-
-	IPFW_DYN_LOCK_INIT();
 }
 
 void
 ipfw_dyn_detach(void)
 {
+
 	uma_zdestroy(ipfw_dyn_rule_zone);
-	IPFW_DYN_LOCK_DESTROY();
 }
 
 void
-ipfw_dyn_init(void)
+ipfw_dyn_init(struct ip_fw_chain *chain)
 {
 	V_ipfw_dyn_v = NULL;
-	V_dyn_buckets = 256; /* must be power of 2 */
+	V_dyn_buckets_max = 256; /* must be power of 2 */
 	V_curr_dyn_buckets = 256; /* must be power of 2 */
 
 	V_dyn_ack_lifetime = 300;
@@ -1095,32 +1313,55 @@ void
 	V_dyn_keepalive_interval = 20;
 	V_dyn_keepalive_period = 5;
 	V_dyn_keepalive = 1; /* do send keepalives */
+	V_dyn_keepalive = time_uptime;
 	V_dyn_max = 4096; /* max # of dynamic rules */
 
 	callout_init(&V_ipfw_timeout, CALLOUT_MPSAFE);
-	callout_reset_on(&V_ipfw_timeout, hz, ipfw_tick, curvnet, 0);
+
+	resize_dynamic_table(chain, V_curr_dyn_buckets);
 }
 
 void
 ipfw_dyn_uninit(int pass)
 {
-	if (pass == 0)
+	int i;
+
+	if (pass == 0) {
 		callout_drain(&V_ipfw_timeout);
-	else {
-		if (V_ipfw_dyn_v != NULL)
-			free(V_ipfw_dyn_v, M_IPFW);
+		return;
 	}
+
+	if (V_ipfw_dyn_v != NULL) {
+		/*
+		 * Skip deleting all dynamic states -
+		 * uma_zdestroy() does this more efficiently;
+		 */
+
+		/* Destroy all mutexes */
+		for (i = 0 ; i < V_curr_dyn_buckets ; i++)
+			IPFW_BUCK_LOCK_DESTROY(&V_ipfw_dyn_v[i]);
+		free(V_ipfw_dyn_v, M_IPFW);
+		V_ipfw_dyn_v = NULL;
+	}
 }
 
+/*
+ * Returns number of dynamic rules.
+ */
 int
 ipfw_dyn_len(void)
 {
+
 	return (V_ipfw_dyn_v == NULL) ? 0 :
 		(V_dyn_count * sizeof(ipfw_dyn_rule));
 }
 
+/*
+ * Fill given buffer with dynamic states.
+ * IPFW_UH_RLOCK has to be held while calling.
+ */
 void
-ipfw_get_dynamic(char **pbp, const char *ep)
+ipfw_get_dynamic(struct ip_fw_chain *chain, char **pbp, const char *ep)
 {
 	ipfw_dyn_rule *p, *last = NULL;
 	char *bp;
@@ -1130,9 +1371,11 @@ void
 		return;
 	bp = *pbp;
 
-	IPFW_DYN_LOCK();
-	for (i = 0 ; i < V_curr_dyn_buckets; i++)
-		for (p = V_ipfw_dyn_v[i] ; p != NULL; p = p->next) {
+	IPFW_UH_RLOCK_ASSERT(chain);
+
+	for (i = 0 ; i < V_curr_dyn_buckets; i++) {
+		IPFW_BUCK_LOCK(i);
+		for (p = V_ipfw_dyn_v[i].head ; p != NULL; p = p->next) {
 			if (bp + sizeof *p <= ep) {
 				ipfw_dyn_rule *dst =
 					(ipfw_dyn_rule *)bp;
@@ -1161,7 +1404,9 @@ void
 				bp += sizeof(ipfw_dyn_rule);
 			}
 		}
-	IPFW_DYN_UNLOCK();
+		IPFW_BUCK_UNLOCK(i);
+	}
+
 	if (last != NULL) /* mark last dynamic rule */
 		bzero(&last->next, sizeof(last));
 	*pbp = bp;
Index: sys/netpfil/ipfw/ip_fw2.c
===================================================================
--- sys/netpfil/ipfw/ip_fw2.c (revision 242524)
+++ sys/netpfil/ipfw/ip_fw2.c (working copy)
@@ -2046,7 +2046,7 @@ do { \
 				    f->rulenum, f->id);
 				cmd = ACTION_PTR(f);
 				l = f->cmd_len - f->act_ofs;
-				ipfw_dyn_unlock();
+				ipfw_dyn_unlock(q);
 				cmdlen = 0;
 				match = 1;
 				break;
@@ -2637,7 +2637,7 @@ vnet_ipfw_init(const void *unused)
 	chain->id = rule->id = 1;
 
 	IPFW_LOCK_INIT(chain);
-	ipfw_dyn_init();
+	ipfw_dyn_init(chain);
 
 	/* First set up some values that are compile time options */
 	V_ipfw_vnet_ready = 1; /* Open for business */

--------------000406010204050104020709--
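
For anyone who wants to check the hash-resize rule from check_dyn_rules() outside the kernel, here is a minimal userland sketch. It is not part of the patch: the function name next_bucket_count() and the HASH_LIMIT constant are made up for illustration, and plain parameters stand in for V_curr_dyn_buckets, V_dyn_count and V_dyn_buckets_max. The logic is meant to mirror the "if (V_dyn_count > V_curr_dyn_buckets * 2)" block in the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the 64k cap that the patch hard-codes. */
#define	HASH_LIMIT	65536u

/*
 * Hypothetical helper, not part of the patch.
 * Returns the bucket count to resize to, or 0 if no resize is needed.
 * cur_buckets is assumed to be a non-zero power of 2.
 */
static uint32_t
next_bucket_count(uint32_t cur_buckets, uint32_t dyn_count,
    uint32_t buckets_max)
{
	uint32_t max_buckets, new_buckets = 0;

	/* Doubling from 0 would never reach dyn_count. */
	assert(cur_buckets != 0);

	max_buckets = (buckets_max > HASH_LIMIT) ? HASH_LIMIT : buckets_max;

	/* Grow only when states exceed twice the current bucket count. */
	if (dyn_count > cur_buckets * 2) {
		new_buckets = cur_buckets;
		while (new_buckets < dyn_count) {
			new_buckets *= 2;

			/* Stop doubling once the configured cap is reached. */
			if (new_buckets >= max_buckets)
				break;
		}
	}

	return (new_buckets);
}
```

With 256 buckets and a cap of 65536, 512 states leave the table alone, 513 states ask for 1024 buckets, and a very large state count stops at 65536.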