Date: Fri, 14 Jun 2019 22:13:17 +0200
From: Peter
To: freebsd-ipfw@freebsd.org
Subject: Re: ipfw: switching sets does stall the machine
Message-ID: <20190614201317.GA8840@gate.oper.dinoex.org>
References: <20190614153302.GA4503@gate.oper.dinoex.org> <20190614172018.GJ1219@albert.catwhisker.org>
In-Reply-To: <20190614172018.GJ1219@albert.catwhisker.org>

On Fri, Jun 14, 2019 at 10:20:18AM -0700, David Wolfskill wrote:
! On Fri, Jun 14, 2019 at 05:33:02PM +0200, Peter wrote:
! >
! > Hi,
! > I am trying to use two different configurations (production and test)
! > loaded into different sets, and switch between them with
! >
! > # ipfw set disable ... enable ...
! >
! > When testing my script, this did work, except once the machine went
! > into "swap_pager indefinite wait" and was lost.
!
! IIRC, this message means that a command was sent to a disk controller
! and at least 20 seconds have elapsed with no response from that
! controller. That doesn't seem like an "ipfw" issue, per se.

Yes, it usually means that the disk controller has gone fishing, and
that it is recommended to hit the reset button.

! > Then, after reboot (and automatically loading the production rules) I
! > tried to load and switch to the test rules, and immediately got ATA
! > COMMAND TIMEOUT and the machine was lost.
!
! Again, that's a disk subsystem (apparently) doing Bad Things.

Yes. But not in this case.

! > I repeated this a few times, it is nicely reproducible: within 3-5
! > seconds after the new rules are loaded, the machine locks up and is
! > lost.
!
! It's at least plausible that the catalyzing activity causes a certain
! disk I/O pattern that does the actual triggering (I expect).

No. The actual reason is an endless loop in the rule processing.
Network interrupts should run at a higher priority than disk, so the
disks do not get serviced anymore. The machine is old and has only
2 CPUs.

! My inclination is for you to check the disk drive(s), cabling, and
! controller(s) before much else.

Yeah, been there tonight. Checked and reseated all cabling etc.
But now it turns out:

1. As usual, the bug was sitting in front of the keyboard.

2. I now tried to delete the old rules immediately after their sets
   are disabled: problem solved, failure gone.
   Obviously all the open network sessions held by keep-state are
   also gone at that point, so this is not really what I wanted.

What happens:

1. The production and test configurations are not exactly identical.
   That is the whole point: if they had to be identical, there would
   be no need for two of them.

2. There are dynamic rules involved. These do not disappear on a
   "set disable". They stay and continue to function - somehow.

3. When a packet successfully matches a check-state, it does NOT
   continue to be processed at the rule following that check-state.
   Instead, it continues to be processed at the place after the
   parent keep-state rule that was originally matched!
   But what if that keep-state rule is now disabled, and the new
   rules do not line up in their numbering in exactly the same way?
   Then this packet appears at some arbitrary place in the rule list
   and may go anywhere.

Obviously this is not an issue if you do keep-state with simple Allow
or Deny rules - then the packets leave the system after matching.
But such simple keep-state rules do not work with NAT. For NAT one
needs a more elaborate approach, like tagging, branching and
subroutine calls.

So the outcome is: when switching sets with a configuration that
introduces branches and subroutines, the old and new rules need to
line up precisely with each other, so that the old dynamic rules
(which should be kept for the network sessions to persist) can
reinsert their matched packets at places where correct further
processing happens.

Doesn't seem like an easy task...
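
The re-entry behaviour described above can be illustrated with a toy model. This is only a sketch of the behaviour as reported in this post, not of the kernel's actual implementation. The function name `resume_rule`, the rule numbers and the rule labels are all made up for illustration (the labels are plain descriptions, not ipfw syntax). It assumes that a packet matching an old dynamic rule resumes at the first active rule numbered above the parent keep-state rule.

```python
from bisect import bisect_right


def resume_rule(active_rules, parent_number):
    """Return (number, label) of the first active rule numbered above
    parent_number, or None if there is no such rule.

    active_rules maps rule numbers to descriptive labels (hypothetical).
    """
    numbers = sorted(active_rules)
    i = bisect_right(numbers, parent_number)
    if i == len(numbers):
        return None
    return numbers[i], active_rules[numbers[i]]


# Hypothetical production set: the state is created by rule 200.
production = {
    100: "check-state",
    200: "tag packet, keep-state",
    300: "nat tagged packets",
    400: "allow",
}

# Hypothetical test set: same idea, but numbered differently.
test = {
    100: "check-state",
    120: "tag packet, keep-state",
    130: "nat tagged packets",
    140: "allow",
    300: "deny and log everything else",
}

# Dynamic rule left over from production; its parent is rule 200.
old_parent = 200

print(resume_rule(production, old_parent))  # (300, 'nat tagged packets')
print(resume_rule(test, old_parent))        # (300, 'deny and log everything else')
```

With the production set active, the packet continues at the NAT rule as intended. After the switch, the same surviving state drops the packet into a rule that was never meant to see it, which is why the numbering of both sets would have to line up.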