Date:      Fri, 14 Jun 2019 22:13:17 +0200
From:      Peter <pmc@citylink.dinoex.sub.org>
To:        freebsd-ipfw@freebsd.org
Subject:   Re: ipfw: switching sets does stall the machine
Message-ID:  <20190614201317.GA8840@gate.oper.dinoex.org>
In-Reply-To: <20190614172018.GJ1219@albert.catwhisker.org>
References:  <20190614153302.GA4503@gate.oper.dinoex.org> <20190614172018.GJ1219@albert.catwhisker.org>

On Fri, Jun 14, 2019 at 10:20:18AM -0700, David Wolfskill wrote:
! On Fri, Jun 14, 2019 at 05:33:02PM +0200, Peter wrote:
! > 
! > Hi,
! > I am trying to use two different configurations (production and test)
! > loaded into different sets, and switch between them with
! > 
! >    # ipfw set disable ... enable ...
! > 
! > When testing my script, this did work, except once the machine went
! > into "swap_pager indefinite wait" and was lost.
! 
! IIRC, this message means that a command was sent to a disk controller
! and at least 20 seconds have elapsed with no response from that
! controller.  That doesn't seem like an "ipfw" issue, per se.

Yes, it usually means that the disk controller has gone fishing, and
that it is recommended to hit the reset button.

! > Then, after reboot (and automatically loading the production rules) I
! > tried to load and switch to the test rules, and immediately got ATA
! > COMMAND TIMEOUT and the machine was lost.
! 
! Again, that's a disk subsystem (apparently) doing Bad Things.

Yes. But not in this case.

! > I repeated this a few times, it is nicely reproducible: within 3-5
! > seconds after the new rules are loaded, the machine locks up and is
! > lost.
! 
! It's at least plausible that the catalyzing activity causes a certain
! disk I/O pattern that does the actual triggering (I expect).

No. The actual reason is an endless loop in the rule processing.
Network interrupts should run at a higher priority than disk, so
the disks do not get serviced anymore. The machine is old and has
only 2 CPUs.

! My inclination is for you to check the disk drive(s), cabling, and
! controller(s) before much else.

Yeah, been there tonight. Checked and reseated all cabling etc.

But now I have figured it out:
1. As usual, the bug was sitting in front of the keyboard.
2. I now tried deleting the old rules immediately after their sets
   are disabled: problem solved, failure gone.
   Obviously all the open network sessions held by keep-state are
   also gone at that point, so this is not really what I wanted.
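
For reference, the workaround from point 2 could be scripted roughly
like this. It is only a sketch: the function name switch_sets and the
set numbers 1 (production) and 2 (test) are assumptions, not taken
from my actual script. By default it only prints the commands; set
IPFW=ipfw to really run them.

```shell
#!/bin/sh
# Sketch only: switch_sets and the set numbers are hypothetical.
# Dry run by default: the commands are echoed unless IPFW=ipfw is set.
switch_sets() {
    old_set=$1
    new_set=$2
    # Disable the old set and enable the new one in a single command.
    ${IPFW:-echo ipfw} set disable "$old_set" enable "$new_set" &&
    # Delete the old rules right away; as noted above, this also
    # drops the sessions that were held open via keep-state.
    ${IPFW:-echo ipfw} delete set "$old_set"
}

switch_sets 1 2
```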

What happens:

1. The production and test configurations are not exactly identical.
   That is the whole point of it: if they had to be identical,
   there would be no need for two of them.

2. There are dynamic rules involved. These do not disappear on a
   "set disable". They stay and continue to function - somehow.

3. When a packet successfully matches a check-state, it does NOT
   continue to be processed at the rule following that check-state.
   Instead, it does continue to be processed at the place after
   the parent keep-state rule that was originally matched!

   But what if that keep-state rule is now disabled, and the new
   rules do not line up in their numbering in the exact same way?
   Then this packet appears at some arbitrary place in the rule
   list and may go wherever.

   Obviously this is not an issue if you do keep-state with simple
   Allow or Deny rules - then the packets leave the system after
   matching.
   But such simple keep-state rules do not work with NAT. For NAT one
   needs a more elaborate approach, like tagging and branching and
   subroutine calling.
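
To illustrate point 3, here is a toy model in Python. It is not the
kernel's logic and not my real ruleset: the rule numbers, the set
numbers and the generic "jump" action (a stand-in for whatever
branching a real configuration uses) are all made up. It only shows
that a packet which resumes after the number of its old parent rule
can end up on a path that never terminates once the other set is
active.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    num: int
    set_id: int
    action: str      # "check-state", "keep-state", "jump" or "allow"
    target: int = 0  # rule number to continue at, used by "jump" only


# Hypothetical rules: set 1 = production, set 2 = test.
RULES = [
    Rule(100, 1, "check-state"), Rule(200, 1, "keep-state"), Rule(300, 1, "allow"),
    Rule(100, 2, "check-state"), Rule(150, 2, "keep-state"),
    Rule(250, 2, "jump", 100), Rule(400, 2, "allow"),
]


def first_enabled(rules, enabled_set, start):
    """Return the lowest-numbered enabled rule with num >= start, or None."""
    candidates = [r for r in rules if r.set_id == enabled_set and r.num >= start]
    return min(candidates, key=lambda r: r.num, default=None)


def trace(rules, enabled_set, parent_num, max_steps=12):
    """Trace a packet that matched a dynamic rule created by rule parent_num."""
    path = []
    pos = parent_num + 1          # resume after the parent keep-state rule
    for _ in range(max_steps):
        rule = first_enabled(rules, enabled_set, pos)
        if rule is None:
            return path, "end of list"
        path.append(rule.num)
        if rule.action == "allow":
            return path, "allow"
        if rule.action == "jump":
            pos = rule.target
        elif rule.action == "check-state":
            pos = parent_num + 1  # in this model the old state matches again
        else:
            pos = rule.num + 1
    return path, "loop"           # step limit hit: the packet never left


print(trace(RULES, 1, 200))  # production set active, parent rule 200
print(trace(RULES, 2, 200))  # test set active, same old dynamic rule
```

With set 1 active the packet goes straight to rule 300 and is allowed.
With set 2 active it lands on rule 250, which was never meant for it,
and bounces between 250 and 100 until the step limit stops the model.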
   
So the outcome is: 
   
   When switching sets with such a configuration that introduces
   branches and subroutines, the old and new rules need to precisely
   line up to each other, so that the old dynamic rules (which should
   be kept for the network sessions to persist) can reinsert their
   matched packets at places where correct further processing happens.

   Doesn't seem like an easy task...


