From: pmc@citylink.dinoex.sub.org (Peter Much)
Date: Fri, 17 Jul 2009 19:12:06 GMT
To: freebsd-stable@FreeBSD.ORG
Subject: Can an app crash from a single TCP packet lost in transmission?

The first thing I noticed was that my nameserver was gone. I searched for the reason and found:

>Jul 15 04:04:52 edge kernel: swap_pager_getswapspace(3): failed
< ... hundreds more of these ... >
>Jul 15 04:05:07 edge kernel: pid 47113 (named), uid 53, was killed: out of swap space

That didn't make sense: the machine has enough swap space. But since this repeated every other night, I started logging ps output once a minute. And so I found a postgres database backup going weird:

03:23 70 78433 78432 0  96 0   8220  4196 -      R  ??  0:22.84 pg_dump -b
< ... >
03:49 70 78433 78432 0  96 0   8220  4024 -      R  ?? 17:06.61 pg_dump -b
03:50 70 78433 78432 0  96 0   8220  4024 -      R  ?? 17:46.15 pg_dump -b
03:51 70 78433 78432 0  96 0   8220  4024 -      R  ?? 18:26.69 pg_dump -b
03:52 70 78433 78432 0  47 0 139292 57888 select S  ?? 18:37.65 pg_dump -b
03:53 70 78433 78432 0  48 0 139292 57828 select S  ?? 18:40.36 pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49 pg_dump -b
03:55 70 78433 78432 0 -20 0 401436 63232 swread DL ?? 18:43.99 pg_dump -b

That process starts with 8 MB of memory and runs that way for half an hour; then suddenly, between 03:51 and 03:52, memory usage explodes.
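For anyone who wants to spot the same kind of jump in their own per-minute ps log, here is a minimal sketch. It assumes the line layout shown above, with my logging timestamp as the first field and the virtual size in KB as the eighth; the function names and the growth factor of 4 are made up for illustration.

```python
# Three samples copied from the ps log above; the first field is the
# logging timestamp that was prepended to each ps line.
SAMPLES = """\
03:51 70 78433 78432 0 96 0 8220 4024 - R ?? 18:26.69 pg_dump -b
03:52 70 78433 78432 0 47 0 139292 57888 select S ?? 18:37.65 pg_dump -b
03:54 70 78433 78432 0 -20 0 401436 69092 swread DL ?? 18:42.49 pg_dump -b
"""


def vsz_by_minute(text):
    """Return (timestamp, VSZ in KB) pairs, assuming VSZ is the 8th field."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        rows.append((fields[0], int(fields[7])))
    return rows


def first_jump(rows, factor=4):
    """Return the first timestamp whose VSZ is >= factor times the previous one."""
    for (_, prev), (stamp, cur) in zip(rows, rows[1:]):
        if cur >= factor * prev:
            return stamp
    return None


rows = vsz_by_minute(SAMPLES)
print(rows)
print(first_jump(rows))  # 03:52
```

With the samples above, the first flagged minute is 03:52, which matches the reading of the log by eye.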
That night it did not run out of swap space; instead it gave an error message:

>pg_dump: Error message from server: lost synchronization with server:
> got message type "0", length 154143043
>pg_dump: The command was: COPY public.file (fileid, fileindex, jobid,
> pathid, filenameid, markid, lstat, md5) TO stdout;

But at that time the backup is right in the middle of dumping a table containing lots of small records; there is no reason why a 154 MB "message" should be transferred between server and client while copying records of ~60 bytes each.

One other thing did happen between 03:51 and 03:52: the DSL internet connection disconnected and reconnected, obtaining a new IP address. Afterwards, a script flushes and reloads an ipfw table() with the new local addresses, and during this process one(!) packet of the database session was dropped.

I could verify that relation: every night when there were memory problems, a few packets from the database backup were lost during the firewall reconfiguration; on nights when no packets were lost, there were no memory problems.

I will now change the firewall handling to get rid of that packet loss, but I also need a refresher on how TCP works: I thought TCP would not be disturbed by a lost packet, but would automatically resend that packet until an ACK is received; and I thought this would happen below the application, so in practice the application CANNOT go weird from a lost packet... Is there any reason why this would not be true on a localhost connection?

rgds,
PMc
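
PS: A note on the 154143043 in that error message, offered as a sketch under assumptions rather than a diagnosis. In the PostgreSQL version 3 wire protocol, each message from the server starts with a 1-byte type followed by a 4-byte big-endian length. Turning 154143043 back into four bytes gives 09 30 09 43, i.e. TAB, "0", TAB, "C", which looks like tab-separated COPY row text and not like a real header. That would be consistent with the client having lost its place in the byte stream and then trying to buffer a 154 MB "message". The helper name parse_header and the sample fragment below are hypothetical, chosen only to reproduce the numbers from the error.

```python
import struct


def parse_header(buf):
    """Read a 1-byte message type and a 4-byte big-endian length from buf.

    This mirrors the general framing of protocol v3 messages; it is an
    illustration, not libpq's actual parsing code.
    """
    msg_type = buf[0:1].decode("ascii")
    (length,) = struct.unpack(">i", buf[1:5])
    return msg_type, length


# The length from the error message, turned back into its four wire bytes.
raw = struct.pack(">i", 154143043)
print(raw)  # b'\t0\tC'

# Hypothetical fragment of COPY row text (assumption: columns separated by
# tabs), picked so that it matches the type and length in the error.
fragment = b"0\t0\tC"
print(parse_header(fragment))  # ('0', 154143043)
```

If that reading is right, the interesting question is no longer the memory use but how the stream got out of step in the first place.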