Date: Tue, 16 Jan 2007 16:20:48 -0500
From: Kris Kennaway
To: Doug Ambrisko
Cc: Scott Oertel, Willem Jan Withagen, freebsd-stable@freebsd.org, Kris Kennaway
Subject: Re: running mksnap_ffs
Message-ID: <20070116212048.GA1041@xor.obsecurity.org>
In-Reply-To: <200701162117.l0GLHXOS062816@ambrisko.com>
References: <20070116203739.GA343@xor.obsecurity.org> <200701162117.l0GLHXOS062816@ambrisko.com>

On Tue, Jan 16, 2007 at 01:17:33PM -0800, Doug Ambrisko wrote:
> Kris Kennaway writes:
> | On Tue, Jan 16, 2007 at 09:26:47PM +0100, Willem Jan Withagen wrote:
> | > Doug Ambrisko wrote:
> | > >| > or things can get wedged. We have some other patches as well
> | > >| > that might be required. As a hack on a local server we have
> | > >| > been using snapshots to do a "hot" back-up of a database each
> | > >| > morning. This is based on 6.x.
> | > >|
> | > >| What do you mean by "get wedged"? Are you seeing a deadlock, and if
> | > >| so then what are the details? When you say 6.x, do you mean
> | > >| up-to-date RELENG_6? There were various snapshot deadlock fixes
> | > >| committed over the past year including some in the past few months.
> | > >
> | > >The file system would come to a stop, processes stuck on bio,
> | > >snapshots not finishing etc. This was caused by the system running
> | > >out of usable buffers. The change forces them to be flushed every so
> | > >often. This is independent of locking. 10 might be too aggressive.
> | > >Some scaling of nbuf would probably be better.
> | >
> | > When I run mksnap_ffs it runs to the point where ANY access to the
> | > filesystem gives that process a lockup.
> |
> | Yes, that is expected. Actually it begins when something accesses the
> | directory in which the snapshot is being made, since that causes the
> | parent directory to be locked...then something tries to access the
> | parent directory, which eventually cascades back to the root.
> |
> | > Getting the file system back is only possible through a "hard
> | > reboot". Trying to do it the gentle way locks the whole system.
> |
> | Or waiting until the snapshot operation finishes. You (still) haven't
> | determined that it's actually hanging as opposed to just waiting for
> | the snapshot operation to finish.
>
> In my case it was easy to see that all the buffers were exhausted and
> the system was churning waiting for some to become available. Since they
> were all used up it never recovered. By sync'ing the buffers they got
> cleaned up and then the system never ran out.
> The snapshot was then able to finish. Via the debugger you can see
> this happen. I traced this problem in the debugger. There are other
> issues with the buffer daemon as well. We hit these since we run with
> a relatively low nbuf. The buffers can get frag'ed so badly that it
> can't flush things since it can't get a full-size buffer. Another
> problem is that it can end up waiting on itself since the current code
> can't use its emergency space to flush stuff. You can see this via ps
> etc. It's not a good thing if the buffer daemon is waiting on itself :-(
>
> We have patches for this as well but they need some more work. I was
> working with Tor on this but then I got swamped at work with our
> 4.X -> 6.X and platform transition. All I can say is that we don't
> suffer from these problems now :-) I have printfs that log this stuff
> when some of these bugs are hit. Now the system survives those lock-up
> points.

Thanks for clarifying. Hopefully you and Tor can get something
committed soon!

Kris