Date: Wed, 3 May 2006 11:25:43 +0100 (BST)
From: Robert Watson
To: Kris Kennaway
Cc: Mike Jakubik, stable@freebsd.org
Subject: Re: quota deadlock on 6.1-RC1
Message-ID: <20060503110503.O58458@fledge.watson.org>
In-Reply-To: <20060502182302.GA92027@xor.obsecurity.org>

On Tue, 2 May 2006, Kris Kennaway wrote:

>>>> Ditto, same thing with the recent nve fixes. Why release known broken
>>>> code when there are tested patches available? Whats the worst that will
>>>> happen? It wont work? Thats already the case...
<...>
>
> OK, I can't speak to that issue specifically.
>
> Generally, though, the worst that can happen is "you fix one problem
> affecting a subset of users and replace it with a larger problem affecting a
> larger subset of users".
>
> If there's doubt about the impact of a change, 10 seconds before the release
> is not the appropriate time to cram it in.
<...>

I just want to comment a bit on this issue, because I've seen a number of posts on FreeBSD mailing lists over the last few years that suggest that there may be some misunderstandings about software development and release processes.

The invariant that needs to be understood is that all software is buggy. Arguments have been made that the number of bugs increases linearly with code size, and there have also been arguments made that the number of bugs increases with code complexity, so you can see a non-linear increase in bugs with code growth. This means that you're talking about several bugs per thousand lines of code in most software, and for code bases that contain millions of lines of code (such as the FreeBSD kernel, Linux kernel, Apache, PHP, MySQL, PostgreSQL, Windows, Word, iTunes, etc), you're talking thousands or tens of thousands of bugs. And that's in a static version of the code, not even taking into account new features in an active code base that are still being "debugged"!

Bugs fall into a lot of different categories, but from the perspective of risk management, it's useful to think of them in two categories: latent bugs, which are unreported, unobserved, or occur only in exceptional or generally untriggered circumstances, and non-latent bugs, which have been reported, are triggered in practice, etc. The tricky ones are the latent bugs, because you may not know that they are there, or you may know that they are there but they trigger so infrequently, or in such unusual edge cases, that they might almost as well not be there.

Release engineering is really about two things: structuring/nurturing the process of developing releases (tracking issues, identifying people to fix them, testing, branch management, building, etc), and risk management. The risk management aspect is that you want to improve the quality of the release by taking actions, typically adopting source changes, which may improve testing results. Each change potentially affects both visible and latent bugs. Bug fixes in one piece of code may change the timing of the code, the side effects, undocumented assumptions, or simply allow access to code previously not executed because the bug prevented it. If you allow a bug fix into the tree, you risk uncovering new bugs. So the choice isn't "Accept a bug fix or not", it's "Will accepting this bug fix generally improve or reduce the quality of the release?" -- i.e., will the change fix the bug it is claimed to fix, and will it result in lots of latent bugs suddenly becoming visible.

Particularly with hardware drivers like nve, this is non-trivial, because the behavior of the hardware is very subtle, there's lots of variety in the shipped hardware, and the vendor is (or appears) highly unsupportive. The result is that if you tweak a register or minor piece of behavior, it may dramatically improve support for a particular piece of hardware, but break all the rest.

The only way to mitigate this risk is through extensive testing, and extensive testing takes a lot of time. And by a lot of time, I mean a long release cycle. So if we want to adopt a fix that is high risk -- i.e., one that is believed likely to interact in subtle ways that affect different machines differently -- we need to make the change early in the release cycle, not at the end. If we make it at the end, we are shipping code that is effectively untested on a large number of systems. Sure, it will fix one, but if it breaks the rest, is it worth it?

The only alternative is to restart the testing process, which in the case of high-risk drivers means adding months to the release cycle. And you can see where this is leading: if you significantly delay the release cycle for each minor bug, you will never release. At some point, you have to make the decision "although this release isn't perfect, we'll never release if we don't ship now". I know that sounds like a bad thing, but you'll find that the practice is not only found in every part of the software industry, but it's also impossible to avoid, since bug-free software is impossible to achieve.

When you look at the RC2 release notes Scott recently sent, he identifies four bugs that he believes won't be fixed in time for the release. He decided that this was the case using risk management: each bug actually likely represents several bugs with the same features, in highly complex code. This means that they will take a significant amount of time to fix, and that each fix is high risk, as it is likely to reveal latent bugs. This means that each fix will require a lot of testing -- months of testing, in fact. So the choice is really: do we release 6.1, or do we skip it and do a 6.2 in a few months? As the release engineer, Scott has concluded that releasing now offers a great benefit to many people, although the bugs present may penalize some. Mind you, in some cases the bugs also exist in 6.0, so they don't represent regressions so much as bugs that continue to persist.

I agree with his conclusion: things like locking interactions in VFS are incredibly complicated, requiring extensive analysis and work to fix and test. Trying to fix them for 6.1 is unrealistic. They can be fixed in the next few weeks, tested for a month or two, and then merged to the RELENG_6_1 branch as errata fixes, similar to security advisories. It's all about trade-offs.

People are welcome to (and frequently do) disagree with our analysis and choice on the trade-offs, but what I'm trying to emphasize in this e-mail is that these trade-offs are a reality. They can't be ignored: bug-free releases of software can't be shipped because they don't exist, and therefore the argument (decision) is always about where the right balance is. Arguing for waiting to ship until every last bug is fixed is arguing never to release software -- bugs are present in all software, and not all of them are latent either -- that's why products have errata notes (as does FreeBSD), patch levels, etc. Don't take this to mean that we don't think fixing bugs is important, or that we don't spend long days and nights (and more days and more nights) working on it.

FWIW, if you look at the release process of any other commercial or open source software product, you'll see the same thing. Either there's no bug database, or there's a very large database. If there's no database, it's because the developer isn't being honest about there being bugs, or they have no testing. If there's a huge database, they are being honest, and those bugs are not all going to get fixed. Software authors select bugs to fix based on the impact of the bugs and their ability to fix them. I'd like to think we care more than some, but caring isn't enough to make computer software development perfect, or it would have happened a long time ago :-).

Thanks,

Robert N M Watson