Date:      Mon, 1 Dec 2008 12:11:50 -0800
From:      Jo Rhett <jrhett@netconsonance.com>
To:        Ken Smith <kensmith@cse.buffalo.edu>
Cc:        freebsd-stable Stable <freebsd-stable@freebsd.org>
Subject:   Re: Can I get a committer to mark this bug as blocking 6.4-RELEASE ?
Message-ID:  <C6BDF7BE-52D2-41EB-BD9B-0371B0DD0962@netconsonance.com>
In-Reply-To: <1228159822.15856.45.camel@bauer.cse.buffalo.edu>
References:  <A5A9A4D4-CD16-45FA-A2AC-62C4B5AE976D@netconsonance.com> <BEBF7B15-DECE-4872-9687-4AD4BE65DB05@netconsonance.com> <84E1EC10-5323-4A8C-AD60-31142621DB32@netconsonance.com> <200810271151.47366.jhb@freebsd.org> <C6DC3DB1-40FF-4896-81DB-EF37874428AF@netconsonance.com> <280616DD-A58F-4AE5-AB03-92C5F2C244EC@netconsonance.com> <1227733967.83059.1.camel@neo.cse.buffalo.edu> <EC872352-4A50-404E-A93E-DBA5FCAA1431@netconsonance.com> <1228159822.15856.45.camel@bauer.cse.buffalo.edu>

On Dec 1, 2008, at 11:30 AM, Ken Smith wrote:
> Both John and Xin Li have chimed in on the two threads I've seen that
> are related to this specific topic.  John diagnosed it as an issue with
> the BIOS.  That's what makes it a nebulous problem.  When working on
> those sorts of things most people liken it to "Whack-a-mole".

Diagnosed without testing.  John never asked me for any more
information than the page-fault description.  When I asked what else
to test and offered to supply systems for testing, he stopped
responding.  Xin Li proposed a work-around that would have castrated
the systems.  It might work, but it wasn't a useful workaround, so I
deferred testing it and focused on trying to get someone to address
the real problem.

>> This is a very big problem that will affect thousands of FreeBSD
>> servers.
>
> It's still not clear it will affect thousands of servers.

Um... Rackable.  Rackable ships cabinets full of systems to people
who run FreeBSD.  They don't sell to home or small corporate users,
period.  Any problem that affects a standard Rackable build will by
definition affect thousands of systems (much like any standard Dell
or HP server build).

> This all left me with a decision.  My choices were to back out the BTX
> changes that were known to fix boot issues with certain motherboards  
> and
> enabled booting from USB devices or leave things as they are.

Or do some more testing, determine the problem, and fix it.  I had a
stack of systems demonstrating the problem.  I could have shipped one
to any FreeBSD developer you wanted to work on it.  If you were
willing to identify the affected source code and the relevant gdb
traps, I would have happily worked on the source directly if that is
what it took.

I would test.  I would supply console access and build systems.  I  
would ship them to anyone who wanted one in their hot little hands.  I  
would investigate the source code myself with a mere hour of "here's  
the relevant bits you need to consider" training.

You could have done *anything* that suited your needs for testing.   
Instead you did nothing.

> The
> motherboards that didn't boot with the older code had no work-around.
> The motherboards that did boot with the older code but not the newer
> code do have a work-around (use the old loader).

Not true.  I tested this by installing the old loader, and it did not
change the problem.  As reported.

> Decisions like that
> suck, no matter which choice I make it's wrong.  Holding the release
> until all bios issues get resolved isn't a viable option because of  
> the
> "Whack-a-mole" thing mentioned above.  Fix it for one and two  
> break.  It
> takes a lot of time/work to settle into what seems to work for the
> widest set of machines.

Break the boot loader for a very wide variety of systems rather than  
spend EVEN A SINGLE HOUR trying to diagnose the boot problem?

Ken, your decision here would make sense if ANY diagnosis had been
attempted.  This could be a trivial problem.  It could be solved with
5 minutes of actually looking at it.  What happened here is that you
proceeded WITHOUT EVEN TRYING.

> So you're saying John and Xin Li's responses (Xin Li's questions still
> un-answered) to you show a complete lack to even consider  
> investigating
> it?

No actual diagnosis was done.  I'm sorry, but if I pull my car up to
my mechanic's garage and he makes a diagnosis of "no idea what's
wrong" without even popping the hood, yeah, that counts as "didn't
even consider investigating".

Worse yet, I would happily have done all of the grunt work for the  
investigation.  But I'm not going to start by reading the source tree  
and making guesses where to look.  If someone had given me some useful  
tests to do, I would have done them.

> I know from past email threads your preference is for 6.X right now

Not my preference, my ability to justify the evaluation and testing  
costs based on the support available for a given release.  7.0 doesn't  
work on this hardware at all.  No, I haven't tested 7.1 because 6.4  
was the easier testing target and I had thought that the security team  
was working on fixing the support model.

So now we have the brilliant strategy of a long-term-support -RELEASE
that we will never be able to use.  The same stupid stunt that gave
us 6.1, which was unusable, and 6.2, which worked great but expired
at the same time as 6.1.  And so forth.  6.5 will likely be
short-term support again, but it will be the first release we can
consider for deployment.

> but as a test point if you aren't totally fried over this whole  
> thing it
> would still be useful to know for sure if the issue exists with 7.1  
> test
> builds.  If yes it eliminates a variety of possibilities and helps  
> focus
> on the exact problem.

I'm not burnt, but testing 7.1 has no meaningful relevance to my day  
job until we have a reasonable and working support mechanism.

And given that I really pulled out all the stops to make sure we had
hardware for testing 6.4 (I went and bought a whole stack of systems
*JUST FOR THIS*), filed PRs, and followed up, and couldn't get much
more than an "it sounds like this" kind of response... seriously,
would you invest a lot of time testing a very unstable release under
those conditions?  I mean jesus, 6.4 is supposed to be truly stable,
and yet you're willing to ship it not working, with dozens of nearly
identical reports of the same symptoms for both 6.4 and 7.1?

Think seriously about what happened here, and about how exactly I'm
supposed to convince any executive of the logic of testing 7.1 when
we're stuck on 6.3 until/if 6.5, which will be screwed for support.
I mean, seriously?

The problem, BTW, is *EXACTLY* why I proposed the revisions to the
support policy that I did.  Now you're stuck supporting 6.4 for two
years, and you won't want to release 6.5 because you'd end up
supporting three 6.x releases at the same time.  Which would suck.
Which is exactly what my proposed change to the policy would have
fixed.

FreeBSD has usually been a solid product in its more stable releases.
It's really unfortunate that release management is so willing to
ignore the evidence, which leads to major releases with serious
flaws, and on top of that seems to take delight in marking the
known-flawed releases as the long-support releases.  Brilliance.
Just plain brilliant, top to bottom.

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source  
and other randomness


