Date:      Sat, 11 May 2002 00:55:47 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Gordon Tetlow <gordont@gnf.org>
Cc:        hackers@freebsd.org
Subject:   Re: nextboot loader diff
Message-ID:  <3CDCCE83.66AEF4BB@mindspring.com>
References:  <Pine.LNX.4.44.0205101634570.27477-100000@smtp.gnf.org>

Gordon Tetlow wrote:

[ ... ]

You *did* ask for comments...


> > There should be a list, so that in a brown-out or whatever, you
> > don't end up toggling back to the previous version accidentally.
> 
> This is not something that is meant for you to massage which root
> partition you are going to boot up off of.

I don't understand what it does, then.  The original Whistle code
was intended to attempt to boot 3 times from one partition, and
then 3 times from another.

If a boot was successful, then in the last rc file before the getty's
were started, it reset the list to 3 times the current root and 3
times the alternate root.

That way, on each success, the counter was reset, so in general, a
given root was sticky.

When the failure occurred, then the alternate root was the one whose
rc files ran, and it became the sticky one.

Worst case, you could power cycle a box three times quickly to force
a switch back to an older version.
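
A minimal sketch of the counting scheme described above, in Python
for illustration only (the real logic lived in the boot code and an
rc file).  The names reset_list, loader_pick and boot_until_success,
the root labels, and the behaviour when the list runs out are all
assumptions, not the Whistle implementation.

```python
from collections import deque

ATTEMPTS = 3  # tries per root, as described above


def reset_list(current, alternate, attempts=ATTEMPTS):
    """Done by the last rc file on a successful boot (hypothetical helper)."""
    return deque([current] * attempts + [alternate] * attempts)


def loader_pick(boot_list):
    """Done by the boot code: consume one entry *before* booting it, so a
    reset or watchdog trip still counts as a used attempt."""
    return boot_list.popleft() if boot_list else None


def boot_until_success(boot_list, reaches_rc):
    """Simulate resets until some root gets as far as the rc file.

    reaches_rc maps each of the two root names to True/False.
    Returns (root_that_booted, new_list); root is None if the list ran
    out, which this sketch simply reports (an assumption).
    """
    while True:
        root = loader_pick(boot_list)
        if root is None:
            return None, boot_list
        if reaches_rc[root]:
            alternate = next(r for r in reaches_rc if r != root)
            return root, reset_list(root, alternate)
```

Starting from reset_list("root1", "root2") with root1 failing before
its rc file runs, three attempts are used up, root2 boots, and the
list is rewritten with root2 first, so root2 becomes the sticky one.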

The general failure case is not an indefinite hang, but a reset before
the rc file runs.  This is particularly true when you have a hardware
watchdog, where the first thing that happens is the watchdog is set.

Note that images are tested before they are shipped, so the worst
case failure is "out of memory" or some other installation failure
related problem, and not a kernel problem, anyway.

I've personally had to solve this same problem several times now.


> > You should only ever rewrite the contents of a single file, and
> > it shouldn't be an important file.
> 
> Yes, that's exactly what my patch does.

I don't understand the "YES"/"NO" thing, then.  There is a one byte
difference in the file length, which I don't think can be properly
accounted for, if you do the "YES"/"NO" thing.


> > The existence/non-existence of the single file should be enough
> > to trigger/suppress the nextboot behaviour.
> 
> I can't unlink files in the loader, so the presence of such a file
> wouldn't help.

The file is the nextboot.conf file.  And unlinking it is not something
which you want to do, actually.  I think we are misunderstanding each
other's intent here.


> > Don't assume that the nextboot file will be on the same disk and/or
> > partition as the boot and other config file code.
> 
> Well, I'm assuming it's on the root partition. It would be kinda silly for
> it to be anywhere else.

Not really.  Consider that if I switch root partitions, then, by
definition, I switch nextboot files.

Basically, the InterJet was laid out:

	boot code (including nextboot list)
	/ #1		<- version X of the system (read only)
	/ #2		<- version Y of the system (read only)
	swap
	/var		<- log files and /tmp
	/data		<- user data (config, user files, etc.)

The fstab's on #1 and #2 were opposite, so that you could mount and
overwrite the contents with a new release of the software.

An upgrade was:

	mount opposite "root"
	unpack new system image onto opposite root
	set up opposite root fstab
	sync
	unmount
	nextboot "opposite opposite opposite this this this"
	reboot
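
The steps above can be sketched as a dry-run plan, in Python for
illustration only.  Nothing here is executed; the device names, the
mount point, and the use of tar for unpacking are placeholders, and
the nextboot argument form simply follows the list shown above, not
any particular nextboot(8).

```python
def nextboot_list(this_root, opposite_root, attempts=3):
    """Boot list for an upgrade: try the new (opposite) root first."""
    return [opposite_root] * attempts + [this_root] * attempts


def upgrade_plan(this_root, opposite_root, image, mnt="/mnt"):
    """Return the upgrade steps, in order, as printable strings.

    Hypothetical helper: every command string is a placeholder for
    whatever tooling the real system used.
    """
    return [
        f"mount {opposite_root} {mnt}",
        f"tar -xpf {image} -C {mnt}",
        f"write {mnt}/etc/fstab with {opposite_root} as /",
        "sync",
        f"umount {mnt}",
        "nextboot " + " ".join(nextboot_list(this_root, opposite_root)),
        "reboot",
    ]
```

The point of the ordering is that the boot list is only rewritten
after the new image is fully written and unmounted, so a failure
partway through leaves the running root as the one that boots.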

Each revision had data management upgrade/downgrade scripts; these
were written to /data, so that opposite versions could downgrade.


> > Together, these things will allow the new code to solve the same
> > problem that the old code solved on the InterJet.
> 
> I've never heard nor seen the old code. I don't know what it did, and I
> don't particularly care. I did this because I thought the way Wes Peters
> did his implementation was rather hackish (not saying mine is any better
> =) and suboptimal if the machine doesn't make it to multi-user. Please
> refer to the commit logs from earlier this month if you don't know of the
> commit I'm referring to.

I do.  He committed some, but not all, of the code that Jon Mini
and James wrote (Jon says some of it was based on code I wrote).
The design I did at ClickArray was based on the Whistle design
from when I worked at Whistle with Julian and Archie.

The ClickArray code, if it was intended to solve the same problem as
the code it was supposedly derived from, is for solving the remote
upgrade problem: there is no local removable media that can be used
to recover from a catastrophic failure, so the only recovery from
such a failure is a fallback to a working previous revision, per the
InterJet.

The code you are talking about seems limited to replacing only the
kernel.  Frankly, that's recoverable via the serial console, if
you put the "-p" in the right file in /.

This isn't really sufficient for any embedded system that needs to
get at netstat, ps, or other data which involves examination of
kernel structures, which may change between kernel versions.  You
pretty much have to have two system images to solve that problem,
or you'll find yourself incredibly screwed, when the web UI, the
CLI, SNMP, and the front panel LCD all start reporting random bogus
data.  8-(.

I'm not trying to dump on your code; I'm just saying that it's
not solving the problem that the original code was added to be
able to solve, and that the original nextboot itself was intended
to resolve.

You asked for comments... those are mine.

-- Terry
