Date:      Wed, 13 Dec 2017 10:48:40 -0700
From:      Gary Aitken <freebsd@dreamchaser.org>
To:        Polytropon <freebsd@edvax.de>
Cc:        Adam Vande More <amvandemore@gmail.com>, FreeBSD Questions <freebsd-questions@freebsd.org>
Subject:   Re: Subject: Thunderbird causing system crash, need guidance
Message-ID:  <2d7c5545-d87d-6733-f85e-a53921afa67a@dreamchaser.org>
In-Reply-To: <20171213133627.49a5e53b.freebsd@edvax.de>
References:  <201712110045.vBB0jCTQ078476@nightmare.dreamchaser.org> <CA%2BtpaK0sG31TckxL8orNmAD0ZXSz7rJzEotjsCEtASw9u2COZg@mail.gmail.com> <38e2ef70-fa1b-25bf-4447-752006418d0a@dreamchaser.org> <20171211135803.d1aff6c8.freebsd@edvax.de> <5fbcd05c-ce12-b1a4-a9e9-79276dad7183@dreamchaser.org> <20171212200126.3ddf75e5.freebsd@edvax.de> <603b487e-d1b7-eb98-6bcd-f2c2c6d3b843@dreamchaser.org> <20171213133627.49a5e53b.freebsd@edvax.de>

On 12/13/17 05:36, Polytropon wrote:
> On Tue, 12 Dec 2017 22:06:35 -0700, Gary Aitken wrote:
>> On 12/12/17 12:01, Polytropon wrote:
>>> On Tue, 12 Dec 2017 11:30:26 -0700, Gary Aitken wrote:
>>>> On 12/11/17 05:58, Polytropon wrote:
>>>>> On Sun, 10 Dec 2017 21:56:16 -0700, Gary Aitken wrote:
>>>>>> On 12/10/17 19:02, Adam Vande More wrote:
>>>>>>> On Sun, Dec 10, 2017 at 6:45 PM, Gary Aitken wrote:
>>>> <snip>
>>>>>> However, I'm confused. Upon reboot, the system checks to see if
>>>>>> file systems were properly dismounted and is supposed to do an
>>>>>> fsck.  Since those don't show up in messages, I can't verify
>>>>>> this, but I'm pretty certain it must have thought it was clean,
>>>>>> which it wasn't.  (One reason I'm pretty certain is the time
>>>>>> involved when run manually as you suggested).
>>>>>
>>>>> This is the primary reason for setting
>>>>>
>>>>> background_fsck="NO"
>>>>
>>>> Already had that set for just that reason.
>>>>
>>>>> in /etc/rc.conf - if you can afford a little downtime. The
>>>>> background fsck doesn't have all the repair capabilities a forced
>>>>> foreground check has, so it _might_ leave the file system in an
>>>>> inconsistent state, and the system runs with that unclean
>>>>> partition.
>>>>>
>>>>>> The file system in question was mounted below "/". Does the
>>>>>> system only auto-check file systems mounted at "/"?
>>>>>
>>>>> Yes, / is the first file system it checks. The two last fields in
>>>>> /etc/fstab control what fsck will check, and /etc/rc.conf allows
>>>>> additional flags for those automatic checks.
>>>>
>>>> The ordering part I understand; what I don't understand is why it
>>>> (as I recall) rebooted successfully with no warnings in spite of
>>>> the background_fsck="NO" being set and when one of the disks
>>>> apparently didn't fsck properly.  I thought it should have halted
>>>> in single-user mode and waited for me to do a full fsck manually.
>>>> Unfortunately, the fsck output is not printed to the log, and I
>>>> logged in as root on the vt0 device, so it had scrolled off by the
>>>> time I went to look for it.  A good reason never to log into the
>>>> vt0 device.  Is there any way to get the "transient" boot-time fsck
>>>> and other messages recorded in the log?
>>>
>>> There is an easy explanation:
>>>
>>> The foreground fsck at boot time can only handle a subset of damages.
>>> In some cases, you are required to perform a second run of fsck in
>>> order to fix problems. This is where a forced full fsck is very
>>> useful (usually in single-user mode).
>>>
>>> You can specify additional flags for boot-time fsck via /etc/rc.conf,
>>> which are:
>>>
>>> fsck_y_enable="NO"      # Set to YES to do fsck -y if the initial preen fails.
>>> fsck_y_flags=""         # Additional flags for fsck -y
>>> background_fsck="YES"   # Attempt to run fsck in the background where possible.
>>> background_fsck_delay="60" # Time to wait (seconds) before starting the fsck.
>>>
>>> For example, fsck_y_flags="-f" would be such an addition. As you can
>>> see, an initial preen ("limited fsck") can fail, and the filesystem
>>> might be in an inconsistent state. This is probably what you've been
>>> experiencing.
>>>
>>> See "man fsck" for details. :-)
>>
>> My language skills must be degenerating along with everything else... :-(
> 
> My language skills aren't better either. ;-)
> 
> 
> 
>> I have already read and reread the fsck man page, but thanks for the
>> fsck_y_flags example.  I'm fairly sure a normal boot fsck should
>> not have succeeded.
> 
> I would have thought the same, but then I decided to consult
> the authoritative source: the source. If I read everything
> correctly (and there is sufficient doubt I do!), fsck will
> exit with 0 in case a re-run is required. This requirement
> is indicated by a text message ("please re-run fsck"), but
> only the return code matters to /etc/rc's "next steps".
> 
> So in /usr/src/sbin/fsck_ffs/main.c, we find a function
> called checkfilesys() returning int, but this is discarded
> with (void); the main() function returns an int called ret,
> it is initialized 0 and only set = 2 in case of "go to
> single user mode", declared in fsck.h, set in fsutil.c
> by catchquit(), which seems to be a signal handler...
> 
> So my assumption could be correct that fsck "false-positively"
> returns 0, boot continues as normal (with mount -w), but the
> file system is still in an inconsistent state...
> 
> 
> 
>> According to the handbook, 12.2.4, if an fsck fails it should drop into
>> single user.
> 
> Definitely. This usually happens in case of severe errors where
> a repair attempt could do more damage.
> 
> 
> 
>> However, when I manually
>> unmounted the file system, ran "fsck -f" (which corrected numerous errors),
>> then rebooted, everything was (not surprisingly) once again functioning
>> properly.
> 
> That matches my assumption. As soon as the file system is in a
> consistent state again, things work as intended.
> 
> 
> 
>> So I'm looking for, if possible:
>>
>> 1. An explanation for the above behavior, which seems inconsistent with
>> the documented and expected behavior.  The only fsck flag set in rc.conf
>> is 'background_fsck="NO"'.  Is there some state a disk (or something else,
>> such as a normal shutdown flag), can (however theoretically) be in, where
>> it is possible to have a corrupt disk that won't pass normal boot time
>> fsck in preen mode but will not be checked in the first place, even while
>> another disk, the one containing all the system files, is checked?
> 
> I think we have at last an entry point for explanation now. :-)
> 
> 
> 
>> 2. A way to get the output of boot-time fsck commands recorded in the
>> system log, so one can after the fact of a reboot check to see what the
>> heck went on in terms of the fsck sequence?
> 
> This is usually the text mode console where you can press the
> Scroll Lock key and scroll up to view the message that is still
> in the text scroll buffer. As far as I know, /var/log/messages
> does not record fsck status messages.
> 
> 
> 
>> On 12/12/17 11:54, Adam Vande More wrote:
>>> On Tue, Dec 12, 2017 at 12:30 PM, Gary Aitken
>>> <freebsd@dreamchaser.org <mailto:freebsd@dreamchaser.org>> wrote:
>>>
>>>> The ordering part I understand; what I don't understand is why it (as
>>>> I recall) rebooted successfully with no warnings in spite of the
>>>> background_fsck="NO" being set and when one of the disks apparently
>>>> didn't fsck properly.  I thought it should have halted in
>>>> single-user mode and waited for me to do a full fsck manually.
>>>
>>> That happens if the preen fails.  See the man page I pointed you to.
>>> There are cases where it can miss things.
>>
>> Are you saying if the preen fails, it mounts the file system normally
>> and continues to boot into multi-user?  According to the man page for
>> fsck, a failure in preen mode will exit with failure, and the rc.d/fsck
>> script (by my incompetent reading) will do stop_boot in that case.
> 
> See above: I think it really is the case...

Thank you for the extra sleuthing work; at least maybe it now makes sense.
I owe you a beer...

What version of source are you looking at?  I'm on 10.3, and mine says:

int
main(int argc, char *argv[])
{
...
         while (argc > 0) {
                 if (checkfilesys(*argv) == ERESTART)
                         continue;
                 argc--;
                 argv++;
         }

         if (returntosingle)
                 ret = 2;
         exit(ret);

It looks like in this version the return value of checkfilesys is checked,
but generally ignored since ERESTART is -1 and most returns are >= 0.
If a restart is required it tries to run again, which I believe would
still fail in this case; looks to me like it would retry 10 times and then
exit 0, so not good there either.  returntosingle appears to only be set
from catchquit, which is only called if the -C flag is set.  So overall the
same behavior you surmise.
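
To make the concern concrete, here is a minimal sketch of the control flow
as I read it. It is not the fsck_ffs source: sketch_main_loop,
always_fails, SKETCH_ERESTART and the return value 8 are hypothetical
stand-ins, and the retry limit is left out. It only shows that when the
per-filesystem result is compared against the restart value and otherwise
dropped, a failing check still yields an exit status of 0.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the ERESTART value discussed above (assumed to be -1). */
#define SKETCH_ERESTART (-1)

/* Hypothetical checker type: returns a status for one filesystem. */
typedef int (*sketch_checker)(const char *dev);

/* Hypothetical checker that always reports a problem (arbitrary nonzero
 * value), but never asks for a restart. */
int
always_fails(const char *dev)
{
	(void)dev;
	return 8;
}

/*
 * Same shape as the quoted loop: the checker's result is only compared
 * against the restart value; any other result is discarded.  The value
 * returned here plays the role of the process exit status.
 */
int
sketch_main_loop(int argc, const char **argv, int returntosingle,
    sketch_checker check)
{
	int ret = 0;

	assert(argc >= 0 && check != NULL);
	while (argc > 0) {
		if (check(*argv) == SKETCH_ERESTART)
			continue;	/* re-check the same filesystem */
		argc--;
		argv++;
	}
	if (returntosingle)
		ret = 2;	/* only the signal-handler flag changes ret */
	return ret;
}
```

Calling sketch_main_loop(1, devs, 0, always_fails) with a one-element devs
array returns 0 even though the checker reported a problem; only a nonzero
returntosingle changes the result to 2.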

I presume I should file a PR on this, unless you want to do it.

Thanks again,

Gary


