Date:      Wed, 18 Mar 2015 19:05:38 -0800
From:      "CK" <nibbana@gmx.us>
To:        <freebsd-questions@freebsd.org>
Subject:   Re: thrashing + lost files
Message-ID:  <0MTkBS-1YyrZe40hM-00QVB7@mail.gmx.com>

> > > > The result is the loss of many critical files from a hard drive, as if
> > > > a "rm *" was done in the home directory.  This occurs after the
> > > > thrashing when Xwindow is accidentally shut down with Opera open with
> > > > many javascript page tabs, eg, being a memory pig - consuming 1/2 of
> > > > RAM (256M), which after dumping core, writes a large amount of data
> > > > (crashlog) even after Xwindow is down:
> > > >
> > > > pid 1118 (opera), uid 1001: exited on signal 11 (core dumped)
> > >
> > > I thought Opera would simply write a core dump, well, still several 100s
> > > of MB though...
> >
> > Interestingly, the core dump was deleted out of the home directory. I
> > caught a quick glimpse of it doing "ls" before it was deleted. As I said,
> > it was exactly like "rm *".  Dot files were left intact.
>
> Oh, that's surprising! I also had that experience once - home directory
> empty (!) _except_ dot files (and other directories), just like "rm *" had
> been issued... very strange...

Yes, that is interesting.  It does not seem like a "coincidence".

> > At first, I thought it was a bug with journaling/soft-updates, so I
> > disabled those things with tunefs (to the best of my memory).  But now it
> > has happened again.
>
> I can't imagine it has to do with that. Massive file loss can appear when a
> directory inode has been damaged. Then fsck will remove the directory
> altogether. But it's possible to rescue the files _content_, as those are
> written with their (orphan) inode number to lost+found/. So their names are
> lost, but their content will be kept.

I turned off journaling and soft-updates because the first time this problem
occurred, it deleted the files in my home directory, as well as recently
created user-owned files in /home/tmp and in subdirectories of /home/user, so
I thought maybe they weren't being flushed out to the disk, e.g. getting stuck
in some journaling/soft-updates buffer.
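
For reference, a rough sketch of how I would look through lost+found after an
fsck run. It assumes the filesystem is mounted on /home (so the directory
would be /home/lost+found) and that the orphaned files are mostly plain text.
list_orphans is just a name I made up; it prints each file's name, its size in
bytes, and its first line, so a file can be recognized without its original
name:

```sh
#!/bin/sh
# list_orphans DIR  (hypothetical helper, not part of any FreeBSD tool)
# For every regular file in DIR (e.g. a lost+found directory), print
# "name<TAB>size-in-bytes<TAB>first line" so the content can be identified.
list_orphans() {
    dir=$1
    for f in "$dir"/*; do
        # Skip directories and the unexpanded glob of an empty directory.
        [ -f "$f" ] || continue
        size=$(wc -c < "$f" | tr -d ' ')
        first=$(head -n 1 "$f")
        printf '%s\t%s\t%s\n' "${f##*/}" "$size" "$first"
    done
}
```

Usage would be something like "list_orphans /home/lost+found" as root. For
binary files the first-line preview is of little use, and something like
file(1) would probably be a better way to guess the type.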

> > The drive was being written to for about 1 minute by the Opera
> > crashlog/coredump.  About 45 seconds after Xwindow was already down.
>
> Such kind of crash indicates a significant problem. Are you
> sure the drives are fully intact? Check with "smartctl -a" just
> to be sure. And even if it sounds stupid: check the cables.

smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar WDxxxAB
Device Model:     WDC WD400AB-22CDB0
Serial Number:    WD-WMA9T1222658
Firmware Version: 22.04A22
User Capacity:    40,020,664,320 bytes [40.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-5 (minor revision not indicated)
Local Time is:    Wed Mar 18 17:40:59 2015 AKDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:(0x82) Offline data collection activity
                                      was completed without error.
                                      Auto Offline Data Collection: Enabled.
Self-test execution status:    (   0) The previous self-test routine completed
                                      without error or no self-test has ever
                                      been run.
Total time to complete Offline
data collection:               (2376) seconds.
Offline data collection
capabilities:                  (0x3b) SMART execute Offline immediate.
                                      Auto Offline data collection on/off
                                      support.
                                      Suspend Offline collection upon new
                                      command.
                                      Offline surface scan supported.
                                      Self-test supported.
                                      Conveyance Self-test supported.
                                      No Selective Self-test supported.
SMART capabilities:          (0x0003) Saves SMART data before entering
                                      power-saving mode.
                                      Supports SMART auto save timer.
Error logging capability:      (0x01) Error logging supported.
                                      No General Purpose Logging support.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  42) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   102   099   021    Pre-fail  Always       -       3975
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       1
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12324
 10 Spin_Retry_Count        0x0013   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   253   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported


smartctl 6.2 2014-02-18 r3874 [FreeBSD 9.2-RELEASE i386] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar AC
Device Model:     WDC AC24300L
Serial Number:    WD-WT4111658721
Firmware Version: 14.10R11
User Capacity:    4,311,982,080 bytes [4.31 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA/ATAPI-4 (minor revision not indicated)
Local Time is:    Wed Mar 18 17:41:00 2015 AKDT
SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 85-87 don't show if
                  SMART is enabled.
                  Checking to be sure by trying SMART RETURN STATUS command.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:(0x00) Offline data collection activity
                                      was never started.
                                      Auto Offline Data Collection: Disabled.
Total time to complete Offline
data collection:               (1280) seconds.
Offline data collection
capabilities:                  (0x03) SMART execute Offline immediate.
                                      Auto Offline data collection on/off
                                      support.
                                      Suspend Offline collection upon new
                                      command.
                                      No Offline surface scan supported.
                                      No Self-test supported.
                                      No Conveyance Self-test supported.
                                      No Selective Self-test supported.
SMART capabilities:          (0x0002) Does not save SMART data before
                                      entering power-saving mode.
                                      Supports SMART auto save timer.
Error logging capability:      (0x00) Error logging NOT supported.
                                      No General Purpose Logging support.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   099   099   040    Old_age   Always       -       1545
  5 Reallocated_Sector_Ct   0x0013   200   200   001    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       51375
200 Multi_Zone_Error_Rate   0x0009   100   253   051    Pre-fail  Offline      -       0

SMART Error Log not supported
SMART Self-test Log not supported
Selective Self-tests/Logging not supported
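
Going through those tables by eye is error-prone, so here is a small sketch
for pulling out the attributes I would look at first. The choice of IDs (5,
196, 197, 198, 199) is my own, smart_nonzero is a made-up name, and it assumes
the usual 10-column attribute rows with a plain number in the last column:

```sh
#!/bin/sh
# smart_nonzero  (hypothetical helper)
# Reads smartctl attribute rows on stdin and prints NAME=RAW_VALUE for a few
# selected attribute IDs whose raw value is greater than zero.
# Assumption: each row has at least 10 whitespace-separated fields and the
# raw value is a plain integer in the last field.
smart_nonzero() {
    awk '$1 ~ /^(5|196|197|198|199)$/ && NF >= 10 && $NF + 0 > 0 {
        print $2 "=" $NF
    }'
}
```

It would be used as "smartctl -A /dev/ada0 | smart_nonzero". Fed the two
reports above, it prints Reallocated_Sector_Ct=1 and Reallocated_Event_Count=1
for the WD400AB, and UDMA_CRC_Error_Count=51375 for the AC24300L. As far as I
know, a nonzero UDMA CRC count is commonly associated with cabling or
controller trouble rather than with the platters, but the raw value is a
lifetime counter, so it says nothing about when the errors happened.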

> > > > FSCK RESULTS:
> > > > ------------
> > > > Of interest, is that each time fsck was run, more files were lost!
> > > >
> > > > # fsck -t ufs -p /dev/ada0p6.eli
> > > > /dev/ada0p6.eli: NO WRITE ACCESS
> > > > /dev/ada0p6.eli: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
> > >
> > > This message should alert you. Don't just preen the disk.
> > > In this mode, only a subset of errors will be detected,
> > > and not all of them can be corrected. You should actually
> > > perform
> > >
> > > 	# fsck -t ufs -f /dev/ada0p6.eli
> >
> > Thanks, I didn't think of using the -f option.
>
> The -f option *f*orces a *f*ull check. You can even run the command two
> times. The 2nd run should then reveal "no errors", the file system is kept
> marked clean.
>
> > After reading a paper by Marshall McKusick on fsck, it was my
> > understanding that "preen mode" only fixed errors that could be fixed with
> > 100% accuracy.
>
> I also read that famous paper to gain a better understanding of how UFS
> works and what fsck does. Data loss teaches you a lot of fundamental
> knowledge. :-)
>
> > > There are several errors shown:
> > >
> > > > INCORRECT BLOCK COUNT I=2327435 (8 should be 0)
> > > > [...]
> > > > UNREF FILE I=2327428  OWNER=abc MODE=100600
> > > > [...]
> > > > UNREF FILE I=2327439  OWNER=abc MODE=100600
> > > > [...]
> > > > FREE BLK COUNT(S) WRONG IN SUPERBLK
> > > > [...]
> > > > SUMMARY INFORMATION BAD
> > > > [...]
> > > > BLK(S) MISSING IN BIT MAPS
> >
> > I lost about 8 files, a lot of legal research/work, in case that is what
> > the (8 should be 0) is citing.
>
> The question is: Is the data still there? Just because the file is gone -
> the inode entry -, this does not have to imply that the data isn't still on
> the disk. Everything is on the disk as long as it hasn't been overwritten.
>
> When I found out that one of my files (which I worked a whole day on) was
> gone (0 bytes) after a freeze + reboot + fsck, I immediately forced a r/o
> mount on the /home partition and grepped for some text fragment I could
> remember. I found the block where it was in, dumped that block, and trimmed
> it to become the original file again. The data wasn't lost, it was fully
> intact. But not referenced (!) anymore.
>
> > > Unmount the partition, let fsck do its job. :-)
> >
> > fsck -t ufs -f /dev/ada0p6.eli only reported that everything was clean.
>
> So at _this_ point in time the file system was consistent. Do you maybe have
> background_fsck="YES" in /etc/rc.conf? Set it to ="NO". Always perform file
> system checks _prior_ to accessing a file system r/o or even r/w. This may
> take some time, but you have to find a relation of time vs. data that
> reflects your priorities. :-)

No, I do not have background fsck enabled, and I never rebooted. I always run
fsck before mounting a file system. I have my own self-contained /etc/rc with
just a few lines to bring the system up and nothing more. It is very
minimalist, essentially:

    /sbin/geli onetime -d -e 3des -s 4096   /dev/ada0p3
    /sbin/swapon                            /dev/ada0p3.eli
    /sbin/fsck  -y
    /sbin/mount -a
    /bin/rm -rf /var/run/* /var/spool/lock/*
     umask 0077 # rw- --- ---
    /bin/hostname localhost...
    /sbin/ldconfig /usr/lib /usr/local/lib /usr/X11/lib
    /sbin/ifconfig lo0  127.0.0.1
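
Regarding the grep approach described above, this is a minimal sketch of how I
understand it, not a tested recovery procedure. It assumes the partition is
unmounted or mounted read-only, that grep supports -a, -b and -o, and a block
size of 4096 bytes (a guess; the real UFS block/fragment size may differ).
find_fragment and dump_block are names I invented:

```sh
#!/bin/sh
# find_fragment IMAGE TEXT [BLOCKSIZE]  (hypothetical helper)
# Search a raw device or image for the fixed string TEXT and print the
# numbers of the blocks in which a match starts, one per line.
find_fragment() {
    img=$1; text=$2; bs=${3:-4096}
    # -a: treat binary data as text, -b: print byte offset,
    # -o: report each match itself, -F: fixed string, not a regex.
    grep -a -b -o -F -- "$text" "$img" |
    while IFS=: read -r off rest; do
        echo $((off / bs))
    done | sort -n -u
}

# dump_block IMAGE BLOCK [BLOCKSIZE]  (hypothetical helper)
# Write a single block of IMAGE to stdout for inspection or trimming.
dump_block() {
    dd if="$1" bs="${3:-4096}" skip="$2" count=1 2>/dev/null
}
```

For example, "find_fragment /dev/ada0p6.eli 'some remembered phrase'" to get
block numbers, then "dump_block /dev/ada0p6.eli N" redirected to a file on a
different disk. On a large partition this is likely to be slow, and a file
spread over several blocks would need each of them dumped and joined by hand.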

> > > Copy files to a different disk (or maybe even external storage, such as
> > > USB sticks) temporarily, just to be sure.
> >
> > Yes, I do this of course, with a USB SDRAM device. But I still lose days
> > of work, because I can't back up every minute.
>
> You could automate this - but on the other hand, when a crash appears, this
> might also affect the backup process and its results.
>
> > This should not happen at all.
>
> Yes, it sounds too unusual.
>
> > I have used FreeBSD for 20 years, since 1995, and I never had problems
> > like this before - and I have the same hardware since 2003, which I ran
> > FreeBSD 4.11 on until recently.  But only now does this problem occur.
> > Certainly, there is a bug somewhere.  My gut feeling is that something is
> > allowing Opera to do things it should not do, or something in the
> > filesystem layers is breaking under the stress of Opera's crash dumps.
>
> I'd think it's somewhere filesystem-related. I have tortured Opera with
> approx. 100 tabs open with "Flash" content and JS stuff in it. No crash, it
> just started swapping heavily. Sometimes I can get Opera to crash, but it
> successfully "resumes". However, when my system freezes (due to a faulty
> GPU) and Opera has been running, sometimes the bookmarks are lost. That's
> why I tend to copy them to ~/ from time to time, just to be sure. In few
> cases, the Opera settings also are reset. A copy of ~/.opera is helpful.
> Maybe it's just program design that got worse, like first reading a file
> into memory, then keeping that file open, maybe modify it, or not, and upon
> program exit, write memory content back to the file. When the normal program
> termination is not reached, a damaged or empty file is left behind. I have
> no idea what makes people write software that way, but it seems to be
> "modern" now...
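
On automating the copies mentioned above, something along these lines is what
I have in mind. It is only a sketch: backup_recent is a made-up name, the
paths in the usage line are placeholders, and it relies on find supporting
-maxdepth and -mmin:

```sh
#!/bin/sh
# backup_recent SRC DEST MINUTES  (hypothetical helper)
# Copy regular files directly inside SRC (subdirectories are not descended
# into) that were modified within the last MINUTES minutes into DEST,
# preserving timestamps and permissions.
backup_recent() {
    src=$1; dest=$2; minutes=$3
    mkdir -p "$dest" || return 1
    find "$src" -maxdepth 1 -type f -mmin "-$minutes" \
        -exec cp -p {} "$dest"/ \;
}
```

Run from cron every few minutes, e.g. "backup_recent /home/user /mnt/usb/copy
10", it would limit the loss to a few minutes of work. As noted above, a
crash in the middle of a copy could still leave a damaged file on the backup
side, so it does not replace the manual copies.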

We're in the 2nd half of the "peak usury-based-civilization" bell curve :) I
was 30 when I started using FreeBSD, now 50. Of all things in life, FreeBSD
was likely the greatest pleasure and best experience. I could easily enjoy
doing much more with it for 100s of years, and it's been one of the few social
circles where I've met people that I admire and respect for the development of
their faculties and all-around good+honest+intelligent nature.

> Polytropon
> Magdeburg, Germany
> Happy FreeBSD user since 4.0
> Andra moi ennepe, Mousa, ...



