Date:      Sun, 22 Feb 2009 13:00:52 +0200
From:      Kostik Belousov <kostikbel@gmail.com>
To:        Carl <k0802647@telus.net>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: UFS2 and/or sparse file bug causing copy process to land in 'D'' state?
Message-ID:  <20090222110052.GH41617@deviant.kiev.zoral.com.ua>
In-Reply-To: <49A10626.8060705@telus.net>
References:  <49A10626.8060705@telus.net>

On Sun, Feb 22, 2009 at 12:00:38AM -0800, Carl wrote:
> I've come across what I'm thinking may be a bug in the context of
> FreeBSD 7.0 with a pair of gmirrored drives and gjournaled partitions
> when copying a large number of files into a file-backed memory device.
>
> The consequence of this problem is that a process enters the 'D' state
> (uninterruptible disk wait) indefinitely, cannot be killed, and the system
> cannot be shut down. The only solution is to cold reboot the system, which
> is a really big problem for remote systems. This is happening to me
> intermittently with the standard tar-tar pipeline form of copying, but
> has happened with the rsync 3.0.4 port as well.
>
> I would appreciate it if some of you would see if you can repeat this
> problem. Here is a sequence of tcsh shell commands which manifest the
> problem (on occasion but not every time), which I will refer to as the
> "truncate sequence" (depends on fully populated /usr/src tree as data set):
>
>      # truncate -s 671088640 target
>      # mdconfig -f target -S 512 -y 255 -x 63 -u 7
>      # bsdlabel -w /dev/md7 auto
>      # newfs -O2 -m 0 -o space /dev/md7a
>      # mount /dev/md7a /media
>      # tar -cvf - -C /usr/src . | tar -xvpof - -C /media
>      # umount /media ; mdconfig -d -u 7 ; rm target
>
> An alternate version has yet to fail for me and involves replacing the
> first line with this one:
>
>      # dd if=/dev/zero of=target bs=1M count=640
>
> I'll call that the "dd sequence". Here is an ordered series of tests I
> just completed:
>
> a) Repeated truncate sequence 7 times - 1st, 5th, and 7th failed.
> b) Repeated dd sequence 7 times - no failures.
> c) Repeated truncate sequence 6 times - no failures.
> d) Used following sequence to ensure all disk caches flushed:
>
>      # dd if=/dev/random of=target bs=1M count=4096
>      # dd if=target of=/dev/null bs=1M
>      # rm target
>
> e) Repeated truncate sequence 4 times - no failures.
> f) Performed orderly reboot.
> g) Repeated truncate sequence 2 times - 2nd failed.
> h) Performed orderly reboot.
> i) Repeated dd sequence 7 times - no failures.
>
> All failures involve the second tar in the pipeline hanging in the 'D'
> state. In each case I do a cold reboot before proceeding with the next test.
>
> It's tempting to speculate that a bug exists in code related to handling
> sparse files specifically, but perhaps it just raises the probability of
> tripping a bug that would eventually manifest in the dd sequence as
> well. OTOH, I don't know how to rule out a physical disk or disk
> firmware problem.
>
> This problem has occurred with different data sets and different sized
> memory disks, but only with the source and destination filesystems being
> UFS2. I have done similar sequences with EXT2 and FAT16 destinations
> with no failures thus far, but the memory disks and data sets were
> smaller so it's conceivable that probability worked against me.
>
> I should note that the drives are Seagate ST31000340AS Barracudas, but
> both drives have been upgraded to firmware version SD1A and are
> therefore supposedly free of the infamous little horror Seagate
> inflicted on so many of us. smartctl tells me that both disks still have
> a raw value of 0 for Reallocated_Sector_Ct and both pass the "short"
> self test.
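
The only difference between the two quoted sequences is how the backing
file is allocated: truncate -s creates a sparse file whose blocks are
allocated on first write, while dd from /dev/zero normally writes every
block up front. A minimal sketch for confirming that on a given
filesystem by comparing apparent size with allocated space. The helper
names and the scratch file name are made up for illustration, and the
size is kept small on purpose:

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original report.

# Apparent size in KiB (the length ls -l would show, divided by 1024).
apparent_kib() {
    echo $(( $(wc -c < "$1") / 1024 ))
}

# Space actually allocated on disk, in KiB.
allocated_kib() {
    du -k "$1" | awk '{ print $1 }'
}

# Succeeds when the file occupies less space than its length implies.
is_sparse() {
    [ "$(allocated_kib "$1")" -lt "$(apparent_kib "$1")" ]
}

scratch=sparse_demo.img           # scratch name is an assumption
truncate -s 8388608 "$scratch"    # 8 MiB, same call shape as the truncate sequence
if is_sparse "$scratch"; then
    echo "sparse: $(allocated_kib "$scratch") of $(apparent_kib "$scratch") KiB allocated"
fi
```

Running the same check on a file written with dd should report it as
fully allocated on UFS2; filesystems that compress or detect zero blocks
may differ. Remove the scratch file afterwards.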

Please see
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
for instructions on how to gather the required information to diagnose
the issue.
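
Before going through the kernel-side procedure, it can help to record
from userland which processes are stuck and on which wait channel. A
rough sketch, assuming the ps(1) output keywords pid, state, wchan and
command; the function name is made up here and is not taken from the
handbook:

```shell
#!/bin/sh
# Hypothetical helper: keep the header line plus every row whose
# second column (process state) starts with D.
# Expects input shaped like: ps -axo pid,state,wchan,command
filter_dstate() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Intended use on the affected machine:
#   ps -axo pid,state,wchan,command | filter_dstate
```

On releases that ship procstat(1), procstat -kk with the PID of a stuck
process prints its kernel stack; whether that tool is available on a 7.0
system is not something this sketch assumes.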
