Date:      Sat, 29 Jul 2017 14:13:30 -0700
From:      Kirk McKusick <mckusick@mckusick.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Andreas Longwitz <longwitz@incore.de>, freebsd-fs@freebsd.org
Subject:   Re: ufs snapshot is sometimes corrupt on gjourneled partition
Message-ID:  <201707292113.v6TLDUaw093863@chez.mckusick.com>
In-Reply-To: <20170729152444.GB1700@kib.kiev.ua>

> Date: Sat, 29 Jul 2017 18:24:44 +0300
> From: Konstantin Belousov <kostikbel@gmail.com>
> To: Andreas Longwitz <longwitz@incore.de>
> Cc: Kirk McKusick <mckusick@mckusick.com>, freebsd-fs@freebsd.org
> Subject: Re: ufs snapshot is sometimes corrupt on gjourneled partition
> 
> On Sat, Jul 29, 2017 at 12:38:13PM +0200, Andreas Longwitz wrote:
> 
>> hello Kirk and Kostik,
>> 
>> you are both right.
>> 
>>> So, I really do not understand how you can end up with a buffer with
>>> invalid contents if B_CACHE is set because we are open coding bread()
>>> here and that is the criterion it uses to read. If however, you do
>>> the read when B_CACHE is not set, then as noted you will get the
>>> wrong results and crash. But see also my comments below when B_CACHE
>>> is not set.
>> 
>> The patch
>> 
>> --- ffs_snapshot.c.1st  2016-06-08 17:25:21.000000000 +0200
>> +++ ffs_snapshot.c      2017-07-29 10:51:34.682067000 +0200
>> @@ -1403,7 +1403,7 @@
>>          */
>>         bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
>>         bp->b_blkno = fsbtodb(fs, blkno);
>> -       if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
>> +       if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0 &&
>>             (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
>>                 brelse(bp);
>>                 return (error);
>> 
>> is correct: the buffer always contains the correct data when B_CACHE is
>> set after calling getblk(), so the calls to readblock() are unnecessary
>> in that case. In my first test the crash of the following rm was probably
>> induced by other readblock() calls, but I did not know that at the time.
> 
> Ok.  In fact it is not clear to me why B_DONE or B_DELWRI is tested at all
> there.  We only need to know that the buffer content is valid.

Historically bdwrite() did not set B_CACHE because doing so was not
safe/true for NFS. Consider the scenario where B_CACHE is not set:
when the filesystem allocates a new disk block, it calls getblk()
to get a buffer in which to place it. In the usual case the block
is not already in the cache, so the B_CACHE flag is not set. The
filesystem fills in the new contents and then bdwrite()s the buffer,
which results in B_DELWRI being set. If you later find that buffer
with getblk(), its only valid contents are what is in the buffer,
so reading the block at the location where it will eventually be
written would destroy the only valid copy. Hence the check for
B_DELWRI as well as B_CACHE. The check for B_DONE is just to save
an unneeded read, as B_DONE can be set after a B_DELWRI buffer has
been written (B_DELWRI is replaced with B_DONE). Since it is still
a newly created block that has never been read, B_CACHE is not set,
but the contents are valid and can be used.
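
To make that concrete, here is a minimal sketch (not code from the
tree; the vnode and variable names are made up) of the delayed-write
case described above:

	struct buf *bp;

	/*
	 * The filesystem allocates a new block and gets a buffer for it.
	 * The block has never been read from disk, so B_CACHE is
	 * typically clear.
	 */
	bp = getblk(vp, lbn, fs->fs_bsize, 0, 0, 0);
	/* ... fill bp->b_data with the new contents ... */
	bdwrite(bp);		/* delayed write: sets B_DELWRI */

	/*
	 * A later getblk() on the same block returns that same buffer
	 * with B_DELWRI set (or B_DONE once the delayed write has been
	 * pushed out).  Reading the block from disk at this point would
	 * overwrite the only valid copy, which is what the
	 * B_DONE | B_DELWRI test guarded against.
	 */
	bp = getblk(vp, lbn, fs->fs_bsize, 0, 0, 0);
	if ((bp->b_flags & (B_DONE | B_DELWRI | B_CACHE)) == 0) {
		/* only now is it safe to read the block from disk */
	}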

All that said, NFS got overhauled by alc@ in -r46349 in May 1999
so that it became safe to set B_CACHE in bdwrite(). Since that
change the above checks for B_DONE | B_DELWRI are no longer needed.

Since this is a bug fix / performance enhancement in the snapshot code,
and it has now been reviewed and its correctness verified, I propose
to check in this change:

Index: sys/ufs/ffs/ffs_snapshot.c
===================================================================
--- sys/ufs/ffs/ffs_snapshot.c	(revision 321679)
+++ sys/ufs/ffs/ffs_snapshot.c	(working copy)
@@ -1394,7 +1394,7 @@
 	 */
 	bp = getblk(cancelvp, lbn, fs->fs_bsize, 0, 0, 0);
 	bp->b_blkno = fsbtodb(fs, blkno);
-	if ((bp->b_flags & (B_DONE | B_DELWRI)) == 0 &&
+	if ((bp->b_flags & B_CACHE) == 0 &&
 	    (error = readblock(cancelvp, bp, fragstoblks(fs, blkno)))) {
 		brelse(bp);
 		return (error);

Let me know if you have any concerns about it.

>>> So the way to fix the bug is to read gjournal code and understand why
>>> does it sometime returns wrong data.
>> 
>> Yes, gjournal is broken in its handling of the flush_queue. If we have
>> 10 bio's in the flush_queue:
>>          1 2 3 4 5 6 7 8 9 10
>> and another 10 bio's go to the flush queue before all of the first 10
>> bio's have been removed from it, we should end up with something like
>>          6 7 8 9 10 11 12 13 14 15 16 17 18 19 20,
>> but because of the bug we end up with
>>          6 11 12 13 14 15 16 17 18 19 20 7 8 9 10.
>> So the order of the bio's is damaged in the flush queue (and therefore
>> in the journal and on disk!). This error can be triggered by
>> ffs_snapshot() when a block is read with readblock() and gjournal finds
>> this block in the broken flush queue before it goes on to the correct
>> active queue.
>> 
>> The following patch solved this problem for me:
>> 
>> --- g_journal.c.orig     2017-05-08 14:16:42.000000000 +0200
>> +++ g_journal.c          2017-07-28 21:25:58.891881000 +0200
>> @@ -1261,7 +1261,7 @@
>>         strlcpy(hdr.jrh_magic, GJ_RECORD_HEADER_MAGIC,
>> sizeof(hdr.jrh_magic));
>> 
>>         bioq = &sc->sc_active.jj_queue;
>> -       pbp = sc->sc_flush_queue;
>> +       GJQ_LAST(sc->sc_flush_queue, pbp);
>> 
>>         fbp = g_alloc_bio();
>>         fbp->bio_parent = NULL;
>> 
>> with macro GJQ_LAST defined in g_journal.h:
>> 
>> #define GJQ_LAST(head, bp)      do {                             \
>>   struct bio *_bp;                                               \
>>                                                                  \
>>   if ((head) == NULL) {                                          \
>>        (bp) = (head);                                            \
>>        break;                                                    \
>>   }                                                              \
>>   for (_bp = (head); _bp->bio_next != NULL; _bp = _bp->bio_next);\
>>   (bp) = (_bp);                                                  \
>> } while (0)
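>> 
>> With this change, when the flush queue is empty GJQ_LAST yields NULL,
>> just as the old "pbp = sc->sc_flush_queue" did, and when the queue is
>> non-empty the newly added bio's are inserted after the existing tail,
>> so they keep their arrival order.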
>> 
>> Some more annotations about g_journal.c:
>> 
>> - The patch given in bug 198500 is necessary to run gjournal with a
>> correct and adequate cache size.
>> - The cache size for gjournal can be set via two tunables: cache.limit
>> and cache.divisor. I do not know the best practice when two variables
>> control the same resource. At the moment cache.limit is ignored at boot
>> time, because it is overwritten in g_journal_init().
>> - In g_journal_flush() there is a write-only variable "size".
>> - If one process writes a block synchronously and another process reads
>> the same block just after the write starts, it is possible for the
>> reader to get the new block before the writer has finished its write
>> call. This happens when the write sits in the delayed_queue for a while
>> and the read is satisfied from there, while the write itself must wait
>> until the bio moves to the current queue, where g_io_deliver() is
>> called. If this is not intended, the delayed queue should not be used
>> for reads.
> 
> I recommend that you contact pjd@freebsd.org directly about both the
> gjournal patch and your further observations.

I concur that Pawel should review your proposed change. I'll send
your note along to him requesting a review. Assuming he is in agreement
he will likely check it in himself. Or if he prefers, I'll check it in.
Hopefully he will have some thoughts on your other comments.

Thank you, Andreas, for taking the time to drill down and figure out
why this was failing.

	Kirk McKusick


