Date:      Mon, 27 Aug 2018 13:54:30 -0400
From:      David Cross <dcrosstech@gmail.com>
To:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: Weird USB DA behavior (was: Re: weird geli behavior)
Message-ID:  <CAM9edeOfZJkDHc9Djerqmpq+Z81yTtTKrszhs4joqspgz1T7iQ@mail.gmail.com>
In-Reply-To: <CAM9edeP4wTeEpPfpTYBsn1YjOVh2a+wYR-OL1Tv0Vv4LXPNUgw@mail.gmail.com>
References:  <CAM9edePfxANDxXAjgQsZPXzPc3Ezw4Pn+daVcnkaHx1oY+UoDA@mail.gmail.com> <20180825010023.GD45503@funkthat.com> <CD43A15C-74B2-4F29-ADB5-B831A0CD5BF6@gmail.com> <20180825052330.GE45503@funkthat.com> <CAM9edeP4wTeEpPfpTYBsn1YjOVh2a+wYR-OL1Tv0Vv4LXPNUgw@mail.gmail.com>

On Sat, Aug 25, 2018 at 4:22 PM David Cross <dcrosstech@gmail.com> wrote:

> Top-posting because this just took a radical change in direction
>
> Ok, I need to take some of this back; I do still believe that there is
> some weird GELI-ness going on (or perhaps GEOM itself), in terms of
> hangs, but it may be getting tripped by some weird USB-DA behavior I
> noticed while trying to track this down.
>
Ok... I think I found the smoking gun.  Between all of the multiple
layers of GEOM (geli->mirror->intermediate physical->physical) and
something that postgres is doing, it's filling up the 'transient arena'; I
even found a few pause()s in geom_io.c that seem to correspond to what I am
witnessing.

Also there's:
kern.geom.transient_map_soft_failures: 356796
kern.geom.transient_map_hard_failures: 35650
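
To watch these counters while reproducing the problem, here's a quick
poller using sysctlbyname(3).  It's a sketch: I'm assuming the counters
are plain integer sysctls; the width check copes with either a 32- or
64-bit value, and the 32-bit path assumes little-endian (which holds on
this amd64 box).

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Counter names as they appear in the sysctl output above. */
static const char *ctls[] = {
    "kern.geom.transient_map_soft_failures",
    "kern.geom.transient_map_hard_failures",
};

static long long
read_counter(const char *name)
{
    long long v = 0;
    size_t len = sizeof(v);

    /* The kernel copies out as many bytes as the object has and
     * sets len accordingly (4 or 8). */
    if (sysctlbyname(name, &v, &len, NULL, 0) == -1)
        err(1, "%s", name);
    if (len == sizeof(int)) {
        int v32;
        memcpy(&v32, &v, sizeof(v32));  /* little-endian assumption */
        return (v32);
    }
    return (v);
}

int
main(void)
{
    for (;;) {
        for (size_t i = 0; i < sizeof(ctls) / sizeof(ctls[0]); i++)
            printf("%s: %lld\n", ctls[i], read_counter(ctls[i]));
        sleep(1);
    }
}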

My *guess* as to what is going on: there is a MASSIVE spike of IO
operations, triggered by postgres doing 'something' with the data on disk
to close out a transaction.  This causes a spike in mappings in geli,
which cannot pass requests off to mirror because the transient arena is
full.  The system then sleeps for a second to see whether something has
cleared (which it cannot, because nothing can get to disk), and after
enough retries it fails the request, apparently with EDEADLK (errno 11 on
FreeBSD), which would explain the 'error 11' geli was logging.  Eventually
this bleeds down to where the queue can flow again (and that is where I see
the spike in operations and everything clearing up at once).
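
The numbers support this: the soft-to-hard ratio above is almost exactly
10:1, which is what you'd expect if each stuck request is retried a fixed
~10 times before being failed hard.  Here's a self-contained userland toy
that mimics the pattern I think geom_io.c is implementing; the constants
are my assumptions from reading the code, not verbatim kernel values.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define ARENA_PAGES 1024    /* the hard-coded arena size mentioned below */
#define MAX_RETRIES 10      /* assumed fixed retry count */

static long arena_free = ARENA_PAGES;
static long soft_failures, hard_failures;

/* Try to carve pages out of a fixed-size arena; if it's full, pause
 * and retry in the hope that some in-flight IO completes and frees
 * space; give up after MAX_RETRIES. */
static int
transient_map(long pages)
{
    for (int retried = 0; retried < MAX_RETRIES; retried++) {
        if (arena_free >= pages) {
            arena_free -= pages;    /* mapping succeeds */
            return (0);
        }
        soft_failures++;
        usleep(100000);     /* pause() analogue; duration is a guess */
    }
    hard_failures++;
    return (EDEADLK);       /* errno 11 on FreeBSD */
}

int
main(void)
{
    arena_free = 0;     /* arena exhausted, nothing completing below us */
    printf("map -> %d\n", transient_map(5));
    printf("soft=%ld hard=%ld\n", soft_failures, hard_failures);
    return (0);
}

With the arena pinned empty, it prints "map -> 11" after ten soft
failures, matching both the counter ratio and the error 11 on the console.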

The default transient arena seems to be just 1024 pages (4 MiB at 4 KiB
pages) and is hard-coded.  It also looks like there aren't any kernel
messages at the default log level when you hit this (or, if there are,
they seem to get eaten by the fact that nothing can log... because,
well... IO is deadlocked).

I won't be able to test this until this evening, when I can bump this
up... any reason not to set it to 10240 pages (40 MiB) or higher?  It's a
64-bit machine with 72G of RAM.

Longer term, if this is the cause, then given the inherently nested nature
of GEOM I think we need a way to handle this better.

Next up: figure out why random 20480-byte reads (five 4096-byte sectors)
are choking on this machine.  I tried the earlier-mentioned pread test on a
different machine with the EXACT same model external array, and it worked
perfectly, so it's either the machine or the drives... more experimenting
to do!



> I turned debugging ALL the way up on geli (level 3) and noticed the hangs
> always happened when geli handed off a 20480-length read to the layer
> below (in this case mirror)..  I used the debugging output to create a
> dummy program called 'pread' to try to simulate these failures, and I was
> quite successful.  Attached is 'pread.c', which takes on stdin an offset
> and a length to read.  In that version I overrode it to always ask for
> 32k, and that works every time.  If I eliminate that line and let the
> input data control it, on some of the 20480-byte reads (a different set
> each time) it hangs; after the hang the pread returns _0_ (and obviously
> no errno), with no kernel logs.  "pread" is run directly against
> /dev/da0, so there are no mirror or geli layers at this point.
>
>

> Kernel attach messages of relevance:
> da0 at umass-sim0 bus 0 scbus7 target 0 lun 0
> da0: <WD Elements 107C 1065> Fixed Direct Access SPC-4 SCSI device
> da0: Serial Number 574343344532485548465239
> da0: 40.000MB/s transfers
> da0: 3815415MB (976746240 4096 byte sectors)
> da0: quirks=0x2<NO_6_BYTE>
>
> Relevant output of a pread run:
> ~ # ./pread /dev/da0 < requestdata.txt
> 117535768576(20480)=0(0)
> 189095018496(20480)=0(0)
>
> (last 0 is a trivial patch to include errno output)
>
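
The pread.c attachment itself doesn't survive in the archive; a minimal
equivalent, reconstructed from the description and the sample run above
rather than taken from the actual attachment, looks roughly like this
(the "offset length" stdin format and the offset(length)=result(errno)
output format are inferred):

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    static char buf[64 * 1024];
    long long off, len;
    ssize_t n;
    int fd;

    if (argc != 2)
        errx(1, "usage: pread device < offset-length-pairs");
    if ((fd = open(argv[1], O_RDONLY)) == -1)
        err(1, "%s", argv[1]);

    /* One "offset length" pair per line on stdin; issue a single
     * pread(2) for each and report what came back. */
    while (scanf("%lld %lld", &off, &len) == 2) {
        if (len < 0 || len > (long long)sizeof(buf))
            errx(1, "bad length %lld", len);
        errno = 0;
        n = pread(fd, buf, (size_t)len, (off_t)off);
        printf("%lld(%lld)=%zd(%d)\n", off, len, n, errno);
    }
    return (0);
}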
> On Sat, Aug 25, 2018 at 1:23 AM John-Mark Gurney <jmg@funkthat.com> wrote:
>
>> David Cross wrote this message on Fri, Aug 24, 2018 at 22:42 -0400:
>> >
>> >
>> > > On Aug 24, 2018, at 21:00, John-Mark Gurney <jmg@funkthat.com> wrote:
>> > >
>> > > David Cross wrote this message on Fri, Aug 24, 2018 at 17:54 -0400:
>> > >> Ok, I am seeing something truly bizarre; I am sending this out as a
>> > >> shot across the bow, since I am not even sure where or how to begin
>> > >> debugging this.
>> > >>
>> > >> Some background.  This is on an Intel Xeon 5520 based machine, 72G
>> > >> ECC memory, 11.2, fully patched.  Though this has been a problem
>> > >> since at least 11.1, probably 11.0, and maybe earlier.  (~4G of eli
>> > >> encrypted swap, which it basically never even touches, even when
>> > >> problems are occurring.)
>> > >
>> > > I assume you've applied the lazyfpu SA patch?
>> > >
>> > > If so, there's another patch you need to apply, see:
>> > >
>> > > https://docs.freebsd.org/cgi/mid.cgi?20180821081150.GU2340@kib.kiev.ua
>> > >
>> > I will definitely apply this, but I don't think it applies to the
>> > problem in question.  This system doesn't have AESNI, this problem well
>> > preceded the lazyfpu patch, and I am not seeing any corruption on disk.
>>
>> Hmm, k...
>>
>> > >> The first symptom was (and I think these are all aspects of the
>> > >> same root underlying cause) that fsck on a geli encrypted stripe of
>> > >> 2 USB drives would *randomly* error out on a corrupt entry.  Upon
>> > >> investigating this I discovered by watching gstat that as this
>> > >> happened the IO on the drives would STOP.  The L(q) would hover at 1
>> > >> for a number of seconds, and then when it returned fsck was
>> > >> complaining about various corrupt structures.  A ktrace of fsck
>> > >> shows that it got back data from the pread() that was partially
>> > >> corrupted (I am guessing, but I cannot confirm, that 'some part' of
>> > >> the stack handed back a zeroed page, or otherwise 'not the right
>> > >> data' that geli dutifully 'decrypted').  No errors are ever logged
>> > >> in the kernel about da0 or da1 (the respective underlying USB
>> > >> disks).  It *seems* this is *always* on phase 2 of fsck (files and
>> > >> paths), and it's never the same inode.  No data is *ever* corrupted
>> > >> when in the filesystem, no matter how hard I hit the disks (all
>> > >> data on these devices is fully checksummed).  Devices have passed
>> > >> multiple SMART full diag checks and full read/write tests with no
>> > >> issues.  Under heavy FS IO it does occasionally lock.. but
>> > >> recovers, and again data and filesystem are fully consistent.
>> > >>
>> > >> I was willing to live with that.. weird as it was (these are backup
>> > >> disks, data is fully checksummed, and I was only fscking out of
>> > >> extreme paranoia every reboot).  Then I added an internal drive,
>> > >> configured with gmirror (broken mirror currently, second disk
>> > >> hasn't been added) and geli.  On this disk I have a postgres 10
>> > >> database in WAL replication.  This was working fine, and then the
>> > >> other day the system just locked for a few hours.  During that time
>> > >> I saw the L(q) of the _internal_ disk in the 10,000+ range, and it
>> > >> doing _1_ operation a second to the underlying disk... all the
>> > >> while geli was logging 'error 11' to the console (nothing about the
>> > >> underlying disk).  After this happened a static file on the disk (a
>> > >> zip file) had bad data in the middle of a page (after reboot the
>> > >> file was ok.. so it was just in cache).  Again, this disk fully
>> > >> checks ok, no corruption on the disk, no errors from the disk
>> > >> itself.
>> > >>
>> > >>
>> > >> Halp?  Where do I even begin with this?  It really feels like there
>> > >> is some massive locking going on in geli in some way?  Where should
>> > >> I even begin looking?  I run geli on most of my systems and don't
>> > >> have any issues.
>>
>> Can you post actual log lines?  geli has lots of error log lines, so w/o
>> more info, pretty hard to say WHAT in geli is returning EAGAIN.  I do see
>> that _read_done and _write_done may not handle an EAGAIN error, which
>> could cause this problem, but to confirm, I need the actual log lines...
>>
>> --
>>   John-Mark Gurney                              Voice: +1 415 225 5579
>>
>>      "All that I will do, has been done, All that I have, has not."
>>
>


