FreeBSD Mail Archives

Date:      Tue, 26 Jun 2018 08:18:43 -0700
From:      bob prohaska <fbsd@www.zefox.net>
To:        Mark Millard <marklmi@yahoo.com>
Cc:        Jamie Landeg-Jones <jamie@catflap.org>, Warner Losh <imp@bsdimp.com>, freebsd-arm <freebsd-arm@freebsd.org>, bob prohaska <fbsd@www.zefox.net>
Subject:   Re: RPI3 swap experiments, was Re: GPT vs MBR for swap devices
Message-ID:  <20180626151843.GD17293@www.zefox.net>
In-Reply-To: <A6986B21-FF6E-48F5-9F3A-06B3D2A92C55@yahoo.com>
References:  <10CAC122-399D-459E-9153-ABD7E753777E@yahoo.com> <a2d7f4d3-0b6d-f82d-bae8-0988b0b54a8f@sentry.org> <20180623143218.GA6905@www.zefox.net> <03C2D3C4-6E90-4054-AF79-BD7FE2B7958D@yahoo.com> <20180624231020.GA11132@www.zefox.net> <C87C40CF-15B2-4137-892C-F2ADBAB32418@yahoo.com> <20180626052451.GA17293@www.zefox.net> <CANCZdfpXyzxzOZ8pqcRtuFsxYx5Jjs9oSL1ok2sGVPHdiB0qVQ@mail.gmail.com> <201806261040.w5QAeBKq035183@donotpassgo.dyslexicfish.net> <A6986B21-FF6E-48F5-9F3A-06B3D2A92C55@yahoo.com>

On Tue, Jun 26, 2018 at 07:37:59AM -0700, Mark Millard wrote:
> 
> 
> On 2018-Jun-26, at 3:40 AM, Jamie Landeg-Jones <jamie at catflap.org> wrote:
> 
> > Warner Losh <imp at bsdimp.com> wrote:
> > 
> >>>> g_vfs_done():da0d[WRITE(offset=51819347968, length=131072)]error = 5
> >>>> g_vfs_done():da0d[WRITE(offset=51819479040, length=28672)]error = 5
> >>>> g_vfs_done():da0d[READ(offset=59586936832, length=32768)]error = 5
> >>>> g_vfs_done():vm_fault: pager read error, pid 823 (tcsh)
> >>> 
> >> 
> >> The device is broken if you get this. Period. I don't know if it is
> >> hardware, or software, but it is not a reliable storage device. Until
> >> that's fixed, you'll continue to have a terrible experience with it.
> >> 
> > 
> > [ ... ]
> > 
> >> Sorry to sound so harsh, but the data has been consistent on this for
> >> everything you've reported: it works for a while, then we get a bunch of
> >> errors then a reboot. We need to start narrowing down which of these three
> >> broad classes of root causes it is. I'd rank actual bad thumbdrive last on
> >> the list. It's a tossup for me between missing quirk and a bug in the rpi
> >> usb driver that manifests itself only under heavy load. IIRC, you said one
> >> of rpi2/3 works and the other doesn't, which would suggest a usb bridge
> >> driver problem...
> > 
> > For what it's worth, I had the same errors on a rpi3 a few months ago, and
> > eventually gave up "to sort it tomorrow" - it hasn't been powered on since, but
> > I still want to get it working.
> > 
> > The system would run fine, but give the vfs errors on the 128GB usb thumb
> > drive every week - like clockwork, when one of the heavier periodic jobs ran.
> > 
> > I was running the latest CURRENT at the time. The thumb drive works fine elsewhere,
> > and indeed - did on the same hardware when I test installed a linux install,
> > and thrashed the hell out of it.
> > 
> > I'll fire it up again - hopefully I'll still have the same results, and with 2
> > of us, we may find the cause quicker.
> > 
> > (n.b. I never had swap errors, but I can't recall if I ever configured swap on the usb
> > drive)
> 
> The presence of the errors is a confounding variable for the other
> issues being looked into.
> 
> It would likely be better for the effort to be split:
> 
> A) Looking into the drive errors and what range of contexts
>    get them, hoping to find something to fix the issue (such
>    as by adding a quirk).
> 
> B) Looking into the swapping and Out Of Memory process killing
>    --but absent such errors being involved. (For now this might
>    require a different instance of the same type of device
>    or a different type of device.)
> 
> It seems too complicated to be investigating (B) but in a
> context with the drive errors also involved.
> 
> As I remember, Bob P. did reproduce drive errors even without
> the problem drive being used for swapping. This too suggests
> (A) as a separate activity.
> 
Indeed, it is a requirement. If the suspect device is used for swapping,
OOMA kills prevent the test from progressing to the point of failure.

> If only one of the 2 is targeted first, (A) may be the
> better one to pursue for those with reproducible examples.
> 
> For those with contexts that lack the drive errors, (B)
> activity might show a contrasting behavior for lack of drive
> errors --or the behavior might be reproduced. Cross-checking
> whether drive errors started showing up would be appropriate.
> 
> An interesting question for (A) might be whether some drive benchmark
> program(s) might reproduce the drive errors. If such were found,
> the context for reproduction would be far simpler than buildworld
> buildkernel use.
> 

The machine exhibiting the errors has Peter Holm's stress2 test suite
installed. It compiled and ran, but apparently some art is required
to craft a test case that exercises the weak points of interest
without tripping over weak points not of interest. I'd be pleased to
try it, but will need some guidance. If there are more expedient
things to attempt, please indicate what they are. For example, would
it be informative to simply try a -j3 or -j2 buildworld? In the past
-j1 buildworld has run to completion, but it hasn't been tried recently.
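
As a rough illustration of the "simpler reproduction" idea quoted above, something like the following Python sketch might serve as a starting point. It is not part of stress2 and has not been run on the affected machine; the function name write_and_verify and its defaults are made up for the example. It writes random blocks to a file on the suspect drive, forces them out with fsync, then reads them back and compares checksums. One caveat: unless the file is larger than RAM, the read-back may be served from the buffer cache rather than from the device.

```python
import hashlib
import os


def write_and_verify(path, block_size=128 * 1024, blocks=64, passes=2):
    """Write random blocks to `path`, fsync, read back, compare digests.

    Returns a list of (pass_index, block_index, reason) tuples, one per
    failure. A block_index of -1 means the whole write or read phase
    raised an OSError (for example EIO from the device).
    """
    failures = []
    for p in range(passes):
        digests = []
        try:
            with open(path, "wb") as f:
                for _ in range(blocks):
                    data = os.urandom(block_size)
                    digests.append(hashlib.sha256(data).digest())
                    f.write(data)
                f.flush()
                os.fsync(f.fileno())  # push the data toward the device
        except OSError as e:
            failures.append((p, -1, "write: %s" % e))
            continue
        try:
            with open(path, "rb") as f:
                for i, want in enumerate(digests):
                    data = f.read(block_size)
                    if hashlib.sha256(data).digest() != want:
                        failures.append((p, i, "mismatch"))
        except OSError as e:
            failures.append((p, -1, "read: %s" % e))
    return failures
```

Calling write_and_verify() with a path on the thumb drive and a large blocks value would be the intended use; an empty list means no errors were seen, while g_vfs_done messages on the console during the run would point at the device path rather than at swapping.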

Thanks for writing!

bob prohaska

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180626151843.GD17293>
