Date:      Sat, 14 Nov 2020 13:39:32 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Scott Long <scottl@samsco.org>, Alexander Motin <mav@freebsd.org>,  "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject:   Re: MAXPHYS bump for FreeBSD 13
Message-ID:  <CANCZdfroqvHhwBxhbB1SqcZgM6o_pcAFxWqPQ3Y4Z+Zqd1FCTA@mail.gmail.com>
In-Reply-To: <X7AzRMZ5FSRpzXqo@kib.kiev.ua>
References:  <aac44f21-3a6d-c697-bb52-1a669b11aa3b@FreeBSD.org> <X7Aj0a6ZtIQfBscC@kib.kiev.ua> <634E35E9-9E2B-40CE-9C70-BB130BD9D614@samsco.org> <X7AzRMZ5FSRpzXqo@kib.kiev.ua>

On Sat, Nov 14, 2020, 12:43 PM Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Sat, Nov 14, 2020 at 11:48:34AM -0700, Scott Long wrote:
> >
> >
> > > On Nov 14, 2020, at 11:37 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
> > >
> > > On Sat, Nov 14, 2020 at 10:01:05AM -0500, Alexander Motin wrote:
> > >> On Fri, 13 Nov 2020 21:09:37 +0200 Konstantin Belousov wrote:
> > >>> To put the specific numbers: for struct buf it means an increase of
> > >>> 1792 bytes.  For bio it does not, because bio does not embed a
> > >>> vm_page_t[] into the structure.
> > >>>
> > >>> Worse, the typical struct buf addend for the excess vm_page pointers
> > >>> is going to be unused, because the normal UFS block size is 32K.  It
> > >>> is only going to be used by clusters and physbufs.
> > >>>
> > >>> So I object to bumping this value without reworking the buffers'
> > >>> handling of b_pages[].  The most straightforward approach is to stop
> > >>> using MAXPHYS to size this array, and to use an external array for
> > >>> clusters.  Pbufs can embed the large array.
> > >>
> > >> I am not very familiar with struct buf usage, so I'd appreciate some
> > >> help there.
> > >>
> > >> Quickly looking at pbuf, it seems trivial to allocate an external
> > >> b_pages array of any size in pbuf_init; that should easily satisfy all
> > >> of pbuf's descendants.  The cluster and vnode/swap pager code are pbuf
> > >> descendants also.  The vnode pager, I guess, may only need a
> > >> replacement for nitems(bp->b_pages) in a few places.
> > > I planned to look at making MAXPHYS a tunable.
> > >
> > > You are right, we would need:
> > > 1. Move b_pages to the end of struct buf and declare it as a flexible
> > > array.  This would make the KBI worse, because struct buf depends on
> > > some debugging options, and then the b_pages offset depends on the
> > > config.
> > >
> > > Another option could be to change b_pages to a pointer, if we are fine
> > > with one more indirection.  But in my plan the real array is always
> > > allocated past struct buf, so a flexible array is actually more correct.
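(A minimal userspace sketch of the layout kib describes, with hypothetical
names; this is not the real struct buf, just the flexible-array idea with the
page array allocated in the same chunk as the struct:)

```c
#include <assert.h>
#include <stdlib.h>

typedef struct vm_page *vm_page_t;	/* opaque stand-in for the kernel type */

/*
 * Sketch only.  b_pages becomes a flexible array member, so the backing
 * array always sits directly past the struct and its capacity is chosen
 * at allocation time (small for buf_zone, MAXPHYS-sized for pbuf_zone).
 */
struct buf_sk {
	int		b_flags;	/* would carry B_SMALLPAGES etc. */
	int		b_npages_max;	/* capacity of b_pages[] */
	vm_page_t	b_pages[];	/* real array allocated past the struct */
};

static struct buf_sk *
buf_alloc_sk(int npages)
{
	struct buf_sk *bp;

	/* One allocation covers the struct plus npages trailing pointers. */
	bp = calloc(1, sizeof(*bp) + npages * sizeof(vm_page_t));
	if (bp != NULL)
		bp->b_npages_max = npages;
	return (bp);
}
```

Since the array is always placed immediately after the struct, the flexible
member costs nothing over the pointer variant and avoids the extra
indirection.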
> > >
> >
> > I like this, and I was in the middle of writing up an email that
> > described it.  There could be multiple malloc types or UMA zones of
> > different sizes, depending on the intended I/O size, or just a runtime
> > change to a single allocation size.
> I do not think we need new/many zones.
>
> Queued (getnewbuf()) bufs come from buf_zone, and pbufs are allocated
> from pbuf_zone.  Those should be fixed allocation sizes, with a small
> b_pages[] for buf_zone and a large (MAXPHYS-sized) one for pbuf_zone.
>
> Everything else, if any, would need to pre-calculate malloc size.
>
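(In userspace terms, the two fixed item sizes could be pre-calculated roughly
like this; the page size and the bumped MAXPHYS value here are illustrative
assumptions, not proposals:)

```c
#include <stddef.h>

#define	PAGE_SIZE_SK	4096		/* assumed page size */
#define	SMALL_IO_SK	(32 * 1024)	/* typical UFS block size, per the thread */
#define	MAXPHYS_SK	(1024 * 1024)	/* hypothetical bumped/tuned MAXPHYS */

/* Worst case: an unaligned transfer of 'bytes' touches one extra page. */
static size_t
io_pages_sk(size_t bytes)
{
	return (bytes / PAGE_SIZE_SK + 1);
}

/*
 * Fixed item size for a zone whose buffers must map transfers up to
 * 'maxio': the struct itself plus the trailing page-pointer array.
 * Computed once (e.g. at boot, after the tunable is read).
 */
static size_t
zone_item_size_sk(size_t structsz, size_t maxio)
{
	return (structsz + io_pages_sk(maxio) * sizeof(void *));
}
```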

How will this affect clustered reads for things like read ahead?

>
> > > 2. Preallocating both normal bufs and pbufs together with the arrays.
> > >
> > > 3. I considered adding a B_SMALLPAGES flag to b_flags and using it to
> > > indicate that a buffer has 'small' b_pages.  All buffers rotated through
> > > getnewbuf()/buf_alloc() should have it set.
> > >
> >
> > This would work nicely with a variable sized allocator, yes.
> >
> > > 4. There could be some places which either malloc() struct buf or
> > > allocate it on the stack (I tend to believe I converted all of the
> > > latter to the former).  They would need special handling.
> > >
> >
> > I couldn't find any places that allocated a buf on the stack or embedded
> > it into another structure.
> As I said, I did a pass to eliminate stack allocations for bufs.
> As a result, for instance, flushbufqueues() mallocs struct buf, but it does
> not use the b_pages[] of the allocated sentinel.
>

Yeah, I recall both the pass and looking for them later and not finding any
either...

>
> > > md(4) uses pbufs.
> > >
> > > 5. My larger concern is, in fact, CAM and the drivers.
> > >
> >
> > Can you describe your concern?
> My opinion is that during this work all uses of MAXPHYS should be reviewed,
> and there are a lot of drivers that reference the constant.  From past
> experience, I expect some evil, ingenious (ab)use.
>
> Same for bufs, but apart from the application of pbufs in cam_periph.c,
> I do not think drivers have much use of them.
>

Do you have precise definitions for DFLTPHYS and MAXPHYS? That might help
ferret out the differences between the two. I have seen several places that
use one or the other in ways that seem incorrect, but I can't quite
articulate why; having good definitions written down would help. There are
some places that likely want a fixed constant reflecting the hardware, not a
FreeBSD tuning parameter.
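(For reference, my recollection of the stock definitions in sys/param.h, the
pre-13 defaults, is the following; the comments are roughly as they appear in
that header:)

```c
/* From sys/param.h -- the stock, pre-tunable definitions: */
#ifndef MAXPHYS
#define	MAXPHYS		(128 * 1024)	/* max raw I/O transfer size */
#endif
#ifndef DFLTPHYS
#define	DFLTPHYS	(64 * 1024)	/* default max raw I/O transfer size */
#endif
```

Neither comment says whether the limit describes what the hardware can do or
what the kernel is willing to build, which is exactly the ambiguity that shows
up in the drivers.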

As an aside, there are times I want to do transfers of arbitrary sizes for
certain vendor-specific pass-through commands that have no way to read the
results in chunks. Thankfully most newer drives don't have this restriction,
but it still comes up. That's way below the buf layer, though, and handled
today by cam_periph and the pbufs there. These operations are rare and
typically happen when the system is mostly idle, so low-memory situations can
be ignored beyond error handling and retry in the user program. Would this
work make those possible? Or would MAXPHYS, however set, still limit them?

Warner

>
> > >>
> > >> Could you or somebody help with the vfs/ffs code, where I suppose the
> > >> smaller page lists are used?
> > > Do you plan to work on this?  I can help, sure.
> > >
> > > Still, I wanted to make MAXPHYS (and the 'small' MAXPHYS, which is not
> > > the same as DFLTPHYS) a tunable, in the scope of this work.
> >
> > Sounds great, thank you for looking at it.
> >
> > Scott
> >


