Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 5 Nov 2014 05:24:17 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Brooks Davis <brooks@freebsd.org>
Cc:        Hans Petter Selasky <hps@selasky.org>, Mateusz Guzik <mjguzik@gmail.com>, jmallett@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>, src-committers@freebsd.org, Bruce Evans <brde@optusnet.com.au>, Konstantin Belousov <kostikbel@gmail.com>, Julian Elischer <julian@freebsd.org>
Subject:   Re: svn commit: r274017 - head/sys/kern
Message-ID:  <20141105040359.F1613@besplex.bde.org>
In-Reply-To: <20141104151101.GG29192@spindle.one-eyed-alien.net>
References:  <201411030746.sA37kpPu037113@svn.freebsd.org> <54573AEE.9010602@freebsd.org> <54573B87.7000801@freebsd.org> <54573CD2.1000702@selasky.org> <20141103092132.GH29497@dft-labs.eu> <20141103100847.GK53947@kib.kiev.ua> <20141104045159.E1605@besplex.bde.org> <79454.1415043356@critter.freebsd.dk> <20141104054144.GB4032@dft-labs.eu> <20141104151101.GG29192@spindle.one-eyed-alien.net>

next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 4 Nov 2014, Brooks Davis wrote:

> On Tue, Nov 04, 2014 at 06:41:44AM +0100, Mateusz Guzik wrote:
>> re-sent with trimmed cc
>>
>> On Mon, Nov 03, 2014 at 07:35:56PM +0000, Poul-Henning Kamp wrote:
>>> --------
>>> In message <20141104045159.E1605@besplex.bde.org>, Bruce Evans writes:
>>>
>>>> This optimization is probably minor, but reminds me of other syscalls
>>>> that would benefit using a single largish allocation up front:
>>>> - all vfs calls that start with namei().  They allocate PATH_MAX (1K)
>>>>   bytes and more.  An 8K stack has plently to spare after allocating
>>>>   1K.  However, if malloc() is used then the PATH_MAX limit shouldn't
>>>>   exist.   Only a limit to prevent denial of service is needed.
>>>
>>> We should actually roll a new rev of all syscalls which take a path
>>> and have them pass strlen(path)+1 into the kernel.
>>>
>>> That would allow both precise allocation and faster copyin, followed
>>> by a check that the last byte is (still) a NUL.
>>
>> I think we can speed up things on amd64 no problem without affecting
>> userspace.
>>
>> amd64's copyinstr (and most likely all others) copy byte by byte. What
>> we can do is copy word size and fall back to byte by byte on page fault.
>> Ten on each iteration check if 0 was encountered.

The speed of copyinstr is unimportant.  If it were, then someone would
have optimized it 25 years ago when memory was slower and caches and
pipelines and branch prediction were smaller or nonexistent, so that
the improvement would have been larger, bith relatively and absolutely.

amd64 ran for many years with unoptimized C versions of most string
functions in libc.  These were only 2-8 times slower than best possible.
No one noticed the slowness or the speedup from changing them.  copyinstr
is a few orders of magnitude (base 10) less important.

> As long as you handle unaligned starting addresses, you can safely copy
> aligned-word by aligned-word without the risk of spurious pagefaults
> (libc makes this assumption all over the place).  You may trip other
> potential memory protection mechanims which aren't limited to page
> alignment, but for exisiting hardware you don't need to worry about it.

x86 libc string functions mostly avoid doing alignment, since that would
just be slower for the usual case where everthing is aligned.  This
depends on the hardware allowing misaligned accesses and being only
slightly slower for them.  The main exceptions are in functions
"optimized" to do word-sized accesses.  They have to align to avoid
running off the end of arrays and getting pagefaults or worse.  I doubt
that word accesses is a useful optimization for normal use of string
functions.  Testing showed that the extra startup time is amortized
by the string length being just 1 or 2 words, but that is for cached
strings.  The cached case now runs so fast (several GB/sec even with
bytewise accesses) that it is hard to find uses where its overhead is
not in the noise.  The uncached case runs slow due to cache misses;
bytewise accesses add a little noise to its slowness.  Pathname lookups
in the kernel are probably closest to the uncached case.  A typical use
is 1 pathname access per file (to open the file).

> That said, I prefer phk's suggestion in many ways.  It places the burden on
> userspace where it belongs and allows precise allocation in the kernel.

I don't prefer it, of course.  The kernel still can't trust any user
parameters.  copyin() may be a little faster than copyinstr(), but now
userland is slower since it has to do a strlen() before every syscall
that passes a counted string.  Better try the modest project of changing
so to require counted strings everywhere.

BTW, copystr() should never have been implemented in asm.  It just gives
copyinstr()'s error handling for kernel strings.  This is convenient, but
"optimizing" copystr() by writing it in asm is even less important than
for copyinstr().  It can be written in a few lines of C using standard
string functions, e.g., by using almost the same code as phk proposed for
user pathnames:

 	size_t lencopied, slen;

 	slen = strlen(kfaddr);
 	lencopied = slen + 1;
 	if (lencopied >= len)
 		lencopied = len;
 	bcopy(kfaddr, kdaddr, lencopied);
 	if (done != NULL)
 		*done = lencopied;
 	return (slen < len ? 0 : ENAMETOOLONG);

This is a little simpler than for copying from user space, since bcopy()
can't trap.

This is likely to be faster than the "optimized" asm versions since it
doesn't always use bytewise accesses.  strlen() gives unnecessary accesses,
but they might not be bytewise.

The man page is quite quite broken:
- it says that the [source] string is at most 'len' bytes long, but I
   think 'len' is actually the size of the target buffer; this should be
   at least 1 larger than the source or target string length.
- it doesn't say if the target buffer contains a (terminated) string on
   error.  Since termination is impossible when len == 0, it cannot
   require termination.

These bugs are missing in at least the i386 asm implementation.  I think
it just copies as much of the source string as possible.  The result
is terminated iff there is space for the terminator.  The above is
supposed to match the i386 implementation, not the man page.

copystr() can now be implemented even more simply but much more slowly
using snprintf().

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20141105040359.F1613>