Date: Mon, 25 Jun 2001 03:54:53 -0700 (PDT)
From: Matt Dillon
Message-Id: <200106251054.f5PAsrp04325@earth.backplane.com>
To: Bruce Evans
Cc: Mikhail Teterin, jlemon@FreeBSD.ORG, cvs-committers@FreeBSD.ORG, cvs-all@FreeBSD.ORG
Subject: Re: Inline optimized bzero (was Re: cvs commit: src/sys/netinet tcp_subr.c)

:I would expect the opposite.  If the bzero's in the networking code don't
:show up in the network latency benchmarks, where would they show up?  ISTR
:that a Linux hacker who made lmbench1 go faster for Linux saying that the
:bzero() at the start of the FreeBSD tcp_input() is a really stupid thing
:to do.  But I think even completely eliminating it would be just another
:micro-optimization, worth 1% in favourable cases, so you need 10 more like
:it to give a useful speedup.

    I wouldn't expect any incremental change to have a noticeable effect
    on something like lmbench.  From my perusal of the code, the few
    bzero's in tcp/ip's critical path are only likely to save a few
    hundred nanoseconds per packet, so any noticeable effect would tend
    to occur in a system handling lots of simultaneous connections and
    lots of smaller packets.  Even then I wouldn't expect much of an
    effect in a single subsystem.

    The other effects are going to be scattered.  In syscalls, getfh()
    will be 100ns faster.
    In kern_descrip.c, falloc() and fdinit() will be faster because the
    structures being bzero'd are tiny.  There are a bunch of places in
    netinet where small bzero()'s are in the critical path - not just for
    TCP - where exercising that particular subsystem should yield a
    benefit.

    The main point is that the effect can only be better.  I can try to
    work the kernel size down so there is no bloat at all, but right now
    the average change is less than one byte per bzero call.

						-Matt

:...
:>    it added 6ns to the loop, which is fine, but it blew up the constant
:>    optimization and wound up adding a switch table and a dozen
:>    instructions inline (hundreds of bytes!).
:
:Yes, it's clear that alignment is not worth doing in the kernel.  Userland
:is different -- the application might have turned on alignment checking,
:or it might be poorly behaved and pass a lot of unaligned buffers.  gcc
:is primarily a userland compiler, so it's a little surprising that its
:builtins don't worry about alignment.
:
:>    I added alignment checks to i586_bzero but it ate 20ns.  Also,
:>    it should be noted that i586_bzero() as it currently stands does not
:>    do any alignment checks either - it checks only the size argument,
:>    it doesn't check the base pointer.
:
:Neither does generic_bzero().  i586_bzero() just turns itself into
:generic_bzero() for small sizes.  I'm fairly sure that I benchmarked
:this, and came to the conclusion that there is nothing significantly
:better than "rep movsl" when the size isn't known at compile time.  In
:particular, lots of jumps as in i486_bzero are actively bad.  This may
:be P5-specific (branch prediction is not very good on original Pentiums).
:
:Bruce
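
For readers following the thread, here is a rough userland sketch of the
general idea being argued about: zero small buffers of compile-time-constant
size with a few inline stores, and send everything else to the ordinary
routine. This is not the macro from the commit under discussion. The names
SMALL_BZERO, zero_words, struct tiny and small_bzero_demo are made up for
illustration, the 16-byte cutoff is an arbitrary assumption, and
__builtin_constant_p is a gcc/clang extension. Like the code discussed in the
thread, it does no alignment checks; memcpy of a zero word is used only to
keep the sketch well-defined for any pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: zero nwords 32-bit words.  Each fixed-size memcpy
 * is normally compiled to a single store, and it stays valid for
 * unaligned or differently-typed buffers.
 */
static inline void
zero_words(void *p, size_t nwords)
{
	const uint32_t zero = 0;
	unsigned char *b = p;
	size_t i;

	for (i = 0; i < nwords; i++)
		memcpy(b + i * sizeof(zero), &zero, sizeof(zero));
}

/*
 * Hypothetical macro (assumed 16-byte cutoff): a constant size that is a
 * small multiple of 4 takes the inline path, anything else calls memset().
 */
#define SMALL_BZERO(p, n)						\
	((__builtin_constant_p(n) && (n) <= 16 && (n) % 4 == 0) ?	\
	    zero_words((p), (n) / 4) : (void)memset((p), 0, (n)))

struct tiny {
	uint32_t a, b, c;	/* 12 bytes, stands in for a small kernel struct */
};

/* Exercises both paths; returns the number of bytes left non-zero. */
static int
small_bzero_demo(void)
{
	struct tiny t = { 1, 2, 3 };
	unsigned char big[100];
	size_t len = sizeof(big);
	size_t i;
	int bad = 0;

	SMALL_BZERO(&t, sizeof(t));	/* constant 12: inline path */
	assert(t.a == 0 && t.b == 0 && t.c == 0);

	memset(big, 0xff, sizeof(big));
	SMALL_BZERO(big, len);		/* 100 bytes: memset() path */
	for (i = 0; i < sizeof(big); i++)
		bad += (big[i] != 0);
	return (bad);
}
```

Whether this wins anything depends on the compiler folding the size test
away at each call site; comparing the generated code and the object size
with and without the macro is the way to see the per-call cost mentioned
above.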