Date:      Mon, 21 Oct 2013 07:11:39 +0000
From:      "Poul-Henning Kamp" <phk@phk.freebsd.dk>
To:        John-Mark Gurney <jmg@funkthat.com>
Cc:        Mark R V Murray <mark@grondar.org>, freebsd-arch@FreeBSD.org
Subject:   Re: always load aesni or load it when cpu supports it
Message-ID:  <5353.1382339499@critter.freebsd.dk>
In-Reply-To: <20131020161634.GQ56872@funkthat.com>
References:  <20131020070022.GP56872@funkthat.com> <423D921D-6CE5-49D9-BCED-AB14EB236800@grondar.org> <20131020161634.GQ56872@funkthat.com>

In message <20131020161634.GQ56872@funkthat.com>, John-Mark Gurney writes:

>It does look like we already have a good number of consumers for
>crypto/rijndael: geom_bde, ipsec, random and wlan_ccmp...  Which
>also means that they aren't making use of AES accelerator cards...

The reason GBDE didn't use OpenCrypto was that it was horribly slow
compared to direct CPU execution.  I couldn't find a single
computer where using the available hardware was faster.

I spent a lot of time with HiFn chips, and later with the VIA chips'
instruction-based AES, and a rather clear picture emerged for me.

"Distant crypto HW", like the HiFn and pretty much anything else
on the far side of the L[123] cache, is unsuitable for what I will
call "synchronous" crypto, where the CPU needs to do something and
then continue with the result.

It _can_ work for "asynchronous crypto", where the CPU queues some
work and the hw-crypto interrupt handler can schedule it where it
needs to go next, typically a device-driver queue.

With the overheads I measured, you still need pretty massive amounts
of traffic before it pays off, or put another way:  As long as you
have free CPU cycles, it will not pay off.

I haven't looked at OpenCrypto recently, but back then it was pretty
much an IPSEC facility with a proof-of-concept userland device driver.

I tried to add a more generic facility so that it could also be
usable for disk-I/O, and when that failed to get results I added a
GEOM-specific facility, but I never managed to get even that to
improve GBDE performance[1], so I never committed it.

My suggestion moving forward is to implement this distinction between
"synchronous crypto" and "asynchronous crypto" (or maybe "CPU crypto"
vs. "IO crypto" ?) in the architecture, and stop pretending that
OpenCrypto will ever cater to both needs.
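
To make the distinction concrete, here is a rough sketch of what the
two interfaces could look like.  Every name in it (cpu_crypt,
io_crypt_req, io_crypt_submit, io_crypt_drain, count_done) is made up
for illustration and none of it is OpenCrypto's API.  A byte-wise XOR
stands in for the cipher, and io_crypt_drain() stands in for the
hardware interrupt handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder transform standing in for a real cipher (NOT real crypto). */
static void toy_transform(const uint8_t *in, uint8_t *out, size_t len,
    uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        out[i] = in[i] ^ key;
}

/* "CPU crypto": synchronous, the result is ready when the call returns. */
static void cpu_crypt(const uint8_t *in, uint8_t *out, size_t len,
    uint8_t key)
{
    toy_transform(in, out, len, key);
}

/* "IO crypto": asynchronous, the caller queues work and moves on. */
struct io_crypt_req {
    const uint8_t *in;
    uint8_t *out;
    size_t len;
    uint8_t key;
    void (*done)(struct io_crypt_req *req);  /* completion callback */
    void *ctx;                               /* owned by the caller */
    struct io_crypt_req *next;
};

struct io_crypt_queue {
    struct io_crypt_req *head, *tail;
};

/* Append a request to the queue; nothing is computed yet. */
static void io_crypt_submit(struct io_crypt_queue *q,
    struct io_crypt_req *req)
{
    assert(req->done != NULL);
    req->next = NULL;
    if (q->tail != NULL)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
}

/*
 * Stand-in for the hw-crypto interrupt handler: finish each queued
 * request and hand the result on through its callback.
 * Returns the number of requests completed.
 */
static int io_crypt_drain(struct io_crypt_queue *q)
{
    struct io_crypt_req *req;
    int n = 0;

    while ((req = q->head) != NULL) {
        q->head = req->next;
        if (q->head == NULL)
            q->tail = NULL;
        toy_transform(req->in, req->out, req->len, req->key);
        req->done(req);
        n++;
    }
    return n;
}

/* Example completion callback: counts completions through req->ctx. */
static void count_done(struct io_crypt_req *req)
{
    (*(int *)req->ctx)++;
}
```

The point of keeping them apart is that a cpu_crypt() caller can use
the output on the very next line, while an io_crypt_submit() caller
must not touch it until the callback has run.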

For CPU crypto I would simply do the memcpy() thing:  Have a
function pointer replaced with CPU-specific code if available.
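
A minimal sketch of that function-pointer swap, under loud
assumptions: all names (toy_crypt, toy_crypt_select, and so on) are
hypothetical, a toy XOR stands in for AES, and the "CPU-specific"
routine is just a word-at-a-time version of the same loop.  A real
implementation would get the cpu_has_fast_path flag from a CPUID
probe at boot or load time, and would plug in AES-NI code instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*toy_crypt_fn)(uint8_t *buf, size_t len, uint8_t key);

/* Portable fallback: one byte at a time. */
static void toy_crypt_generic(uint8_t *buf, size_t len, uint8_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key;
}

/*
 * Stand-in for a CPU-specific routine: same result as the generic
 * version, but processes 8 bytes per step and finishes the tail
 * byte-wise.  memcpy() avoids unaligned 64-bit loads and stores.
 */
static void toy_crypt_wide(uint8_t *buf, size_t len, uint8_t key)
{
    uint64_t k = 0x0101010101010101ULL * key;  /* key in every byte */
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof(w));
        w ^= k;
        memcpy(buf + i, &w, sizeof(w));
    }
    for (; i < len; i++)
        buf[i] ^= key;
}

/* The pointer every consumer calls through; defaults to the fallback. */
static toy_crypt_fn toy_crypt_impl = toy_crypt_generic;

/* Run once at startup with the outcome of the (assumed) CPU probe. */
static void toy_crypt_select(int cpu_has_fast_path)
{
    toy_crypt_impl = cpu_has_fast_path ? toy_crypt_wide
        : toy_crypt_generic;
}

/* Public entry point: callers never know which version they got. */
static void toy_crypt(uint8_t *buf, size_t len, uint8_t key)
{
    toy_crypt_impl(buf, len, key);
}
```

Consumers only ever call toy_crypt(), so the choice is made once and
costs one indirect call per invocation, with no queueing and no
interrupt in the path.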

Please notice that this should happen in userland too, and should
be standardized across operating systems, so ports can use it
to forego their private C&P copies of common crypto algorithms.
(see also: http://queue.acm.org/detail.cfm?id=1944489)

Also notice that we will see more of this kind of "CISC-Creep" in
the future:  Intel and AMD need to find ways to spend transistors
to claim speedups, so we will get more and more weird instructions
for speeding up tight loops.  Make whatever you do able to also
handle when sprintf(3) becomes an instruction.

Poul-Henning

[1] GBDE is a bit of a trouble-maker because it changes keys all
the time, but unless you can dedicate a crypto instance so that you
don't have to do key setup, this makes no difference in practice.
OpenCrypto did not have support for "reserving" crypto instances
this way last I looked.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


