Date: Mon, 21 Oct 2013 07:11:39 +0000
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
To: John-Mark Gurney <jmg@funkthat.com>
Cc: Mark R V Murray <mark@grondar.org>, freebsd-arch@FreeBSD.org
Subject: Re: always load aesni or load it when cpu supports it
Message-ID: <5353.1382339499@critter.freebsd.dk>
In-Reply-To: <20131020161634.GQ56872@funkthat.com>
References: <20131020070022.GP56872@funkthat.com> <423D921D-6CE5-49D9-BCED-AB14EB236800@grondar.org> <20131020161634.GQ56872@funkthat.com>

In message <20131020161634.GQ56872@funkthat.com>, John-Mark Gurney writes:

>It does look like we already have a good number of consumers for
>crypto/rijndael: geom_bde, ipsec, random and wlan_ccmp... Which
>also means that they aren't making use of AES accelerator cards...

The reason GBDE didn't use OpenCrypto was that it was horribly slow compared to direct CPU execution. I couldn't find one single computer where using the available hardware was faster.

I spent a lot of time with HiFn chips, and later with the VIA chips' instruction-based AES, and a rather clear picture emerged for me.

"Distant crypto HW", like the HiFn and pretty much anything else on the far side of the L[123] cache, is unsuitable for what I will call "synchronous" crypto, where the CPU needs to do something and then continue with the result.

It _can_ work for "asynchronous crypto", where the CPU queues some work and the hw-crypto interrupt handler can schedule it where it needs to go next, typically a device-driver queue.

With the overheads I measured, you still need pretty massive amounts of traffic before it pays off, or put another way: As long as you have free CPU cycles, it will not.

I haven't looked at OpenCrypto recently, but back then it was pretty much an IPSEC facility with a proof-of-concept userland device driver.

I tried to add a more generic facility so that it could also be used for disk I/O, and when that failed to get results I added a GEOM-specific facility, but I never managed to get even that to improve GBDE performance[1], so I never committed it.

My suggestion moving forward is to implement this distinction between "synchronous crypto" and "asynchronous crypto" (or maybe "CPU crypto" vs. "IO crypto"?) in the architecture, and stop pretending that OpenCrypto will ever cater to both needs.

For CPU crypto I would simply do the memcpy() thing: Have a function pointer replaced with CPU-specific code if available.
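
A minimal sketch of that function-pointer approach, under stated assumptions: the names `xor_fn`, `xor_generic`, `xor_accel`, `xor_impl`, `cpu_has_aes` and `xor_select_impl` are hypothetical and not taken from FreeBSD, the XOR routine is only a stand-in for a real cipher primitive, and CPU detection relies on the GCC/Clang x86 builtin `__builtin_cpu_supports`, falling back to "not available" everywhere else.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature of a "CPU crypto" primitive: dst = src XOR ks. */
typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src,
                       const uint8_t *ks, size_t len);

/* Portable fallback: one byte at a time. */
static void xor_generic(uint8_t *dst, const uint8_t *src,
                        const uint8_t *ks, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ ks[i];
}

/*
 * Stand-in for the CPU-specific version. Real code would use AES-NI or
 * similar instructions here; this one only works on 8-byte words so that
 * the sketch stays portable.
 */
static void xor_accel(uint8_t *dst, const uint8_t *src,
                      const uint8_t *ks, size_t len)
{
    size_t i = 0;

    for (; i + 8 <= len; i += 8) {
        uint64_t a, b;
        memcpy(&a, src + i, sizeof(a));
        memcpy(&b, ks + i, sizeof(b));
        a ^= b;
        memcpy(dst + i, &a, sizeof(a));
    }
    for (; i < len; i++)        /* tail bytes */
        dst[i] = src[i] ^ ks[i];
}

/* Assumption: GCC or Clang on x86; any other target reports "no". */
static int cpu_has_aes(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("aes") != 0;
#else
    return 0;
#endif
}

/* Callers always go through this pointer; it starts out safe. */
static xor_fn xor_impl = xor_generic;

/* Run once at startup: swap in the CPU-specific code if available. */
static void xor_select_impl(void)
{
    if (cpu_has_aes())
        xor_impl = xor_accel;
}
```

A caller would run `xor_select_impl()` once during initialization and afterwards only call `xor_impl(dst, src, ks, len)`. The selection cost is paid once, and the call itself is synchronous with no queueing or interrupt handling in between.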

Please notice that this should happen in userland too, and should be standardized across operating systems, so ports can use it to forgo their private C&P copies of common crypto algorithms. (See also: http://queue.acm.org/detail.cfm?id=1944489)

Also notice that we will see more of this kind of "CISC-Creep" in the future: Intel and AMD need to find ways to spend transistors to claim speedups, so we will get more and more weird instructions for speeding up tight loops.

Make whatever you do able to also handle the day sprintf(3) becomes an instruction.

Poul-Henning

[1] GBDE is a bit of a troublemaker because it changes keys all the time, but unless you can dedicate a crypto instance so that you don't have to do key setup, this makes no difference in practice. OpenCrypto did not have support for "reserving" crypto instances this way last I looked.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.