Date:      Mon, 12 Jan 2015 15:34:11 -0800
From:      John-Mark Gurney <jmg@funkthat.com>
To:        rozhuk.im@gmail.com
Cc:        freebsd-hackers@freebsd.org, freebsd-geom@freebsd.org
Subject:   Re: ChaCha8/12/20 and GEOM ELI tests
Message-ID:  <20150112233411.GP1949@funkthat.com>
In-Reply-To: <54b43144.2d08980a.437b.0f8f@mx.google.com>
References:  <54b33bfa.e31b980a.3e5d.ffffc823@mx.google.com> <20150112072249.GM1949@funkthat.com> <54b43144.2d08980a.437b.0f8f@mx.google.com>

rozhuk.im@gmail.com wrote this message on Mon, Jan 12, 2015 at 23:40 +0300:
> > > ChaCha patch:
> > >
> > http://netlab.linkpc.net/download/software/FreeBSD/patches/chacha.patch
> > 
> > What's the difference between CHACHA and XCHACHA?
> 
> Same as between SALSA and XSALSA.
> 
> XChaCha20 uses a 256-bit key as well as the first 128 bits of the nonce in
> order to compute a subkey. This subkey, as well as the remaining 64 bits of
> the nonce, are the parameters of the ChaCha20 function used to actually
> generate the stream.
> 
> But with XChaCha20's longer nonce, it is safe to generate nonces using
> randombytes_buf() for every message encrypted with the same key without
> having to worry about a collision.
> 
> More details: http://cr.yp.to/snuffle/xsalsa-20081128.pdf

Ahh, thanks..
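
So if I'm reading that right, the construction boils down to roughly the
sketch below; hchacha20() and chacha20_xor() are just illustrative names
for the subkey derivation and the plain stream cipher, not functions from
your patch:

#include <stddef.h>
#include <stdint.h>

/* Assumed helpers, named only for illustration. */
void	hchacha20(uint8_t subkey[32], const uint8_t key[32],
	    const uint8_t nonce16[16]);
void	chacha20_xor(uint8_t *out, const uint8_t *in, size_t len,
	    const uint8_t key[32], const uint8_t nonce8[8]);

/*
 * XChaCha20: derive a subkey from the 256-bit key and the first 128 bits
 * of the 192-bit nonce, then run plain ChaCha20 with that subkey and the
 * remaining 64 bits of the nonce.
 */
void
xchacha20_xor(uint8_t *out, const uint8_t *in, size_t len,
    const uint8_t key[32], const uint8_t nonce[24])
{
	uint8_t subkey[32];

	hchacha20(subkey, key, nonce);
	chacha20_xor(out, in, len, subkey, nonce + 16);
}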

> > Also, where are the man page diffs?  They might have explained the
> > difference between the two, and explained why two versions of chacha
> > are needed...
> 
> No man page diffs.

You need to document the new defines in crypto(9), and document the
various parameters in crypto(7)...  Yes, not all modes are documented
in crypto(7), but going forward, at a minimum we need to document new
additions...

I'll admit I didn't document the other algorithms as I'm not as familiar
w/ those as the ones that I worked on...

> The man pages do not explain the difference between AES-CBC and AES-XTS...

True, but CBC and XTS (which includes a reference to the standard) are
a lot more searchable/common knowledge than xchacha..  Google thinks you
mean chacha, and xchacha just turns up a bunch of people on various
networks... Not until you search for xchacha crypto do you get a relevant
page...  Also, Wikipedia doesn't have an entry for xchacha, nor does
the chacha (cipher) page list it...  So, when documenting xchacha in
crypto(7), include a link to the description/standard...

> > Is there a reason you decided to write your own ChaCha implementation
> > instead of using one of the standard ones?  Did you run performance
> > tests between your implementation and others?
> 
> Reference ChaCha and reference (FreeBSD) XTS (4k sector):
> ChaCha8-XTS-256   = 199518722 bytes/sec
> ChaCha12-XTS-256  = 179029849 bytes/sec
> ChaCha20-XTS-256  = 149447317 bytes/sec
> XChaCha8-XTS-256  = 195675728 bytes/sec
> XChaCha12-XTS-256 = 175790196 bytes/sec
> XChaCha20-XTS-256 = 147939263 bytes/sec

So, you're seeing a 33%-50% improvement, good to hear...

Also, do you publish this implementation somewhere?  If so, it'd be
helpful to include a url to where up to date versions can be obtained...
If you don't plan on publishing/maintaining it outside of FreeBSD, then
we need to unifdef out the Windows parts of it for our tree...

> This is the reference version adapted for use in /dev/crypto.
> chacha_block_unaligned() - processes a data block as in the reference version.
> Macros are used for readability.
> chacha_block_aligned() - the same, but operating on aligned data.

Please use the macro __NO_STRICT_ALIGNMENT to decide if special work
is necessary to handle the alignment...
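Roughly what I have in mind is below; just a sketch, and the prototypes
for chacha_block_aligned()/chacha_block_unaligned() are assumptions based
on your description, not the actual ones from the patch:

#include <sys/param.h>
#include <sys/types.h>

struct chacha_ctx;
void	chacha_block_aligned(struct chacha_ctx *, const uint8_t *, uint8_t *);
void	chacha_block_unaligned(struct chacha_ctx *, const uint8_t *, uint8_t *);

static void
chacha_block(struct chacha_ctx *ctx, const uint8_t *in, uint8_t *out)
{
#ifdef __NO_STRICT_ALIGNMENT
	/* Unaligned word accesses are fine on this arch (e.g. x86). */
	chacha_block_aligned(ctx, in, out);
#else
	if (ALIGNED_POINTER(in, uint64_t) && ALIGNED_POINTER(out, uint64_t))
		chacha_block_aligned(ctx, in, out);
	else
		chacha_block_unaligned(ctx, in, out);
#endif
}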

What is the CHACHA_X64 macro for?  If that is to detect LP64 platforms,
please use the macro __LP64__ to decide this...  Have you done
performance evaluations on 32bit arches to make sure double rounds aren't
a benefit there too?
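If it is LP64 detection, something like this is all that should be needed
(chacha_word_t is just an illustrative name):

#include <stdint.h>

#ifdef __LP64__
typedef uint64_t chacha_word_t;	/* 64-bit arches: work 8 bytes at a time */
#else
typedef uint32_t chacha_word_t;	/* 32-bit arches: work 4 bytes at a time */
#endif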

Use the byteorder(9) macros to encode/decode integers instead of rolling
your own (U8TO32_LITTLE and U32TO8_LITTLE)...  Turns out compilers aren't
good at optimizing this type of code, and platforms may have assembly
optimized versions for these...
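For example, loading the key and nonce words would look roughly like this
with byteorder(9) (le32dec()/le32enc() come from <sys/endian.h>; the
function name here is only for illustration):

#include <sys/types.h>
#include <sys/endian.h>

static void
chacha_keysetup_sketch(uint32_t state[16], const uint8_t key[32],
    const uint8_t nonce[8])
{
	int i;

	/* Key goes into words 4..11 of the ChaCha state. */
	for (i = 0; i < 8; i++)
		state[4 + i] = le32dec(key + 4 * i);	/* was U8TO32_LITTLE() */
	/* 64-bit nonce goes into words 14..15; le32enc() likewise
	 * replaces U32TO8_LITTLE() when writing keystream out. */
	state[14] = le32dec(nonce);
	state[15] = le32dec(nonce + 4);
}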

> To increase speed, data is processed 4/8 bytes at a time instead of one byte
> at a time.  The data in the context is 8-byte aligned.
> To increase security, all data, including temporaries, is kept in the context,
> which is filled with zeros when the work completes.

Please use the function explicit_bzero that is available for all of
these instead of creating your own..
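i.e., something like this when the context is torn down.  The struct
layout here is made up for illustration, and explicit_bzero() is declared
in <strings.h> in userland; adjust the include for wherever this ends up
building:

#include <strings.h>	/* explicit_bzero() */
#include <stdint.h>

/* Illustrative layout only; the real context is in the patch. */
struct chacha_ctx_sketch {
	uint32_t state[16];
	uint8_t  tmp[64];	/* temporaries kept in the ctx, per the above */
};

static void
chacha_ctx_clear(struct chacha_ctx_sketch *ctx)
{
	/* Zeroing that the compiler is not allowed to optimize away. */
	explicit_bzero(ctx, sizeof(*ctx));
}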

> > > HW: Core Duo E8500, 8Gb DDR2-800.
> > > dd if=/dev/zero of=/dev/md0 bs=1m
> > > 2148489421 bytes/sec
> > >
> > >
> > > # sector = 512b
> > > 3DES-CBC-192      =  20773120 bytes/sec
> > > AES-CBC-128       =  85276853 bytes/sec
> > > AES-CBC-256       =  68893016 bytes/sec
> > > AES-XTS-128       =  68194868 bytes/sec
> > > AES-XTS-256       =  56611573 bytes/sec
> > > Blowfish-CBC-128  =  11169657 bytes/sec
> > > Blowfish-CBC-256  =  11185891 bytes/sec
> > > Camellia-CBC-128  =  78077243 bytes/sec
> > > Camellia-CBC-256  =  65732219 bytes/sec
> > > ChaCha8-XTS-256   = 258042765 bytes/sec
> > > ChaCha12-XTS-256  = 223616967 bytes/sec
> > > ChaCha20-XTS-256  = 176005366 bytes/sec
> > > XChaCha8-XTS-256  = 228292624 bytes/sec
> > > XChaCha12-XTS-256 = 195577624 bytes/sec
> > > XChaCha20-XTS-256 = 152247267 bytes/sec
> > > XChaCha20-XTS-128 = 152717737 bytes/sec ! 128-bit key has the same speed
> > > as 256
> > >
> > >
> > > # sector = 4kb
> > > 3DES-CBC-192      =  22018189 bytes/sec
> > > AES-CBC-128       = 104097143 bytes/sec
> > > AES-CBC-256       =  81983833 bytes/sec
> > > AES-XTS-128       =  78559346 bytes/sec
> > > AES-XTS-256       =  66047200 bytes/sec
> > > Blowfish-CBC-128  =  38635464 bytes/sec
> > > Blowfish-CBC-256  =  38810555 bytes/sec
> > > Camellia-CBC-128  =  92814510 bytes/sec
> > > Camellia-CBC-256  =  75949489 bytes/sec
> > > ChaCha8-XTS-256   = 337336982 bytes/sec
> > > ChaCha12-XTS-256  = 284740187 bytes/sec
> > > ChaCha20-XTS-256  = 217326865 bytes/sec
> > > XChaCha8-XTS-256  = 328424551 bytes/sec
> > > XChaCha12-XTS-256 = 278579692 bytes/sec
> > > XChaCha20-XTS-256 = 211660225 bytes/sec
> > >
> > > Optimized AES-XTS - speed like AES-CBC:
> > > AES-XTS-128       = 102841051 bytes/sec
> > > AES-XTS-256       =  80813644 bytes/sec
> > 
> > Is this from a different patch or what?  Can you talk more about this?
> 
> No patch at this moment.
> After optimizing ChaCha-XTS I applied the same optimizations to AES-XTS
> and got this result.
> All changes are in aes_xts_reinit() and aes_xts_crypt(), plus a slight change
> to the aes_xts_ctx structure.
> 
> aes_xts_ctx:
> u_int8_t tweak[] -> u_int64_t tweak[]
> 
> aes_xts_reinit -> same as chacha_xts_reinit()
> 
> aes_xts_crypt -> same as chacha_xts_crypt():
> block[] - temp buf removed;
> xor 1 byte -> xor 8 bytes at once;
> tweak[i] << 1: rotl 1 bit: 1 byte -> 8 bytes;
> unroll loops;

Ahh, I thought I had done some similar optimizations, but I only did
them to the aesni version of the routines...  You should use the macro
above to decide if things are aligned or not...

> 
> Final:
> 
> struct aes_xts_ctx {
> 	rijndael_ctx key1;
> 	rijndael_ctx key2;
> 	uint64_t tweak[(AES_XTS_BLOCKSIZE / sizeof(uint64_t))];
> };
> 
> void
> aes_xts_reinit(caddr_t key, u_int8_t *iv)
> {
> 	struct aes_xts_ctx *ctx = (struct aes_xts_ctx *)key;
> 
> 	/*
> 	 * Prepare tweak as E_k2(IV). IV is specified as LE representation
> 	 * of a 64-bit block number which we allow to be passed in directly.
> 	 */
> 	if (ALIGNED_POINTER(iv, uint64_t)) {
> 		ctx->tweak[0] = (*((uint64_t*)(void*)iv));
> 	} else {
> 		bcopy(iv, ctx->tweak, sizeof(uint64_t));
> 	}
> 	/* Convert to LE. */
> 	ctx->tweak[0] = htole64(ctx->tweak[0]);

Hmm... this line bothers me.. I'll need to spend more time reading up
to decide if it is buggy or not...  Is ctx->tweak in host order or LE
order?  I believe it's supposed to be LE order, as it gets passed
directly to _encrypt..  I'm also not sure if the original code is BE
clean, which is part of my problem...

> 	/* Last 64 bits of IV are always zero */
> 	ctx->tweak[1] = 0;
> 
> 	rijndael_encrypt(&ctx->key2, (uint8_t*)ctx->tweak,
> 	    (uint8_t*)ctx->tweak);
> }
> 
> static void
> aes_xts_crypt(struct aes_xts_ctx *ctx, u_int8_t *data, u_int do_encrypt)
> {
> 	size_t i;
> 	uint64_t crr, tm;
> 
> 	if (ALIGNED_POINTER(data, uint64_t)) {
> 		((uint64_t*)(void*)data)[0] ^= ctx->tweak[0];
> 		((uint64_t*)(void*)data)[1] ^= ctx->tweak[1];
> 	} else {
> 		for (i = 0; i < AES_XTS_BLOCKSIZE; i ++)
> 			data[i] ^= ((uint8_t*)ctx->tweak)[i];
> 	}
> 
> 	if (do_encrypt)
> 		rijndael_encrypt(&ctx->key1, data, data);
> 	else
> 		rijndael_decrypt(&ctx->key1, data, data);
> 
> 	if (ALIGNED_POINTER(data, uint64_t)) {
> 		((uint64_t*)(void*)data)[0] ^= ctx->tweak[0];
> 		((uint64_t*)(void*)data)[1] ^= ctx->tweak[1];
> 	} else {
> 		for (i = 0; i < AES_XTS_BLOCKSIZE; i ++)
> 			data[i] ^= ((uint8_t*)ctx->tweak)[i];
> 	}
> 
> 	/* Exponentiate tweak */
> 	crr = (ctx->tweak[0] >> ((sizeof(uint64_t) * 8) - 1));
> 	ctx->tweak[0] = (ctx->tweak[0] << 1);
> 
> 	tm = ctx->tweak[1];
> 	ctx->tweak[1] = ((tm << 1) | crr);
> 	crr = (tm >> ((sizeof(uint64_t) * 8) - 1));
> 
> 	if (crr)
> 		ctx->tweak[0] ^= 0x87; /* GF(2^128) generator polynomial. */

Please use the AES_XTS_ALPHA define instead of hardcoding the value..
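
i.e., the last step above becomes roughly the sketch below; the helper
name is mine, just to keep it self-contained, and the define is repeated
here only so the fragment stands on its own:

#include <stdint.h>

#define AES_XTS_ALPHA	0x87	/* GF(2^128) generator polynomial */

/* Multiply the 128-bit tweak (tweak[0] = low word) by alpha. */
static void
aes_xts_exponentiate(uint64_t tweak[2])
{
	uint64_t carry;

	carry = tweak[1] >> 63;				/* bit shifted out the top */
	tweak[1] = (tweak[1] << 1) | (tweak[0] >> 63);
	tweak[0] <<= 1;
	if (carry)
		tweak[0] ^= AES_XTS_ALPHA;
}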

Thanks.

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."


