-
Notifications
You must be signed in to change notification settings - Fork 5
Lightweight cryptography
License
arachsys/pocketcrypt
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Pocketcrypt =========== Pocketcrypt is a tiny legacy-free cryptographic library providing duplex constructions using Gimli or Xoodoo, together with X25519 for key exchange and Schnorr signatures. Eschewing interoperability with the standard museum of primitives and protocols, it offers concise, easily-understood code that avoids the ugly boilerplate and obfuscation of larger libraries. Beware! This early version of Pocketcrypt is neither formally audited nor officially released. Safely composing these low-level primitives requires cryptographic expertise. Gimli and Xoodoo are also relatively new and still under active cryptanalytic scrutiny. Please review the relevant literature and audit the implementation carefully before considering this library. Duplex ====== duplex.h is a fast, architecture-independent implementation of the Gimli and Xoodoo permutations using gcc/clang vector extensions, together with associated duplex operations. To use this header-only library, copy duplex.h into your tree. There are no link-time dependencies. Initialising and permuting state -------------------------------- Define duplex_permute as duplex_gimli or duplex_xoodoo before including duplex.h to explictly select a permutation. The default is duplex_xoodoo. Both permutations operate over a state of twelve 32-bit words, manipulated here as a vector array uint32x4_t[3]. When individual bytes of the state are required, the uint32_t words are accessed little-endian, ensuring results are independent of platform byte order. Internally, higher-level duplex operations absorb and squeeze bytes in 16-byte chunks corresponding to the permutation rate. By maintaining a byte counter alongside the duplex state, operations can also seamlessly absorb, squeeze, encrypt and decrypt partial chunks. The combined state and counter are stored as duplex_t, a type alias for uint32x4_t[4]. For convenience, duplex.h defines duplex_rate as 16 (the permutation rate) and duplex_size as sizeof(duplex_t) to help avoid magic numbers for nonce, tag and state sizes in client code. Initialise a duplex state including the byte counter with duplex_t state = { 0 }; Use duplex_counter(state) to dereference the uint64_t counter as an lvalue. To directly permute the state, use duplex_permute(state); Typically this is bundled into the higher-level operations. Absorbing and squeezing data ---------------------------- Absorb a buffer of bytes into a duplex state with duplex_absorb(state, data, length); Each rate-sized chunk is absorbed in turn, then the state is permuted before continuing. Once any final partial chunk is absorbed, the counter will advance by length ready for the next duplex operation. Squeeze data from a duplex state into a buffer with duplex_squeeze(state, data, length); Bytes are extracted from the duplex state into the buffer, permuting the state after each rate-sized chunk. Once any final partial chunk is squeezed, the counter will advance by length as with duplex_absorb(). Encryption and decryption ------------------------- To encrypt a plaintext using a duplex state, for example after absorbing a shared key and nonce, call duplex_encrypt(state, data, length); Similarly, decrypt a ciphertext using a duplex state with duplex_decrypt(state, data, length); The rate portion of the state is mixed into rate-sized chunks of the buffer in turn. Before permuting and continuing to the next chunk, the plaintext is absorbed back into the rate. (For encryption this is the original chunk; for decryption it is the updated chunk.) The counter will advance by length. To implement authenticated encryption, squeeze and append a rate-sized tag after encrypting a message and padding the state. This can then be checked against duplex_rate bytes squeezed by the recipient after decryption. Padding ------- To prevent extension attacks following an operation on variable-length data, pad the state with duplex_pad(state); after the final duplex_absorb(), duplex_encrypt() or duplex_decrypt(). This pads the final partial/empty chunk then forks the capacity part of the state before permuting it, as described in the Gimli NIST submission. The counter will advance to the next multiple of the rate. When designing protocols, compulsory fixed-length inputs that are naturally multiples of duplex_rate such as keys or nonces can safely be absorbed unpadded. However, to avoid extension attacks, variable-length messages must be padded even if they happen to be an exact multiple of the rate. Ratcheting ---------- To ensure forward secrecy during a session, call duplex_ratchet(state); This irreversibly ratchets the duplex state by zeroing the rate portion of the state and permuting, thus preventing rollback even from a completely compromised state. The counter will advance by exactly duplex_rate. Constant-time comparison ------------------------ Compare two equal-sized byte arrays in constant time with duplex_compare(a, b, length); This returns 0 for equality, -1 otherwise. If a or b is null, the other argument is compared with zero. This is useful for validating authentication tags or checking other secret data without inadvertently revealing the location of the first discrepancy through the time taken to detect it. Clearing sensitive data ----------------------- Clear sensitive data such as keys or cleartext using duplex_zero(data, length); Unlike memset() or bzero(), the compiler is forbidden to optimise this away as a 'dead store' even if the buffer is subsequently discarded or unused. Serialising duplex state ------------------------ Use duplex_byte(state, index) to dereference individual bytes of a duplex state as lvalues, where 0 <= index < duplex_size. This macro may evaluate the index argument more than once. The resulting representation is independent of host byte order: the twelve 32-bit state words are accessed in turn, followed by the 64-bit counter, each in standard little-endian order. The last eight bytes of duplex_t are serialised as a second 64-bit integer. They are unused by duplex.h and can safely be discarded. Alternatively, use duplex_extra(state) to address them as an auxiliary uint64_t lvalue. For example, to serialise state to uint8_t packed[56], use for (size_t i = 0; i < sizeof(packed); i++) packed[i] = duplex_byte(state, i); and it can later be restored from this packed form with for (size_t i = 0; i < sizeof(packed); i++) duplex_byte(state, i) = packed[i]; Similar loops can be used to store or stream serialised state directly. Implementation notes -------------------- This is a straightforward vector conversion, trivial to check against the reference Gimli and Xoodoo implementations. There is no manual unrolling of loops and architecture-specific intrinsics are not needed, but compiled with -O3 -march=native using gcc or clang on x86-64, this code is as fast as the optimised SSE3 Gimli submitted to NIST and the Keccak team's SSE2 Xoodoo in XKCP. Modern gcc and clang compile vector extensions impressively well. At the time of writing, duplex.h runs about 5% faster compiled with clang 13.0.1 than with gcc 11.2.0 on AMD Ryzen 4800U. Byte-by-byte duplex calls achieve 66% and 62% of the throughput of bulk encryption and decryption on gcc and clang respectively, rising to 99% and 96% for 16-byte operations. Even with state-of-the-art compilers, vector types are worthwhile. On the same AMD Ryzen 4800U, when rewritten as a loop over four uint32_t columns, permutation takes 30% longer with clang 13.0.1 and more than double the time with gcc 11.2.0. Similarly, a simpler byte-by-byte absorb/squeeze loop sacrifices around 20% bulk throughput with both compilers. Although duplex.h works on both little- and big-endian architectures, it will refuse to build on a mixed-endian system even if you contrive one with sufficient gcc support. Both duplex_gimli() and duplex_xoodoo() explicitly shuffle uint32x4_t rows as uint8x16_t vectors to optimise byte-aligned rotations and word-exchanges. This improves Xoodoo and Gimli throughput by 10% and 20% respectively with gcc 11.2.0, although the performance gap is much smaller on clang. For information on Gimli, see https://gimli.cr.yp.to/gimli-20170627.pdf and the documentation at https://csrc.nist.gov/projects/lightweight-cryptography where Gimli is a candidate. Sponge and duplex constructions are well-covered by https://eprint.iacr.org/2011/499.pdf including generic security analysis. For information on Xoodoo, see https://keccak.team/xoodoo.html and https://eprint.iacr.org/2018/767.pdf - but note the Pocketcrypt duplex construction is the simple one specified for Gimli rather than the more complicated Xoodyak cyclist object. X25519 ====== This X25519 implementation is adapted and extended from the elegant code in Mike Hamburg's STROBE protocol framework at https://strobe.sourceforge.io/ It is very portable but can detect and take advantage of 128-bit integer types where they are available. Prototypes for available operations are in x25519.h. Code using them must be linked against x25519.c. To use the library, copy both files into your tree. Both curve points (represented by Montgomery x-coordinates) and scalars are manipulated as 32-byte little-endian arrays. The correct-sized array type and standard base point (generator) are defined by x25519.h as typedef uint8_t x25519_t[x25519_size]; const x25519_t x25519_base = { 9 }; x25519.c assumes constant-time integer multiplication. This is valid for modern x86-64 and arm64 processors, but variable-time multiplies on some embedded platforms may introduce timing leaks. The library currently runs faster compiled with clang -O3 than with gcc -O3. For best performance on clang, build with aggressive function inlining using -mllvm -inline-threshold=5000 to obtain 30-35% faster code. However, raising the analogous -finline-limit value on gcc appears to hinder performance. Scalar multiplication --------------------- Multiply an x25519_t curve point by an x25519_t scalar with x25519(out, scalar, point); x25519() returns -1 if the product vanishes, otherwise 0. This is used to detect non-contributory behaviour as described below. Generate a key pair (sk, pk) by randomising sk then calling x25519(pk, sk, x25519_base); to calculate pk. Similarly, calculate a shared secret corresponding to sk and pk by calling x25519(key, sk, pk); x25519(key, sk1, pk2) and x25519(key, sk2, pk1) will generate the same key for all pairs (sk1, pk1) and (sk2, pk2). The computational ECDH assumption is that recovering this key with neither sk1 nor sk2 is infeasible. Shared keys have high entropy but as curve points they are not free of structure. They are safe to absorb into a duplex construction or otherwise hash to obtain unbiased bits. For an overview of X25519, see https://cr.yp.to/ecdh/curve25519-20060209.pdf and https://tools.ietf.org/html/rfc7748 sections 4.1 and 6.1. Small-subgroup confinement attacks ---------------------------------- Some protocols require that neither participant has sole control of a shared secret. However, the curve has cofactor 8 and its twist has cofactor 4, so there exist a handful of low-order torsion points which generate at most 8 distinct scalar products. An attacker might submit these as public keys. The simplest way to detect this non-contributory behaviour is to generate secret keys as multiples of eight by masking with sk[0] &= 0xf8 after randomising sk[]. Such keys annihilate any torsion component, so if x25519() returns 0, the scalar product is non-zero and the key exchange was safe. RFC 7748 itself specifies a clamping operation on both scalars and points, implemented as clamp() in test/known-x25519.c. Alas, rather than framing it as part of key generation, the RFC bundles it into key exchange. This is unfortunate as it is not well-defined modulo the base point order so doesn't preserve group structure, causing problems if (x + y)P = xP + yP is needed, such as with Schnorr signatures. It is not bundled into this x25519(). For cases where the scalar is derived rather than generated (perhaps from hierarchical key assignment) and where the group structure needs to be preserved, a cleaner option is to map it to a torsion-safe representative. Given any 32-byte scalar, use x25519_scalar(out, scalar) to cheaply calculate a representative congruent to the original modulo the order of the base point, and so whose product with valid public keys in the prime-order subgroup is unchanged, but which is also a multiple of eight to annihilate any torsion component under scalar multiplication. None of this complexity is needed for standard Diffie-Hellman exchanges, and wherever possible, it is preferable to design protocols that do not rely on contributory behaviour. Scalar inversion ---------------- Invert a scalar modulo the order of the X25519 base point with x25519_invert(out, scalar); This is typically used to remove a blinding factor from a point in oblivious pseudorandom functions. For example, given a compound scalar product rsG, further multiplying by the scalar inverse of r or s will recover sG or rG respectively. Mapping field elements to curve points -------------------------------------- For any field element, call x25519_point(out, element); to map it onto a curve point using the Elligator 2 mapping with non-square parameter u = 2. See https://elligator.cr.yp.to/elligator-20130828.pdf for more details. A point on the full curve is returned, not necessarily in the prime-order subgroup. If an attacker has control of the input element in a protocol, the earlier discussion of small-subgroup confinement might be relevant. The map is efficiently invertible and its range is around half of the points on the curve. Assuming the probability of efficiently calculating discrete logarithms for random curve points is negligible, the same is therefore true for the images of random field elements under this function. This is a building block for hash-to-curve functions. For example, if two parties already share a secret duplex state, an ephemeral key exchange can be authenticated by substituting a secret point for the standard base point. Squeeze a uniformly-distributed x25519_t field element from the shared state, map it to a curve point using this function, then pick multiples of eight as ephemeral key-exchange secrets to avoid small-subgroup attacks. Signatures ---------- STROBE-compatible X25519 Schnorr signatures are also supported. These are different from standard Ed25519 signatures, but they minimise additional code expenditure in protocols based around X25519 key exchange. To sign a 32-byte scalar challenge c with identity key pair (a, A), generate an ephemeral key pair (e, E) then call x25519_sign(response, challenge, ephemeral, identity); This calculates the scalar response s = e + ca (mod l), where l is the order of the X25519 base point. Discard the ephemeral secret e. The signature is the 64-byte pair (E, s). Given a scalar response s, scalar challenge c, ephemeral public key E and identity public key A, call x25519_verify(response, challenge, ephemeral, identity); to verify the response. This checks sG = ± E ± cA and rules out torsion points. It returns 0 for a valid signature, -1 otherwise. Schnorr challenges must hash the ephemeral public key as well as the message to be signed, because the prover must commit before the verifier challenges in the corresponding sigma protocol. Absorb the ephemeral public key on top of a duplex state before squeezing a challenge to sign that state. The signer's public identity should also be absorbed before signing unless the state is already bound to it. https://eprint.iacr.org/2015/996.pdf shows that multi-user attacks against key-prefixed Schnorr signatures are no easier than single-user attacks against unprefixed signatures. Signatures are malleable: if s' = ± s (mod l) where s is a valid response, s' is also a valid response. Similarly, a valid signature for a challenge c verifies for any c' = ± c (mod l). As with all Schnorr signatures, leaking or reusing an ephemeral secret trivially compromises the identity secret, and more generally, bias in random key generation across many signatures will leak key bits. For deterministic signatures, clone the duplex state after absorbing the message and public identity, absorb the identity secret on top then squeeze the ephemeral secret, in the style of Ed25519's hashing. Discard the cloned state and continue as before, absorbing the ephemeral public key into the original duplex and squeezing a challenge. This eliminates the risk of reusing an ephemeral key and the need for unbiased entropy during signing. See https://eprint.iacr.org/2012/309.pdf for details on STROBE signatures, and also https://eprint.iacr.org/2017/518.pdf for the qDSA scheme. Secret sharing ============== A simple constant-time implementation of Shamir's secret sharing scheme over bitsliced GF(256) is included, allowing 32-byte keys to be divided between up to 255 key-holders, then later reconstructed from a sufficiently large subset of shares. Prototypes for these operations are in shamir.h and code calling them must be linked against shamir.c. Copy both files into your tree to use them. Splitting secrets ----------------- To generate n 33-byte shares of a 32-byte secret, call shamir_split(share, index, threshold, secret, entropy); for each index = 0, 1, ..., n - 1, where threshold is the minimum number of shares that will be required to reconstruct the secret. index and threshold must not exceed 254 and 255 respectively. The scheme's security depends heavily on the block of 32 * (threshold - 1) bytes supplied as uint8_t entropy[threshold - 1][32], which is used to set random coefficients of a polynomial over GF(256). Randomise this before first calling shamir_split(), leave it unchanged for subsequent invocations, then securely discard it once all the shares are generated. Flaws in the randomness of the entropy (or leaks of it) will compromise the secret. Reconstructing secrets ---------------------- Given a quorum in uint8_t shares[][33], reconstruct the 32-byte secret with shamir_combine(secret, count, shares); where count is the number of 33-byte shares provided. This will silently generate an incorrect secret if too few shares are supplied or if they are inconsistent/corrupt. Testing and installing the library ================================== To run the test suite and basic benchmarks, use make test The easiest way to link against this library is to copy the .c and .h files directly into your source tree. However, an object library can also be built and installed along with the header files. Run make install-shared to install libpocketcrypt.so into PREFIX/lib and the pocketcrypt .h files into PREFIX/include/pocketcrypt. Similarly make install-static installs libpocketcrypt.a and the header files. The usual DESTDIR and PREFIX variables are respected, as well as CC, CFLAGS, BINDIR, INCDIR and LIBDIR for more detailed control of compilation and install paths. Copying ======= x25519.c was originally written by Mike Hamburg as part of the STROBE protocol framework, and is distributed as Free Software under the terms of the MIT license by Cryptography Research, Inc. The rest of the software (including duplex.h) and this documentation were written by Chris Webb <[email protected]> and the combined library is distributed as Free Software under the terms of the MIT license in COPYING.