lib25519: Internals

This document explains the internal structure of lib25519, and explains how to add new instruction sets and new implementations.

Primitives

The directories crypto_*/* inside lib25519 define the following primitives (see also autogen/test for Python versions of the mathematical primitives):

crypto_verify/32: crypto_verify_32(s,t) returns 0 when the 32-byte arrays s and t are equal, otherwise -1. This function takes constant time.
crypto_hashblocks/sha512: crypto_hashblocks_sha512(h,x,xlen) updates an intermediate SHA-512 hash h using all of the full 128-byte blocks at the beginning of the xlen-byte array x, and returns the number of bytes left over, namely xlen mod 128. This function takes time that depends on xlen but not on the contents of h or x.
crypto_hash/sha512: crypto_hash_sha512(h,x,xlen) computes the SHA-512 hash h of the xlen-byte array x. This function takes time that depends on xlen but not on the contents of x.
crypto_pow/inv25519: crypto_pow_inv25519(y,x) computes the 2²⁵⁵−21 power y of an integer x modulo 2²⁵⁵−19. This is the same as the inverse of x modulo 2²⁵⁵−19 if x is not divisible by 2²⁵⁵−19. Each of the integers x and y is represented as a 32-byte array in little-endian form. This function takes constant time.

This function guarantees that the output y is frozen modulo 2²⁵⁵−19, i.e., completely reduced to the range 0,1,...,2²⁵⁵−20. The caller is expected to freeze x before calling this function. The function accepts x in the range {0,1,...,2²⁵⁶−1} while ignoring the top bit (the coefficient of 2²⁵⁵ in binary): i.e., the function reduces x modulo 2²⁵⁵ and then modulo 2²⁵⁵−19.
crypto_powbatch/inv25519: crypto_powbatch_inv25519(y,x,batch) is equivalent to (but can be faster than) batch separate calls to crypto_pow_inv25519. Each of y and x is an array containing 32*batch bytes, with the first 32 bytes of y computed from the first 32 bytes of x, the next 32 bytes of y computed from the next 32 bytes of x, etc. This function takes time that depends on batch but not on the other inputs.
crypto_nP/montgomery25519: crypto_nP_montgomery25519(nP,n,P) computes the X25519 function: in short, if a Curve25519 point has x-coordinate P then the nth multiple of the point has x-coordinate nP. The inputs and outputs are represented as 32-byte arrays in little-endian form. This function takes constant time.

X25519 is defined for n in the range 2²⁵⁴ + 8{0,1,2,3,...,2²⁵¹−1}. crypto_nP_montgomery25519 allows n in the wider range {0,1,...,2²⁵⁶−1}, and in all cases computes mth multiples where m is defined as follows: make a copy of n, clear the top bit, set the next bit, and clear the bottom three bits.

X25519 guarantees that the output nP is frozen. It does not require the input to be frozen; also, it allows the input to be on the twist, and to have small order.

crypto_nP_montgomery25519 clears the top bit of P before applying the X25519 function. Callers that want the X25519 function on P with the top bit set have to reduce modulo 2²⁵⁵−19 for themselves.
crypto_nPbatch/montgomery25519: crypto_nPbatch_montgomery25519(nP,n,P,batch) is equivalent to (but can be faster than) batch separate calls to crypto_nP_montgomery25519. Each of nP, n, and P is an array containing 32*batch bytes, with the first 32 bytes of nP computed from the first 32 bytes of n and the first 32 bytes of P, the next 32 bytes of nP computed from the next 32 bytes of n and the next 32 bytes of P, etc. This function takes time that depends on batch but not on the other inputs.
crypto_nG/merged25519: crypto_nG_merged25519(nG,n) reads an integer n in the range {0,1,...,2²⁵⁶−1} and outputs a frozen integer nG modulo 2²⁵⁵−19, possibly with the top bit set (i.e., adding 2²⁵⁵) as described below. Both n and nG are represented as 32-byte arrays in little-endian form. This function takes constant time.

If the top bit of n is clear then nG is the Edwards y-coordinate of the nth multiple of G, and the top bit is set exactly when the Edwards x-coordinate is odd. Otherwise nG is the Montgomery x-coordinate of the (n−2²⁵⁵)th multiple of G, and the top bit is clear. Here G is the standard Curve25519 base point, which has Montgomery x-coordinate 9, Edwards y-coordinate 4/5, and even Edwards x-coordinate.
crypto_nG/montgomery25519: crypto_nG_montgomery25519(nG,n) is the same as crypto_nP_montgomery(nG,n,G) where G is the array {9,0,0,...,0}. This function takes constant time.

The point of crypto_nG is to save time (using a small table precomputed from G) compared to the more general crypto_nP. This has the disadvantage of being more complicated, which is particularly important given that lib25519 has not yet been verified, and in any case increases code size noticeably for X25519. There is a ref implementation of crypto_nG that simply calls crypto_nP, and setting sticky bits on the other implementation directories (chmod +t crypto_nG/montgomery25519/*; chmod -t crypto_nG/montgomery25519/ref) will force lib25519 to use ref.
crypto_mGnP/ed25519: crypto_mGnP_ed25519(mGnP,m,n,P) computes (m mod L)G−(n mod L)P in Edwards coordinates, where L is the prime number 2²⁵²+27742317777372353535851937790883648493 and G is the same standard base point. This function takes time that depends on the inputs.

The input m is an integer in the range {0,1,...,2²⁵⁶−1} represented as a 32-byte array in little-endian form. Any m outside the range {0,1,...,L−1} triggers a failure, which is reported as described below.

The input n is an integer in the range {0,1,...,2⁵¹²−1} represented as a 64-byte array in little-endian form.

The input point P is represented as a 32-byte array as follows: the (frozen) Edwards y-coordinate of P in {0,1,...,2²⁵⁵−20} is stored in little-endian form, and then the top bit is set exactly when the (frozen) Edwards x-coordinate of P is odd. An input 32-byte array that does not have this form is instead interpreted as the negative of the point P with Edwards coordinates (...8,26), and triggers a failure, reported as described below.

The output is a 33-byte array. The first 32 bytes are the output point (m mod L)G−(n mod L)P, represented the same way as P. The last byte is 1 on success and 0 on failure.
crypto_multiscalar/ed25519: crypto_multiscalar_ed25519(Q,n,P,len) computes (n[0] mod L)P[0]+(n[1] mod L)P[1]+...+(n[len-1] mod L)P[len-1] where L is the same prime number as for crypto_mGnP/ed25519. This function takes time that depends on the inputs.

The input n is a 32*len-byte array, viewed as a concatenation of len 32-byte arrays. The 32-byte arrays are little-endian representations of integers n[0], n[1], ..., n[len-1] in the range {0,1,...,2²⁵⁶−1}. Any n[j] outside the range {0,1,...,L−1} triggers a failure, which is reported as described below.

The input P is a 32*len-byte array, viewed as a concatenation of len 32-byte arrays. The 32-byte arrays are points P[0], P[1], ..., P[len-1], each represented as follows: the (frozen) Edwards y-coordinate of P[j] in {0,1,...,2²⁵⁵−20} is stored in little-endian form, and then the top bit is set exactly when the (frozen) Edwards x-coordinate of P[j] is odd. An input 32-byte array that does not have this form is instead interpreted as the point P with Edwards coordinates (...8,26), and triggers a failure, reported as described below.

The output Q is a 33-byte array. The first 32 bytes are the output point (n[0] mod L)P[0]+(n[1] mod L)P[1]+...+(n[len-1] mod L)P[len-1], represented the same way as each P[j]. The last byte is 1 on success and 0 on failure.
crypto_dh/x25519: crypto_dh_x25519_keypair(pk,sk) generates a 32-byte X25519 public key pk and the corresponding 32-byte secret key sk. This function is the composition of randombytes to generate sk and crypto_nG_montgomery25519 to generate pk.

crypto_dh_x25519(k,pk,sk) generates a 32-byte shared secret k given a public key pk and a secret key sk. This function is the same as crypto_nP_montgomery25519.
crypto_sign/ed25519: crypto_sign_ed25519_keypair(pk,sk) generates a 32-byte Ed25519 public key pk and the corresponding 64-byte secret key sk. This function takes constant time.

crypto_sign_ed25519(sm,&smlen,m,mlen,sk) generates an smlen-byte signed message sm given an mlen-byte message m and a secret key sk. The caller is required to allocate mlen+64 bytes for sm. The function always sets smlen to mlen+64. This function takes time that depends on mlen but not on the other inputs.

crypto_sign_ed25519_open(m,&mlen,sm,smlen,pk) generates an mlen-byte message m given an smlen-byte signed message sm and a public key pk, and returns 0. However, if sm is invalid, this function returns -1, sets mlen to -1, and clears m. The caller is required to allocate smlen (not just smlen-64) bytes for m, for example using the same array for sm and m. This function takes time that depends on its inputs.

lib25519 includes a command-line utility lib25519-test that runs some tests for each of these primitives, and another utility lib25519-speed that measures cycle counts for each of these primitives.

The stable lib25519 API functions are built from the above primitives:

lib25519_dh_keypair is crypto_dh_x25519_keypair.
lib25519_dh is crypto_dh_x25519.
lib25519_sign_keypair is crypto_sign_ed25519_keypair.
lib25519_sign is crypto_sign_ed25519.
lib25519_sign_open is crypto_sign_ed25519_open.

Some changes are anticipated in the list of primitives, but these API functions will remain stable.

As in SUPERCOP and NaCl, message lengths intentionally use long long, not size_t. In lib25519, message lengths are signed.

Implementations

A single primitive can, and usually does, have multiple implementations. Each implementation is in its own subdirectory. The implementations are required to have exactly the same input-output behavior, and to some extent this is tested, although it is not yet formally verified.

Different implementations typically offer different tradeoffs between portability, simplicity, and efficiency. For example, crypto_nP/montgomery25519/ref10 is portable; crypto_nP/montgomery25519/amd64-maax is faster and less portable.

Each unportable implementation has an architectures file. Each line in this file identifies a CPU instruction set (and ABI) where the implementation works. For example, crypto_nP/montgomery25519/amd64-maax/architectures has one line amd64 bmi2 adx, meaning that the implementation works on CPUs that have the Intel/AMD 64-bit instruction set with the BMI2 and ADX instruction-set extensions. The top-level compilers directory shows (among other things) the allowed instruction-set names such as bmi2.

At run time, lib25519 checks the CPU where it is running, and selects an implementation where architectures is compatible with that CPU. Each primitive makes its own selection once per program startup, using the compiler's ifunc mechanism (or constructor on platforms that do not support ifunc). This type of run-time selection means, for example, that an amd64 CPU without AVX2 can share binaries with an amd64 CPU with AVX2. However, correctness requires instruction sets to be preserved by migration across cores via the OS kernel, VM migration, etc.

The compiler has a target mechanism that makes an ifunc selection based on CPU architectures. Instead of using the target mechanism, lib25519 uses a more sophisticated mechanism that also accounts for benchmarks collected in advance of compilation.

Compilers

lib25519 tries different C compilers for each implementation. For example, compilers/default lists the following compilers:

gcc -Wall -fPIC -fwrapv -O2
clang -Wall -fPIC -fwrapv -Qunused-arguments -O2

Sometimes gcc produces better code, and sometimes clang produces better code.

As another example, compilers/amd64+sse3+ssse3+sse41+sse42+avx lists the following compilers:

gcc -Wall -fPIC -fwrapv -O2 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mtune=sandybridge
clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mtune=sandybridge

The -mavx option tells these compilers that they are free to use the AVX instruction-set extension.

Code compiled using the compilers in compilers/amd64+sse3+ssse3+sse41+sse42+avx will be considered at run time by the lib25519 selection mechanism if the supports() function in compilers/amd64+sse3+ssse3+sse41+sse42+avx.c returns nonzero. This function checks whether the run-time CPU supports AVX (and SSE and so on, and OSXSAVE with XMM/YMM being saved; https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100 says that all versions of gcc until 2018 handled this incorrectly in target). Similar comments apply to other compilers/* files.

If some compilers fail (for example, clang is not installed, or the compiler version is too old to support the compiler options used in lib25519), the lib25519 compilation process will try its best to produce a working library using the remaining compilers, even if this means lower performance.

Trimming

By default, to reduce size of the compiled library, the lib25519 compilation process trims the library down to the implementations that are selected by lib25519's selection mechanism (across all CPUs; the library remains portable, not tied to the compilation CPU).

This trimming is handled at link time rather than compile time to increase the chance that, even if some implementations are broken by compiler "upgrades", the library will continue to build successfully.

To avoid this trimming, pass the --no-trim option to ./configure. All implementations that compile are then included in the library, tested by lib25519-test, and measured by lib25519-speed. You'll want to avoid trimming if you're adding new instruction sets or new implementations (see below), so that you can run tests and benchmarks of code that isn't selected yet.

How to recompile after changes

If you make changes in the lib25519 source directory, the fully supported recompilation mechanism is to run ./configure again to clean and repopulate the build directory, and then run make again to recompile everything.

This can be on the scale of seconds if you have enough cores, but maybe you're developing on a slower machine. Three options are currently available to accelerate the edit-compile cycle:

There is an experimental --no-clean option to ./configure that, for some simple types of changes, can produce a successful build without cleaning.
Running make without ./configure can work for some particularly simple types of changes. However, not all dependencies are currently expressed in Makefile, and some types of dependencies that ./configure understands would be difficult to express in the Makefile language.
You can disable the implementations you're not using by setting sticky bits on the source directories for those implementations: e.g., chmod +t crypto_nG/*/*avx2*.

Make sure to reenable all implementations and do a full clean build if you're collecting data to add to the source benchmarks directory.

How to add new instruction sets

Adding another file compilers/amd64+foo, along with a supports() implementation in compilers/amd64+foo.c, will support a new instruction set. Do not assume that the new foo instruction set implies support for older instruction sets (the idea of "levels" of instruction sets); instead make sure to include the older instruction sets in + tags, as illustrated by compilers/amd64+sse3+ssse3+sse41+sse42+avx.

In the compiler options, always make sure to include -fPIC to support shared libraries, and -fwrapv to switch to a slightly less dangerous version of C.

The foo tags don't have to be instruction sets. For example, if a CPU has the same instruction set but wants different optimizations because of differences in instruction timings, you can make a tag for those optimizations, using, e.g., CPU IDs or benchmarks in the corresponding supports() function to decide whether to enable those optimizations. Benchmarks tend to be more future-proof than a list of CPU IDs, but the time taken for benchmarks at program startup has to be weighed against the subsequent speedup from the resulting optimizations.

To see how well lib25519 performs with the new compilers, run lib25519-speed on the target machine and look for the foo lines in the output. If the new performance is better than the performance shown on the selected lines:

Copy the lib25519-speed output into a file on the benchmarks directory, typically named after the hostname of the target machine.
Run ./prioritize in the top-level directory to create priority files. These files tell lib25519 which implementations to select for any given architecture.
Reconfigure (again with --no-trim), recompile, rerun lib25519-test, and rerun lib25519-speed to check that the selected lines now use the foo compiler.

If the foo implementation is outperformed by other implementations, then these steps don't help except for documenting this fact. The same implementation might turn out to be useful for subsequent foo CPUs.

How to add new implementations

Taking full advantage of the foo instruction set usually requires writing new implementations. Sometimes there are also ideas for taking better advantage of existing instruction sets.

Structurally, adding a new implementation of a primitive is a simple matter of adding a new subdirectory with the code for that implementation. Most of the work is optimizing the use of foo intrinsics in .c files or foo instructions in .S files. Make sure to include an architectures file saying, e.g., amd64 avx2 foo.

Names of implementation directories can use letters, digits, dashes, and underscores. Do not use two implementation names that are the same when dashes and underscores are removed.

All .c and .S files in the implementation directory are compiled and linked. There is no need to edit a separate list of these files. You can also use .h files via the C preprocessor.

If an implementation is actually more restrictive than indicated in architectures then the resulting compiled library will fail on some machines (although perhaps that implementation will not be used by default). Putting unnecessary restrictions into architectures will not create such failures, but can unnecessarily limit performance.

Some, but not all, mistakes in architectures will produce warnings from the checkinsns script that runs automatically when lib25519 is compiled. Running the lib25519-test program tries all implementations, but only on the CPU where lib25519-test is being run, and lib25519-test does not guarantee code coverage: for example, other message lengths being signed could involve other code paths.

amd64 implies little-endian, and implies architectural support for unaligned loads and stores. Beware, however, that the Intel/AMD vectorized load/store intrinsics (and the underlying movdqa instruction) require alignment; if in doubt, use loadu/storeu (and movdqu). The lib25519-test program checks unaligned inputs and outputs, but can miss issues with unaligned stack variables.

To test your implementation, compile everything, check for compiler warnings and errors, run lib25519-test (or just lib25519-test nG to test a crypto_nG implementation), and check for a line saying all tests succeeded. To use AddressSanitizer (for catching, at run time, buffer overflows in C code), add -fsanitize=address to the gcc and clang lines in compilers/*; you may also have to add return; at the beginning of the limits() function in command/limits.inc.

To see the performance of your implementation, run lib25519-speed. If the new performance is better than the performance shown on the selected lines, follow the same steps as for a new instruction set: copy the lib25519-speed output into a file on the benchmarks directory; run ./prioritize in the top-level directory to create priority files; reconfigure (again with --no-trim); recompile; rerun lib25519-test; rerun lib25519-speed; check that the selected lines now use the new implementation.

How to handle namespacing

As in SUPERCOP and NaCl, to call crypto_hash_sha512(), you have to include crypto_hash_sha512.h; but to write an implementation of crypto_hash_sha512(), you have to instead include crypto_hash.h and define crypto_hash. Similar comments apply to other primitives.

The function name that's actually linked might end up as, e.g., lib25519_hash_sha512_blocksplusavx_C2_hash where blocksplusavx indicates the implementation and C2 indicates the compiler. Don't try to build this name into your implementation.

If you have another global symbol x (for example, a non-static function in a .c file, or a non-static variable outside functions in a .c file), you have to replace it with CRYPTO_NAMESPACE(x), for example with #define x CRYPTO_NAMESPACE(x).

For global symbols in .S files and shared-*.c files, use CRYPTO_SHARED_NAMESPACE instead of CRYPTO_NAMESPACE. For .S files that define both x and _x to handle platforms where x in C is _x in assembly, use CRYPTO_SHARED_NAMESPACE(x) and _CRYPTO_SHARED_NAMESPACE(x); CRYPTO_SHARED_NAMESPACE(_x) is not sufficient.

lib25519 includes a mechanism to recognize files that are copied across implementations (possibly of different primitives) and to unify those into a file compiled only once, reducing the overall size of the compiled library and possibly improving cache utilization. To request this mechanism, include a line

// linker define x

for any global symbol x defined in the file, and a line

// linker use x

for any global symbol x used in the file from the same implementation (not crypto_* subroutines that you're calling, randombytes, etc.). This mechanism tries very hard, perhaps too hard, to avoid improperly unifying files: for example, even a slight difference in a .h file included by a file defining a used symbol will disable the mechanism.

Typical namespacing mistakes will produce either linker failures or warnings from the checknamespace script that runs automatically when lib25519 is compiled.

Version: This is version 2024.09.26 of the "Internals" web page.