lib25519 draws on many previous implementations listed below, plus new
speedups from Kaushik Nath, and new infrastructure and factoring work
from Daniel J. Bernstein. All software is in the public domain. Since
some organizations require licenses, lib25519 is also CC0-licensed.

Some code was originally copied from public-domain code in the SUPERCOP
benchmarking framework. See https://bench.cr.yp.to/supercop.html for
SUPERCOP releases. The following small changes from code available in
SUPERCOP are made in lib25519 without further comment:

   * Returning void rather than int for functions that never fail in
     lib25519.
   * Using long long rather than unsigned long long for message lengths.
   * Defining various constants in .c files instead of .S files (to
     simplify PIC handling).
   * Moving some C files to shared-*.c (which in lib25519 means that
     these files are compiled by only one compiler).
   * Using CRYPTO_SHARED_NAMESPACE rather than CRYPTO_NAMESPACE for
     symbols defined in *.S and shared-*.c.
   * Adding various "linker define" and "linker use" lines.

Larger changes from code in SUPERCOP, such as new code divisions across
lib25519 primitives, are indicated below.

Sources of Curve25519 software (this is not a comprehensive list, just
the software that lib25519 is derived from):

   * Daniel J. Bernstein. "Curve25519: new Diffie-Hellman speed
     records." Pages 207–228 in Public key cryptography—PKC 2006, 9th
     international conference on theory and practice in public-key
     cryptography, New York, NY, USA, April 24–26, 2006, proceedings,
     edited by Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, Tal Malkin,
     Lecture Notes in Computer Science 3958, Springer, 2006, ISBN
     3-540-33851-9.

     This is the source of the Curve25519 design, the X25519 design, and
     various speedups. Most of the software from that paper is specific
     to a variety of 32-bit platforms (radix 2^25.5 or radix 2^21.25),
     but the portable supercop/crypto_scalarmult/curve25519/ref10 (radix
     2^25.5) is derived from this.

     lib25519/crypto_nP/montgomery25519/ref10 starts from
     supercop/crypto_scalarmult/curve25519/ref10, and tweaks the API to
     provide crypto_nP instead of crypto_scalarmult. Inversion is
     factored out, producing crypto_pow/inv25519/ref10. The trivial
     crypto_scalarmult_base wrapper is factored out, producing
     crypto_nG/montgomery25519/ref/base.c; lib25519 has faster nG
     functions, but intentionally provides ref for situations where
     speed is outweighed by simplicity, assurance, code size, etc.

   * supercop/crypto_scalarmult/curve25519/donna_c64 (radix 2^51) from
     Adam Langley.

     lib25519/crypto_nP/montgomery25519/donna_c64 starts from this and
     tweaks the API to provide crypto_nP instead of crypto_scalarmult
     (and removes crypto_scalarmult_base). crypto_pow/inv25519/donna_c64
     is factored out.

   * Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, Bo-Yin
     Yang. "High-speed high-security signatures." Pages 124–142 in
     Cryptographic hardware and embedded systems—CHES 2011, 13th
     international workshop, Nara, Japan, September 28–October 1, 2011,
     proceedings, edited by Bart Preneel, Tsuyoshi Takagi, Lecture Notes
     in Computer Science 6917, Springer, 2011, ISBN 978-3-642-23950-2.
     Journal version: Journal of Cryptographic Engineering 2 (2012),
     77–89.

     This is the source of the Ed25519 design and various X25519/Ed25519
     speedups for 64-bit Intel/AMD platforms, in particular producing
     supercop/crypto_{scalarmult/curve,sign/ed}25519/amd64-{51,64}*
     (radix 2^51 and radix 2^64 respectively). Peter Schwabe led the
     implementation work.

     lib25519/crypto_nP/montgomery25519/amd64-51 starts from
     supercop/crypto_scalarmult/curve25519/amd64-51 and tweaks the API
     to provide crypto_nP instead of crypto_scalarmult (and removes
     crypto_scalarmult_base). crypto_nG/merged25519/amd64-51 (for
     fixed-base-point multiplication), crypto_mGnP/ed25519/amd64-51 (for
     double-scalar multiplication), and crypto_sign/ed25519/amd64 (for
     the remaining signing components) factor
     supercop/crypto_sign/ed25519/amd64-51 into smaller pieces.
     crypto_pow/inv25519/amd64-51 is also factored out. "SMALLTABLES"
     support is removed. Support for batch verification is removed,
     although it could reappear in a subsequent lib25519 release.

     Similar comments apply to amd64-64 and ref10. A compiler warning
     is eliminated (window size 64 in amd64-64-24k/sc25519.h).

   * Tung Chou. "Sandy2x: New Curve25519 Speed Records." SAC 2015.

     This is the source of various X25519 speedups using 128-bit vector
     instructions, specifically AVX vector instructions in Intel's Sandy
     Bridge, in particular producing
     supercop/crypto_scalarmult/curve25519/sandy2x (radix 2^25.5).

     lib25519/crypto_{nP,nG}/montgomery25519/sandy2x start from
     supercop/crypto_scalarmult/curve25519/sandy2x, and tweak the API to
     provide crypto_nP and crypto_nG instead of crypto_scalarmult and
     crypto_scalarmult_base respectively. The top bit of the incoming
     point is cleared. crypto_pow/inv25519/sandy2x is factored out.

   * Kaushik Nath and Palash Sarkar, "Efficient arithmetic in
     (pseudo-)Mersenne prime order fields", Advances in Mathematics of
     Communications 16 (2022), pages 303–348.

     Original release:
     https://github.com/kn-cs/pmp-farith/tree/master/gf-p2-255-19/SL
     https://github.com/kn-cs/pmp-farith/tree/master/gf-p2-255-19/USL1

     The "SL" software is the source of various speedups to the amd64-64
     software, producing the "maa4" versions of fe25519_mul.S,
     fe25519_square.S, and fe25519_nsquare.S. These .S files are used
     inside the following lib25519 directories:
     crypto_mGnP/ed25519/amd64-avx2-10l-maa4
     crypto_mGnP/ed25519/amd64-avx2-9l-maa4
     crypto_mGnP/ed25519/amd64-maa4
     crypto_nG/merged25519/amd64-avx2-10l-maa4
     crypto_nG/merged25519/amd64-avx2-9l-maa4
     crypto_nG/merged25519/amd64-maa4
     crypto_nP/montgomery25519/amd64-avx2-hey10l-maa4
     crypto_nP/montgomery25519/amd64-avx2-hey9l-maa4
     crypto_nP/montgomery25519/amd64-avx2-ns10l-maa4
     crypto_nP/montgomery25519/amd64-avx2-ns9l-maa4
     crypto_nP/montgomery25519/amd64-maa4
     crypto_pow/inv25519/amd64-maa4

     The "USL" software is the source of various speedups to the
     amd64-51 software, producing the "maa5" versions of fe25519_mul.S
     and fe25519_nsquare.S. These .S files are used inside the following
     lib25519 directories:
     crypto_nP/montgomery25519/amd64-avx2-hey10l-maa5
     crypto_nP/montgomery25519/amd64-avx2-hey9l-maa5
     crypto_nP/montgomery25519/amd64-avx2-ns10l-maa5
     crypto_nP/montgomery25519/amd64-avx2-ns9l-maa5
     crypto_pow/inv25519/amd64-maa5

   * Kaushik Nath and Palash Sarkar, "Security and efficiency trade-offs
     for elliptic curve Diffie-Hellman at the 128-bit and 224-bit
     security levels." J. Cryptogr. Eng. 12(1): 107-121 (2022).

     Original release:
     https://github.com/kn-cs/x25519/tree/master/intel64-mxaa-4limb
     https://github.com/kn-cs/x25519

     This "mxaa-4limb" software is the source of various speedups to
     "maa4" on CPUs supporting BMI2 instructions (e.g., Intel Haswell
     from 2013), in particular producing the "mxaa" versions of
     fe25519_mul.S and fe25519_nsquare.S. These .S files are used inside
     the following lib25519 directories:
     crypto_mGnP/ed25519/amd64-avx2-10l-mxaa
     crypto_mGnP/ed25519/amd64-avx2-9l-mxaa
     crypto_mGnP/ed25519/amd64-mxaa
     crypto_nG/merged25519/amd64-avx2-10l-mxaa
     crypto_nG/merged25519/amd64-avx2-9l-mxaa
     crypto_nG/merged25519/amd64-mxaa
     crypto_nP/montgomery25519/amd64-avx2-hey10l-mxaa
     crypto_nP/montgomery25519/amd64-avx2-hey9l-mxaa
     crypto_nP/montgomery25519/amd64-avx2-ns10l-mxaa
     crypto_nP/montgomery25519/amd64-avx2-ns9l-mxaa
     crypto_nP/montgomery25519/amd64-mxaa
     crypto_pow/inv25519/amd64-mxaa

     This software is also the source of the following three different
     Montgomery-ladder functions, where the third also builds on the
     "maax" work listed below:
     crypto_nP/montgomery25519/amd64-maa4/mladder.S
     crypto_nP/montgomery25519/amd64-mxaa/mladder.S
     crypto_nP/montgomery25519/amd64-maax/mladder.S

   * Kaushik Nath and Palash Sarkar, "Efficient arithmetic in
     (pseudo-)Mersenne prime order fields", Advances in Mathematics of
     Communications 16 (2022), pages 303–348. Original release:
     https://github.com/kn-cs/pmp-farith/tree/master/gf-p2-255-19/SLDCC

     This is the source of various speedups to "mxaa" on CPUs that also
     support ADX instructions (e.g., Intel Broadwell from 2014), in
     particular producing the "maax" versions of fe25519_mul.S,
     fe25519_square.S, and fe25519_nsquare.S. These .S files are used
     inside the following lib25519 directories:
     crypto_mGnP/ed25519/amd64-avx2-10l-maax
     crypto_mGnP/ed25519/amd64-avx2-9l-maax
     crypto_mGnP/ed25519/amd64-avx512ifma-5l-maax
     crypto_mGnP/ed25519/amd64-maax
     crypto_nG/merged25519/amd64-avx2-10l-maax
     crypto_nG/merged25519/amd64-avx2-9l-maax
     crypto_nG/merged25519/amd64-avx512ifma-5l-maax
     crypto_nG/merged25519/amd64-maax
     crypto_nP/montgomery25519/amd64-avx2-hey10l-maax
     crypto_nP/montgomery25519/amd64-avx2-hey9l-maax
     crypto_nP/montgomery25519/amd64-avx2-ns10l-maax
     crypto_nP/montgomery25519/amd64-avx2-ns9l-maax
     crypto_nP/montgomery25519/amd64-avx512-hey10l-maax
     crypto_nP/montgomery25519/amd64-avx512-hey9l-maax
     crypto_nP/montgomery25519/amd64-avx512-ns10l-maax
     crypto_nP/montgomery25519/amd64-avx512-ns9l-maax
     crypto_nP/montgomery25519/amd64-avx512ifma-hey5l-maax
     crypto_nP/montgomery25519/amd64-avx512ifma-ns5l-maax
     crypto_nP/montgomery25519/amd64-maax
     crypto_pow/inv25519/amd64-maax

   * Kaushik Nath and Palash Sarkar, "Efficient 4-Way Vectorizations of
     the Montgomery Ladder". IEEE Trans. Computers 71(3): 712-723
     (2022). Original release:
     https://github.com/kn-cs/vec-ladder/tree/master/Curve25519

     This is the source of the "hey10l" (radix 2^25.5), "hey9l" (radix
     2^29), "ns10l" (radix 2^25.5), and "ns9l" (radix 2^29) versions of
     mladder.S for CPUs that also support 256-bit AVX2 instructions
     (e.g., Intel Haswell from 2013). In lib25519, these four .S files
     are used in 16 directories:
     crypto_nP/montgomery25519/amd64-avx2-hey10l-{maa4,maa5,maax,mxaa}
     crypto_nP/montgomery25519/amd64-avx2-hey9l-{maa4,maa5,maax,mxaa}
     crypto_nP/montgomery25519/amd64-avx2-ns10l-{maa4,maa5,maax,mxaa}
     crypto_nP/montgomery25519/amd64-avx2-ns9l-{maa4,maa5,maax,mxaa}

   * Kaushik Nath, Montgomery-ladder code new in lib25519 (no paper yet)
     for CPUs supporting AVX-512 instructions (e.g., Intel Skylake-X from
     2017). This code is used in six lib25519 directories:
     crypto_nP/montgomery25519/amd64-avx512-hey10l-maax
     crypto_nP/montgomery25519/amd64-avx512-hey9l-maax
     crypto_nP/montgomery25519/amd64-avx512-ns10l-maax
     crypto_nP/montgomery25519/amd64-avx512-ns9l-maax
     crypto_nP/montgomery25519/amd64-avx512ifma-hey5l-maax
     crypto_nP/montgomery25519/amd64-avx512ifma-ns5l-maax

   * Kaushik Nath, ten versions of fixed-base-point
     scalar-multiplication code new in lib25519 (no paper yet) for
     various platforms:
     crypto_nG/merged25519/amd64-avx2-10l-maa4/ge25519_base.S
     crypto_nG/merged25519/amd64-avx2-10l-maax/ge25519_base.S
     crypto_nG/merged25519/amd64-avx2-10l-mxaa/ge25519_base.S
     crypto_nG/merged25519/amd64-avx2-9l-maa4/ge25519_base.S
     crypto_nG/merged25519/amd64-avx2-9l-maax/ge25519_base.S
     crypto_nG/merged25519/amd64-avx2-9l-mxaa/ge25519_base.S
     crypto_nG/merged25519/amd64-avx512ifma-5l-maax/ge25519_base.S
     crypto_nG/merged25519/amd64-maa4/ge25519_base.S
     crypto_nG/merged25519/amd64-maax/ge25519_base.S
     crypto_nG/merged25519/amd64-mxaa/ge25519_base.S

   * Kaushik Nath, ten versions of double-scalar-multiplication code new
     in lib25519 (no paper yet) for various platforms. Each version
     consists of ge25519_double_scalarmult_precompute.S and
     ge25519_double_scalarmult_process.S:

     crypto_mGnP/ed25519/amd64-avx2-10l-maa4/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-avx2-10l-maax/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-avx2-10l-mxaa/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-avx2-9l-maa4/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-avx2-9l-maax/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-avx2-9l-mxaa/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-avx512ifma-5l-maax/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-maa4/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-maax/ge25519_double_scalarmult_precompute.S
     crypto_mGnP/ed25519/amd64-mxaa/ge25519_double_scalarmult_precompute.S

     crypto_mGnP/ed25519/amd64-avx2-10l-maa4/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-avx2-10l-maax/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-avx2-10l-mxaa/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-avx2-9l-maa4/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-avx2-9l-maax/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-avx2-9l-mxaa/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-avx512ifma-5l-maax/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-maa4/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-maax/ge25519_double_scalarmult_process.S
     crypto_mGnP/ed25519/amd64-mxaa/ge25519_double_scalarmult_process.S
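
Several entries above describe field elements modulo p = 2^255 - 19 in
radix 2^25.5 (for example ref10, sandy2x, and the hey10l/ns10l ladders):
ten limbs that alternate between 26 and 25 bits, with limb i starting at
bit ceil(25.5 i). The C sketch below shows that layout with a
deliberately simple bit-at-a-time loop. The helper names
radix25_5_unpack and radix25_5_pack are hypothetical and do not come
from lib25519. Because the ten limbs cover only bits 0 through 254, this
unpacking ignores bit 255 of the input, which has the same effect as
clearing the top bit.

```c
#include <stdint.h>

/* Hypothetical illustration, not lib25519 code: limb i of a radix-2^25.5
   element covers bits limb_start[i] .. limb_start[i+1]-1, so widths
   alternate 26, 25, 26, 25, ... and bit 255 is never used. */
static const int limb_start[11] = {0, 26, 51, 77, 102, 128, 153, 179, 204, 230, 255};

/* Return bit number pos of the little-endian byte string s. */
static uint32_t get_bit(const unsigned char s[32], int pos)
{
  return (uint32_t) ((s[pos >> 3] >> (pos & 7)) & 1);
}

/* Split 32 little-endian bytes into ten limbs; bit 255 is ignored. */
void radix25_5_unpack(int32_t h[10], const unsigned char s[32])
{
  for (int i = 0; i < 10; i++) {
    uint32_t limb = 0;
    /* Walk from the limb's highest bit down so bit pos lands at
       position pos - limb_start[i] of the limb. */
    for (int pos = limb_start[i + 1] - 1; pos >= limb_start[i]; pos--)
      limb = (limb << 1) | get_bit(s, pos);
    h[i] = (int32_t) limb;
  }
}

/* Recombine limbs (each within its 26- or 25-bit width) into 32 bytes;
   bit 255 of the output is always 0. */
void radix25_5_pack(unsigned char s[32], const int32_t h[10])
{
  for (int k = 0; k < 32; k++) s[k] = 0;
  for (int i = 0; i < 10; i++)
    for (int pos = limb_start[i]; pos < limb_start[i + 1]; pos++)
      if ((h[i] >> (pos - limb_start[i])) & 1)
        s[pos >> 3] |= (unsigned char) (1u << (pos & 7));
}
```

Production code such as ref10 builds the limbs from multi-byte loads and
shifts rather than a per-bit loop, and arithmetic routines may let limbs
temporarily exceed these widths (or become negative) between carry
steps.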
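
The radix-2^51 entries above (donna_c64 and amd64-51) instead use five
64-bit limbs of 51 bits each. The key identity for multiplication is
2^255 = 19 (mod p), so a partial product that lands at or beyond 2^255
can be folded back to the bottom after multiplying by 19. The following
C sketch, with the hypothetical name fe51_mul, illustrates this
schoolbook approach. It assumes a compiler providing unsigned __int128
(for example GCC or Clang on 64-bit targets) and is not copied from any
of the implementations listed above.

```c
#include <stdint.h>

/* Hypothetical illustration, not lib25519 code: value of f is
   f[0] + f[1]*2^51 + f[2]*2^102 + f[3]*2^153 + f[4]*2^204 (mod 2^255-19). */
typedef unsigned __int128 u128;
#define MASK51 ((((uint64_t) 1) << 51) - 1)

/* h = f*g mod p, assuming every input limb is below 2^52.
   Output limbs are again below 2^52, so results can be fed back in. */
void fe51_mul(uint64_t h[5], const uint64_t f[5], const uint64_t g[5])
{
  /* A term f[i]*g[j] with i+j >= 5 has weight 2^255 * 2^(51*(i+j-5)),
     i.e. 19 * 2^(51*(i+j-5)), so it uses 19*g[j] in column i+j-5. */
  uint64_t g19[5];
  for (int i = 0; i < 5; i++) g19[i] = 19 * g[i];

  u128 t[5];
  for (int k = 0; k < 5; k++) {
    t[k] = 0;
    for (int i = 0; i < 5; i++) {
      int j = k - i;
      if (j >= 0) t[k] += (u128) f[i] * g[j];       /* column k directly */
      else        t[k] += (u128) f[i] * g19[j + 5]; /* wrapped from k+5 */
    }
  }

  /* Carry each 51-bit column into the next. */
  uint64_t carry = 0;
  for (int k = 0; k < 5; k++) {
    t[k] += carry;
    h[k] = (uint64_t) t[k] & MASK51;
    carry = (uint64_t) (t[k] >> 51);
  }

  /* The carry out of h[4] has weight 2^255 = 19 (mod p). */
  u128 w = (u128) h[0] + (u128) 19 * carry;
  h[0] = (uint64_t) w & MASK51;
  h[1] += (uint64_t) (w >> 51);
}
```

Real implementations unroll these loops, interleave carries with the
multiplications, and (in the .S files) allocate registers by hand; the
loops here only make the wraparound pattern visible.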
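
The mladder.S files listed above implement the Montgomery ladder for
X25519. Its control structure does not depend on the curve: it walks the
scalar bits from the top while maintaining a pair (R0, R1) whose
"difference" is the input, does the same work for every bit, and
replaces the secret-dependent branch with a conditional swap. The C
sketch below shows that structure for modular exponentiation in a toy
group (integers modulo a small prime, with R1 = R0 * x), so
multiplication plays the role of differential addition and squaring
plays the role of doubling. The names ladder_pow, cswap, and mulmod are
hypothetical, the code assumes unsigned __int128 support, and the %
reduction used here is not claimed to be constant-time.

```c
#include <stdint.h>

/* Hypothetical illustration, not lib25519 code. */

/* Swap *a and *b when bit == 1, without branching on bit. */
static void cswap(uint64_t *a, uint64_t *b, uint64_t bit)
{
  uint64_t mask = 0 - bit;          /* all ones if bit == 1, else 0 */
  uint64_t t = mask & (*a ^ *b);
  *a ^= t;
  *b ^= t;
}

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
  return (uint64_t) (((unsigned __int128) a * b) % m);
}

/* x^n mod m, scanning the low `bits` bits of n from the top.
   Invariant after each step: r1 = r0 * x (mod m). */
uint64_t ladder_pow(uint64_t x, uint64_t n, int bits, uint64_t m)
{
  uint64_t r0 = 1 % m, r1 = x % m;
  uint64_t swap = 0;
  for (int i = bits - 1; i >= 0; i--) {
    uint64_t b = (n >> i) & 1;
    cswap(&r0, &r1, swap ^ b);      /* lazy swap: state is swapped iff b */
    swap = b;
    r1 = mulmod(r0, r1, m);         /* analogue of differential addition */
    r0 = mulmod(r0, r0, m);         /* analogue of doubling */
  }
  cswap(&r0, &r1, swap);            /* undo the final pending swap */
  return r0;
}
```

The lazy swap (swapping on swap ^ b and undoing the last swap at the end)
follows the same pattern as the X25519 ladder pseudocode in RFC 7748.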

Almost all of the crypto_pow/inv25519 implementations use exponentiation
(computing x^(p-2) modulo p = 2^255-19, which inverts any nonzero x by
Fermat's little theorem), but there is also a different implementation
from the following source:

   * Daniel J. Bernstein, Bo-Yin Yang. "Fast constant-time gcd
     computation and modular inversion." IACR Transactions on
     Cryptographic Hardware and Embedded Systems 2019 issue 3 (2019),
     340–398.

     This is the source of the "safegcd" algorithm and software. Further
     speedups (no paper yet; ideas from Peter Dettman, Gregory Maxwell,
     and Pieter Wuille) have produced the "inverse25519skylake" software
     available here: https://gcd.cr.yp.to/software.html

     lib25519/crypto_pow/inv25519/amd64-safegcd is inverse25519skylake,
     tweaked to provide the crypto_pow API and to clear the top bit of
     the input.
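
Inversion by exponentiation relies on Fermat's little theorem: for a
prime p and x not divisible by p, x^(p-2) * x = x^(p-1) = 1 (mod p), so
x^(p-2) is the inverse of x. For p = 2^255 - 19 the exponent is
2^255 - 21, and production code evaluates it with a fixed chain of
squarings and multiplications. The C sketch below (hypothetical names
inv61 and mulmod61) uses the toy Mersenne prime 2^61 - 1 so that
everything fits in 64-bit words, with plain left-to-right
square-and-multiply; it assumes unsigned __int128 support.

```c
#include <stdint.h>

/* Hypothetical illustration, not lib25519 code: toy prime 2^61 - 1
   stands in for 2^255 - 19. */
#define P61 ((((uint64_t) 1) << 61) - 1)

static uint64_t mulmod61(uint64_t a, uint64_t b)
{
  return (uint64_t) (((unsigned __int128) a * b) % P61);
}

/* Return x^(p-2) mod p. The exponent p - 2 is public, so branching on
   its bits reveals nothing about x. Maps 0 to 0, since 0^(p-2) = 0. */
uint64_t inv61(uint64_t x)
{
  const uint64_t e = P61 - 2;       /* 2^61 - 3: highest set bit is 60 */
  uint64_t r = 1;
  for (int i = 60; i >= 0; i--) {
    r = mulmod61(r, r);             /* square for every exponent bit */
    if ((e >> i) & 1)
      r = mulmod61(r, x);           /* multiply where the bit is 1 */
  }
  return r;
}
```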
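
The safegcd approach replaces exponentiation with a sequence of
"divsteps" on a state (δ, f, g) with f odd, starting from δ = 1: if
δ > 0 and g is odd, the step maps the state to (1 - δ, g, (g - f)/2);
otherwise it maps it to (1 + δ, f, (g + (g mod 2) f)/2). After enough
steps g reaches 0 and f is ±gcd of the original inputs. The C sketch
below, with hypothetical names divstep and divstep_gcd, applies this
rule to small signed integers to compute a gcd. It omits what turns
safegcd into an inversion algorithm (tracking the transition matrices),
as well as the batching and constant-time techniques used in
inverse25519skylake.

```c
#include <stdint.h>

/* Hypothetical illustration, not lib25519 code. */
typedef struct { int64_t delta, f, g; } divstep_state;

/* One divstep; f stays odd, and |f|, |g| never exceed their start max. */
static divstep_state divstep(divstep_state s)
{
  divstep_state r;
  int64_t g_odd = (s.g % 2 != 0);   /* 1 if g is odd, also for negative g */
  if (s.delta > 0 && g_odd) {
    r.delta = 1 - s.delta;
    r.f = s.g;
    r.g = (s.g - s.f) / 2;          /* exact: g and f are both odd */
  } else {
    r.delta = 1 + s.delta;
    r.f = s.f;
    r.g = (s.g + g_odd * s.f) / 2;  /* exact: adds odd f only if g is odd */
  }
  return r;
}

/* |gcd(f, g)| for odd f. Once g reaches 0 it stays 0, so running a
   fixed number of steps is safe; 200 is a generous count for inputs
   below 2^31 in absolute value (the paper proves a precise bound). */
int64_t divstep_gcd(int64_t f, int64_t g)
{
  divstep_state s = { 1, f, g };
  for (int i = 0; i < 200; i++)
    s = divstep(s);
  return s.f < 0 ? -s.f : s.f;
}
```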

For lower-layer SHA-512 functions:

   * Daniel J. Bernstein, supercop/crypto_hash*/sha512/*. In lib25519,
     some unused variables are removed in crypto_hashblocks/sha512/avx
     to eliminate compiler warnings.

Most of the lib25519 infrastructure, such as the run-time implementation
selector automatically guided by CPU type and previous benchmarks, is
new in lib25519 from Daniel J. Bernstein. Portions of autogen-speed
(generating lib25519-speed.c) and autogen-test (generating
lib25519-test.c) are based on benchmarking software and test software in
SUPERCOP by Daniel J. Bernstein. The symmetric-cryptography code for
generating pseudorandom test inputs and hashing test outputs is adapted
from TweetNaCl, a library by Daniel J. Bernstein, Wesley Janssen, Tanja
Lange, and Peter Schwabe.