track history more precisely in documentation for individual source files
allow shared-* for API functions (requires tweaking dispatch)
speedups for more architectures
speedups for more microarchitectures
consider faster PRF for the keyed hash giving the nonce
merge subroutines in source to the extent possible
scan for and remove any unused functions and files
restructure for more merging at object-code level
sort object files (for, e.g., improved cache utilization)
optionally allow post-installation patching of current CPU as an exceptional cpuid
  (based on benchmarks and, with more CPU time, full functionality tests)
dispatch: eliminate, e.g., avx2 if avx is higher priority
speed up dispatch cpuid tests (lazy evaluation, merging cpuid calls)
powbatch: adjust MAXTODO considering time vs. memory usage
powbatch: use inversion tree, and try parallel mults
nPbatch: use larger batches to allow faster powbatch
nPbatch: use 0-padding of leftovers when this is faster than individual nP
multiscalar: adjust MAXTODO
multiscalar: vectorize across inputs
nG etc.: vectorize across points being added
batch versions of all other operations
vectorize all other parallelizable operations
verify constbranch, constindex
full functional verification
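
Illustration for the "powbatch: use inversion tree" item above. This is a
minimal sketch and not the library's code. It assumes that powbatch is used
for batched field inversion modulo 2^255-19 and that all inputs are nonzero.
The name batch_invert and the array layout are hypothetical. The sketch builds
a binary product tree, inverts only the root, and then pushes inverses back
down the tree. The multiplications within one tree level are independent of
each other, which is what would make "parallel mults" possible. Python integers
are not constant-time, so this shows only the arithmetic structure.

```python
P = 2**255 - 19  # assumed field prime


def batch_invert(xs, p=P):
    """Invert every element of xs modulo p using one modular inversion.

    Assumes every input is nonzero modulo p; a zero input would make
    every output zero.
    """
    n = len(xs)
    if n == 0:
        return []

    # Heap-style layout: leaves at indices n..2n-1, internal nodes at
    # 1..n-1, root at 1. Index 0 is unused.
    tree = [1] * n + [x % p for x in xs]

    # Upward pass: each internal node holds the product of its children.
    for i in range(n - 1, 0, -1):
        tree[i] = tree[2 * i] * tree[2 * i + 1] % p

    # The single inversion, done here by Fermat exponentiation.
    tree[1] = pow(tree[1], p - 2, p)

    # Downward pass: node i now holds the inverse of its subtree product,
    # so each child's inverse is that value times the sibling's product.
    for i in range(1, n):
        left, right = tree[2 * i], tree[2 * i + 1]
        tree[2 * i] = tree[i] * right % p
        tree[2 * i + 1] = tree[i] * left % p

    return tree[n:]
```

For n inputs this costs one inversion plus 3(n-1) multiplications, the same
count as the sequential form of the batching trick, but with a dependency
chain of depth about 2 log2(n) instead of about 2n.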