provide more accurate root tables
publish error bounds

review UltraSPARC profiles
do properly scheduled UltraSPARC asm
avoid partial-register stalls for single precision
consider prefetching for large transforms

review Pentium profiles
do properly scheduled Pentium/PMMX asm
pass parameters in registers
organize asm to fall through function entry when possible
organize asm to reduce i-cache pressure

investigate PPro/PII/PIII in more detail
speed up real transforms on PPro/PII/PIII
do properly scheduled PPro/PII/PIII asm

investigate other chips

support larger sizes
analyze L1 cache boundary more carefully
analyze organization of root tables
analyze L2 access patterns
analyze L2 cache boundary more carefully
analyze DRAM access patterns
measure effects of single-pass transposes
measure effects of multiple-pass transposes
consider other data structures

consider inline macros in .h files for small sizes
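
rough idea of what such a macro might look like for size 2: a minimal
sketch, assuming a simple struct of two doubles. The type cplx and the
macro FFT2_INLINE are hypothetical names, not the library's own. The
macro replaces (a0, a1) with (a0 + a1, a0 - a1) in place.

```c
/* Hypothetical complex type; the library's own types may differ. */
typedef struct { double re; double im; } cplx;

/* In-place size-2 transform: (a0, a1) -> (a0 + a1, a0 - a1).
   The argument is evaluated several times, so it must have no
   side effects. */
#define FFT2_INLINE(a) do { \
  double t_re = (a)[0].re - (a)[1].re; \
  double t_im = (a)[0].im - (a)[1].im; \
  (a)[0].re += (a)[1].re; \
  (a)[0].im += (a)[1].im; \
  (a)[1].re = t_re; \
  (a)[1].im = t_im; \
} while (0)
```

whether this beats a function call depends on the compiler and chip, so it
would need to be measured like everything else in this list.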
