A few more comments about style and optimization:
First, the thing that makes optimized code hard to understand is that (at least when a compiler does it) the optimized program goes through a different set of states than the user expected from the source code. When a loop is unrolled, or invariant expressions are moved out of a loop, or indexing is turned into pointer manipulation, it's mysterious if you don't know what's happening. But once you've seen this a few times, it's only a momentary hitch...
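To make those three transformations concrete, here is a minimal sketch in C. The function names and the choice of unrolling by four are made up for illustration, and the second function is a hand-written approximation of what an optimizer might do, not the output of any particular compiler.

```c
#include <stddef.h>

/* As the programmer wrote it: the product scale * gain is recomputed on
   every iteration, and each element is reached by indexing. */
void scale_plain(int *a, size_t n, int scale, int gain)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * (scale * gain);
}

/* Hypothetical hand-transformed version of the same loop. */
void scale_transformed(int *a, size_t n, int scale, int gain)
{
    int k = scale * gain;         /* invariant expression moved out of the loop */
    int *p = a;                   /* indexing replaced by a moving pointer */
    int *end = a + n;
    int *end4 = a + (n - n % 4);  /* last point reachable in steps of four */

    while (p < end4) {            /* loop unrolled by four */
        p[0] *= k;
        p[1] *= k;
        p[2] *= k;
        p[3] *= k;
        p += 4;
    }
    while (p < end)               /* the 0 to 3 leftover elements */
        *p++ *= k;
}
```

Both functions leave the array in the same final state, but a debugger stepping through the second one never sees the variable i, and sees four elements change per trip around the loop. That mismatch with the source is the "different set of states".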
Also, one lesson I learned from Unix is that if you can make a program 10x faster it becomes a qualitatively different experience. The first version of Yacc I wrote, using the textbook algorithms, took 20 minutes to process a 30-rule grammar. On the shared PDP-11, when I ran Yacc everyone else's jobs slowed to a crawl, and people were heard to mutter "*%&#! Johnson's running Yacc again!" A couple of years later, Yacc produced parsers faster than the C compiler could compile them.
Consider what using Google would be like if it took 7 seconds rather than 0.7 seconds to respond. Probably still useful, but a different user experience.
We had many decades to be spoiled as far as program performance was concerned. As our tools got more bloated, faster processors and more memory hid the true effect from most of us. Now we have to rethink everything to make it parallel. And complexity is enemy number one when making something parallel...
Steve