I don’t really think out-of-order execution in hardware causes trouble for programmers that wasn’t
already there when you use -O3. Compilers will already promote memory to registers and do
interprocedural optimization and reorder memory references. You have to sprinkle
asm volatile("" ::: "memory");
around like pixie dust to make sure the compiler does things in the order you want,
never mind the hardware.
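To make the pixie dust concrete, here is a minimal sketch, assuming GCC or Clang. The names COMPILER_BARRIER, payload, ready, publish, and consume are made up for illustration. The barrier emits no instructions; it only stops the compiler from moving memory accesses across it or keeping memory values in registers across it. It does nothing about hardware reordering, and it is not a substitute for real atomics in threaded code.

```c
/* GCC/Clang-specific compiler barrier (same as asm volatile("" ::: "memory")).
   Emits no machine instructions; constrains only the compiler. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static int payload;  /* hypothetical data being handed off */
static int ready;    /* hypothetical flag saying the data is valid */

/* Without the barrier, the compiler is free to emit the store to
   `ready` before the store to `payload`. */
void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();
    ready = 1;
}

/* Returns the payload, or -1 if nothing has been published yet. */
int consume(void)
{
    if (!ready)
        return -1;
    COMPILER_BARRIER();  /* keep the payload load after the flag check */
    return payload;
}
```

Called from one thread, consume() returns -1 until publish() has run, then returns the published value. The barrier only starts to matter when something else (another thread, a signal handler, a device) is looking at the same memory.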
x86 has wildly complex microarchitecture, but the model for a single thread is perfectly
sensible. It behaves as if it were in-order. OOO isn’t changing that model of
execution at all. I mean you care about it when performance tuning, but not for
correctness.
Other architectures (ARM, IBM POWER, Alpha) have much weaker memory ordering than x86,
but that only bites once more than one core is involved.
The real problems are when you have multicore in a shared memory space and you have to
read all the fine print in the memory ordering chapter and understand the subtle
differences between LFENCE and SFENCE and MFENCE. I get that, but I also think shared
memory is a failed experiment and we should have gone with distributed memory and clusters
and message passing. It is possible for mortals to program those. Regular folks can
make 1000-rank MPI programs work, whereas almost no one other than Maurice Herlihy or someone
like that can reason about locks and threads and shared data structures.
My wish for a better world is to integrate messaging into the architecture rather than
have an I/O device model for communications. It is criminal that machine to machine comms
are still stuck at 800 nanoseconds or so latency. It takes 200 instructions or so to send
a message under the best circumstances and a similar number to receive it, plus bus,
adapter, wire, and switch time.
-L