Thanks for the promotion to CTO, Clem! I was merely the software architect at SiCortex.
The SC systems had 6-core MIPS64 CPUs at 700 MHz, two-channel DDR2, and a really fast
interconnect. (Seriously fast for its day: 800 ns PUT, 1.6 µs GET, north of 2 GB/sec
end-to-end, and this was in 2008.) The Achilles heel was low memory bandwidth, due to a
core limitation of a single outstanding miss. The new chip would have fixed that (and
delivered about 8x the performance), but we ran out of money in 2009, which was not a good
time to go looking for more.
We had delighted customers who appreciated the reliability and the network. For
latency-limited codes we did extremely well (GUPS), and we still did well on the rest from
a flops/watt perspective. However, lots of commercial prospects didn’t have codes that
needed the network and did need single-stream performance. We talked to Urs Hölzle at
Google and he was very clear - they needed fast single threads. The low power was very
nice, … but we were welcome to try and parallelize their benchmarks.
Which brings me back to the original issue - does C constrain our architectural thinking?
I’ve spent a fair amount of time recently digging into Nyx, which is an adaptive mesh
refinement cosmological hydrodynamics code. The framework is in C++ because the
inheritance stuff makes it straightforward to adapt the AMR machinery to different
problems. This isn’t the kind of horrible C++ where you can’t tell what is going to
happen; it is pretty close to C-style code in which you can visualize what the compiler
will do.
The “solvers” tend to be Fortran modules because, I think, Fortran is just sensible about
multidimensional arrays and indexing in a way that takes weird macros to replicate in C.
It isn’t, I think, that C or C++ compilers cannot generate good code - it is about the
syntax for arrays.
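To make the array point concrete, here is roughly what the C workaround looks like; this
is a minimal sketch with names and a stencil of my own choosing, not code from Nyx:

    #include <stddef.h>

    /* C has no native a(i,j,k) for runtime-sized arrays, so you carry the
     * strides yourself, usually hidden behind a macro like this one.  The
     * Fortran loop body is just out(i,j,k) = u(i-1,j,k) + ... with the
     * compiler doing the index arithmetic for you. */
    #define IDX3(a, i, j, k, nx, ny) \
        ((a)[(i) + (size_t)(nx) * ((j) + (size_t)(ny) * (k))])

    void laplacian(const double *u, double *out, int nx, int ny, int nz)
    {
        for (int k = 1; k < nz - 1; k++)
            for (int j = 1; j < ny - 1; j++)
                for (int i = 1; i < nx - 1; i++)
                    IDX3(out, i, j, k, nx, ny) =
                        IDX3(u, i - 1, j, k, nx, ny) + IDX3(u, i + 1, j, k, nx, ny) +
                        IDX3(u, i, j - 1, k, nx, ny) + IDX3(u, i, j + 1, k, nx, ny) +
                        IDX3(u, i, j, k - 1, nx, ny) + IDX3(u, i, j, k + 1, nx, ny) -
                        6.0 * IDX3(u, i, j, k, nx, ny);
    }

The compiler generates perfectly good code for this; it is just uglier to write and to
read than the Fortran equivalent.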
For anyone interested in architectural arm wrestling, memory IS the main issue. It is
worth reading the papers on BLIS, which lay out an analytical model for writing Basic
Linear Algebra libraries. Once you figure out the flops per byte, you are nearly done -
the rest is
complicated but straightforward code tuning. Matrix multiply has O(n^3) computation for
O(n^2) memory and that immediately says you can get close to 100% of the ALUs running if
you have a clue about blocking in the caches. This is just as easy or hard to do in C as
in Fortran. The kernels tend to wind up in asm(“”) no matter what you wish for just in
order to get the prefetch instructions placed just so. As far as I can tell, compilers
still do not have very good models for cache hierarchies although there isn’t really any
reason why they shouldn’t. Similarly, if your code is mainly doing inner products, you
are doomed to run at memory speeds rather than ALU speeds. Multithreading usually doesn’t
help, because often other cores are farther away than main memory.
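To put a number on the blocking argument, here is a minimal cache-blocked matrix multiply
sketch; the block size and the names are mine, and a real BLIS-style kernel layers packing,
register microkernels, and hand-placed prefetches on top of this:

    #include <stddef.h>

    /* C = C + A*B for square n x n row-major matrices.  Each NB x NB tile
     * of A and B gets reused roughly NB times once it is sitting in cache,
     * so the flops-per-byte ratio grows with NB instead of being pinned at
     * memory speed the way a plain inner product is. */
    #define NB 64

    void matmul_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += NB)
            for (int kk = 0; kk < n; kk += NB)
                for (int jj = 0; jj < n; jj += NB)
                    for (int i = ii; i < ii + NB && i < n; i++)
                        for (int k = kk; k < kk + NB && k < n; k++) {
                            double aik = A[(size_t)i * n + k];
                            for (int j = jj; j < jj + NB && j < n; j++)
                                C[(size_t)i * n + j] += aik * B[(size_t)k * n + j];
                        }
    }

The inner-product contrast is the whole story: a dot product does two flops for every
sixteen bytes it loads, and no amount of blocking or threading changes that ratio.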
My summary of the language question comes down to: if you knew what code would run fast,
you could code it in C. Thinking that a new language will explain how to make it run fast
is just wishful thinking. It just pushes the problem onto the compiler writers, and they
don’t know how to code it to run fast either. The only argument I like for new languages
is that at least they might be able to let you describe the problem in a way that others
will recognize. I’m sure everyone here has had the sad experience of trying to figure out
what is the idea behind a chunk of code. Comments are usually useless. I wind up trying
to match the physics papers with the math against the code and it makes my brain hurt. It
sure would be nice if there were a series of representations between math and hardware
transitioning from why to what to how. I think that is what Steele was trying to do with
Fortress.
I do think the current environment is the best for architectural innovation since the
‘90s. We have The Machine, we have Dover Micro trying to add security, we have
Microsoft’s EDGE stuff, and the multiway battle between Intel/AMD/ARM and the GPU guys and
the FPGA guys. It is a lot more interesting than 2005!
On Jun 28, 2018, at 11:37 AM, Clem Cole <clemc@ccc.com> wrote:
On Thu, Jun 28, 2018 at 10:40 AM, Larry McVoy <lm@mcvoy.com> wrote:
Yep. Lots of cpus are nice when doing a parallel make but there is
always some task that just uses one cpu. And then you want the fastest
one you can get. Lots of wimpy cpus is just, um, wimpy.
Larry Stewart would be better placed to reply as SiCortex's CTO - but that was the basic
logic behind their system -- lots of cheap MIPS chips. Truth is they made a pretty neat
system and it scaled pretty well. My observation is that for them, as for most of the
attempts I have been a part of, in the end architecture does not matter nearly as much as
economics.
In my career I have built 4 or 5 specially architected systems. You can basically live
through one or two generations using some technology argument and 'win'. But
in the end, people buy computers to do a job and they really don't give a s*t about
how the job gets done, as long as it gets done cheaply. Whoever wins the economic war
has the 'winning' architecture. Look, x86/Intel*64 would never win awards as a
'Computer Science Architecture', and the same goes on the SW side (Fortran vs. Algol,
etc.); Windows beat UNIX workstations for the same reasons... as we all know.
Hey, I used to race sailboats ... there is a term called a 'sea lawyer' -
the guy screaming that he has been fouled while he is drowning because his boat is sinking.
I keep thinking about it here. You can scream all you want about the goodness or badness
of an architecture or a language, but in the end, users really don't care. They buy
computers to do a job. You really cannot forget that that is the purpose.
As Larry says: lots of wimpy cpus is just wimpy. Hey, Intel, nVidia, and AMD's job
is to sell expensive hot rocks. They are going to do what they can to make those rocks
useful for people. They want to help people get their jobs done -- period. That is what
they do. Atmel and the RPi folks take the 'jelly bean' approach - selling enough of them
to make it worth it for the chip manufacturer, and if the simple machine can do the
customer's job, very cool. In those cases simple is good (hey, the PDP-11 is pretty
complex compared to, say, the 6502).
So, I think the author of the paper trashing C as too high level misses the point, and
arguing about architecture is silly. In the end it is about what it costs to get the job
done. People will use whatever is most economical for them.
Clem