Thanks for the promotion to CTO, Clem!  I was merely the software architect at SiCortex.

The SC systems had 6-core MIPS-64 CPUs at 700 MHz, two-channel DDR-2, and a really fast interconnect.  (Seriously fast for its day: 800 ns PUT, 1.6 µs GET, north of 2 GB/sec end-to-end, and this was in 2008.)  The Achilles heel was low memory bandwidth, due to a core limitation of a single outstanding miss.  The new chip would have fixed that (and given about 8x the performance) but we ran out of money in 2009, which was not a good time to look for more.

We had delighted customers who appreciated the reliability and the network.  For latency-limited codes we did extremely well (GUPS) and still did well on the rest from a flops/watt perspective.  However, lots of commercial prospects didn’t have codes that needed the network and did need single-stream performance.  We talked to Urs Hölzle at Google and he was very clear - they needed fast single threads.  The low power was very nice, … but we were welcome to try and parallelize their benchmarks.

Which brings me back to the original issue - does C constrain our architectural thinking? 

I’ve spent a fair amount of time recently digging into Nyx, which is an adaptive mesh refinement cosmological hydrodynamics code.  The framework is in C++ because the inheritance stuff makes it straightforward to adapt the AMR machinery to different problems.  This isn’t the kind of horrible C++ where you can’t tell what is going to happen, but pretty close to a C style in which you can visualize what the compiler will do.  The “solvers” tend to be Fortran modules, because, I think, Fortran is just sensible about multidimensional arrays and indexing in a way you have to use weird macros to replicate in C.  It isn’t, I think, that C or C++ compilers cannot generate good code - it is about the syntax for arrays.
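
To make the array-syntax point concrete, here is a minimal sketch (not taken from Nyx) of the kind of macro you end up writing in C to get Fortran-style indexing.  The names IDX3 and neighbor_sum are made up for illustration, and the layout assumed is Fortran's: 1-based and column-major, with the first index varying fastest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: Fortran-style (1-based, column-major) index into a
   flat array whose first two extents are nx and ny.  In Fortran you would
   declare  real(8) :: u(nx,ny,nz)  and simply write  u(i,j,k). */
#define IDX3(i, j, k, nx, ny)                                  \
    ((size_t)((i) - 1) +                                       \
     (size_t)(nx) * ((size_t)((j) - 1) +                       \
                     (size_t)(ny) * (size_t)((k) - 1)))

/* Sum of the six face neighbours of an interior cell (i,j,k).
   The Fortran equivalent is a single readable expression:
     u(i-1,j,k) + u(i+1,j,k) + u(i,j-1,k) + u(i,j+1,k) + u(i,j,k-1) + u(i,j,k+1)
   The caller must also ensure k is below the third extent. */
double neighbor_sum(const double *u, int nx, int ny, int i, int j, int k)
{
    assert(i > 1 && i < nx && j > 1 && j < ny && k > 1);
    return u[IDX3(i - 1, j, k, nx, ny)] + u[IDX3(i + 1, j, k, nx, ny)]
         + u[IDX3(i, j - 1, k, nx, ny)] + u[IDX3(i, j + 1, k, nx, ny)]
         + u[IDX3(i, j, k - 1, nx, ny)] + u[IDX3(i, j, k + 1, nx, ny)];
}
```

The generated code should be about the same either way; the difference is that the extents have to be threaded through every access by hand in C.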

For anyone interested in architectural arm wrestling, memory IS the main issue.  It is worth reading the papers on BLIS, an analytical model for writing Basic Linear Algebra libraries.  Once you figure out the flops per byte, you are nearly done - the rest is complicated but straightforward code tuning.  Matrix multiply has O(n^3) computation for O(n^2) memory and that immediately says you can get close to 100% of the ALUs running if you have a clue about blocking in the caches.  This is just as easy or hard to do in C as in Fortran.  The kernels tend to wind up in asm(“”) no matter what you wish for just in order to get the prefetch instructions placed just so.  As far as I can tell, compilers still do not have very good models for cache hierarchies although there isn’t really any reason why they shouldn’t.  Similarly, if your code is mainly doing inner products, you are doomed to run at memory speeds rather than ALU speeds.  Multithreading usually doesn’t help, because often other cores are farther away than main memory.
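
A rough sketch of the flops-per-byte arithmetic and of cache blocking, in plain C.  This is not the BLIS algorithm or its kernels, just the simplest tiled loop nest; the function names and the tile size BLOCK are assumptions for illustration, and the intensity figures assume 8-byte doubles with each operand moved to or from memory exactly once.

```c
#include <stddef.h>

/* C = A*B with n-by-n doubles: 2n^3 flops over 3n^2 * 8 bytes = n/12
   flops per byte, so intensity grows with n. */
double matmul_intensity(double n)
{
    return (2.0 * n * n * n) / (3.0 * n * n * sizeof(double));
}

/* Inner product of two length-n vectors: 2n flops over 2n * 8 bytes =
   1/8 flop per byte no matter how large n is, hence memory bound. */
double dot_intensity(double n)
{
    return (2.0 * n) / (2.0 * n * sizeof(double));
}

enum { BLOCK = 32 };  /* assumed tile size; tune so three tiles fit in cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A*B, row-major, plain triple loop. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Blocked version: same arithmetic, but done tile by tile so each tile of
   A, B and C is reused many times while it is still in cache. */
void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < min_sz(ii + BLOCK, n); i++)
                    for (size_t k = kk; k < min_sz(kk + BLOCK, n); k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < min_sz(jj + BLOCK, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Both versions do exactly the same flops; only the order of memory accesses changes.  A tuned library goes much further (packing, register tiles, prefetch), which is where the asm comes in.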

My summary of the language question comes down to: if you knew what code would run fast, you could code it in C.  Thinking that a new language will explain how to make it run fast is just wishful thinking.  It just pushes the problem onto the compiler writers, and they don’t know how to code it to run fast either.  The only argument I like for new languages is that at least they might be able to let you describe the problem in a way that others will recognize.  I’m sure everyone here has had the sad experience of trying to figure out what the idea behind a chunk of code is.  Comments are usually useless.  I wind up trying to match the physics papers with the math against the code and it makes my brain hurt.  It sure would be nice if there were a series of representations between math and hardware transitioning from why to what to how.  I think that is what Steele was trying to do with Fortress.

I do think the current environment is the best for architectural innovation since the ‘90s.  We have The Machine, we have Dover Micro trying to add security, we have Microsoft’s EDGE stuff, and the multiway battle between Intel/AMD/ARM and the GPU guys and the FPGA guys.  It is a lot more interesting than 2005!  

On 2018, Jun 28, at 11:37 AM, Clem Cole <clemc@ccc.com> wrote:



On Thu, Jun 28, 2018 at 10:40 AM, Larry McVoy <lm@mcvoy.com> wrote:
Yep.  Lots of cpus are nice when doing a parallel make but there is
always some task that just uses one cpu.  And then you want the fastest
one you can get.  Lots of wimpy cpus is just, um, wimpy.

Larry Stewart would be better to reply as SiCortex's CTO - but that was the basic logic behind their system -- lots of cheap MIPS chips. Truth is they made a pretty neat system and it scaled pretty well.   My observation is that for them, like most of the attempts I have been a part of, in the end architecture does not matter nearly as much as economics.

In my career I have built 4 or 5 specially architected systems.  You can basically live through one or two generations using some technology argument and 'win'.   But in the end, people buy computers to do a job and they really don't give a s*t about how the job gets done, as long as it gets done cheaply.   Whoever wins the economic war has the 'winning' architecture.   Look, x86/Intel*64 would never win awards as a 'Computer Science Architecture', or on the SW side: Fortran vs. Algol etc.; Windows beat UNIX workstations for the same reasons... as we all know.

Hey, I used to race sailboats ...  there is a term called a 'sea lawyer' - where you are screaming you have been fouled but you are drowning as your boat is sinking.   I keep thinking about it here.   You can scream all you want about goodness or badness of architecture or language, but in the end, users really don't care.   They buy computers to do a job.   You really can not forget that that is the purpose.

As Larry says: Lots of wimpy cpus is just wimpy.    Hey, Intel, nVidia and AMD's job is to sell expensive hot rocks.   They are going to do what they can to make those rocks useful for people.  They want to help people get their jobs done -- period. That is what they do.   Atmel and RPi folks take the 'jelly bean' approach - which is one of selling enough to make it worth it for the chip manufacturer, and if the simple machine can do the customer's job, very cool.  In those cases simple is good (hey, the PDP-11 is pretty complex compared to, say, the 6502).

So, I think the author of the paper trashing C as too high level misses the point, and arguing about architecture is silly.  In the end it is about what it costs to get the job done.   People will use what is most economical for them.

Clem