below...
On Sat, Jun 16, 2018 at 9:37 AM, Noel Chiappa <jnc(a)mercury.lcs.mit.edu> wrote:

> I can't speak to the motivations of everyone who repeats these stories, but
> my professional career has been littered with examples of poor vision from
> technical colleagues (some of whom should have known better), against which
> I (in my role as an architect, which is necessarily somewhere where
> long-range thinking is - or should be - a requirement) have struggled again
> and again - sometimes successfully, more often, not.
Amen, although sadly many of us, if not all of us, have a few of these
stories. In fact, I'm fighting another one of these battles right now.🤔
My experience is that more often than not, it's less a failure to see what
a successful future might bring, and more often one of, well, '*we don't need
to do that now / it costs too much / we don't have the time*.'
That said, DEC was the epitome of the old line about perfection being the
enemy of success. I like to say to my colleagues: pick the things that are
going to really matter. Make those perfect and bet the company on them. But
think in terms of what matters. As you point out, address size issues are
killers and you need to get those right at time t0.
Without saying too much, many firms, like my own, think in terms of
computation (math libraries, CPU kernels), but frankly if I cannot get the
data to and from the CPU's functional units, or the data is stored in the
wrong place, or I have much of main memory tied up in the OS managing
different types of user memory, it doesn't matter [HPC customers in
particular pay for getting a job done -- they really don't care how -- just
get it done, and done fast].
To me, it becomes a matter of 'value' -- our HW folks know a crappy
computational system will doom the device, so that is what they put their
effort into building. My argument has often been that the messaging
systems, memory hierarchy and housekeeping are what you have to get right
at this point. No amount of SW will fix HW that is lacking the right
support in those places (not that lots of computes are bad, but they are
actually not the big issue in HPC when you get down to it these days).
> Let's start with the UNIBUS. Why does it have only 18 address lines? (I
> have this vague memory of a quote from Gordon Bell admitting that was a
> mistake, but I don't recall exactly where I saw it.)
I think it was part of the same paper where he made the observation that
the greatest mistake an architecture can have is too few address bits.
My understanding is that the problem was that the UNIBUS was perceived as an
I/O bus and, as I was pointing out, the folks creating it/running the team
did not value it, so in the name of 'cost', more bits were not considered
important (18 address lines reach only 256 KB).
I used to know and work with the late Henk Schalke, who ran Unibus (HW)
engineering at DEC for many years. Henk was notoriously frugal (we might
even say 'cheap'), so I can imagine that he did not want to spend on
anything that he thought was wasteful. Just like the Amdahl/Brooks story of
the 8-bit byte that I retold, with Amdahl thinking Brooks was nuts: I don't
know for sure, but I can see it happening without someone really arguing
with Henk as to why 18 bits was not 'good enough.' I can imagine the
conversation going something like this. Someone like me saying: *"Henk, 18
bits is not going to cut it."* He might have replied something like: *"Bool
sheet"* [a Dutchman's way of cursing in English], *"we already gave you two
more bits than you can address"* -- the PDP-11's own addresses being only 16
bits (actually he'd then probably stop mid-sentence and translate in his
head from Dutch to English, which was always interesting when you argued
with him).
Note: I'm not blaming Henk, just stating that his thinking was very much
that way, and I suspect he was not alone. Only someone like Gordon at the
time could have overruled it, and I don't think the problems were foreseen,
as Noel notes.
> And a major one from the very start of my career: the decision to remove
> the variable-length addresses from IPv3 and substitute the 32-bit addresses
> of IPv4.
I always wondered about the back story on that one. I do seem to remember
that there had been a proposal for variable-length addresses at one point,
but I never knew why it was not picked. As you say, I was certainly way too
junior to have been part of that discussion. We just had some of the
documents from you guys and we were told to try to implement it. My guess
is this is an example of folks thinking variable-length addressing was
wasteful. 32 bits seemed infinite in those days and nobody expected the
network to scale to the size it is today and will grow to in the future [I
do remember, before Noel and team came up with ARP, somebody quipped that
Xerox Ethernet's 48 bits were too big and IP's 32 bits were too small. The
original hack I did was, since we used 3Com boards and they all shared the
upper 3 bytes of the MAC address, to map the lower 24 bits of the MAC to the
IP address -- we were not connecting to the global network, so it worked.
Later we used a lookup table, until the ARP trick was created].
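A minimal sketch of what that mapping amounted to (an after-the-fact
illustration in C, not the original code; the vendor prefix bytes here are
assumed for the example, and the trick only works if every board on the wire
shares them):

    /* Pre-ARP static mapping: derive a MAC address from an IP address by
     * copying the low 24 bits of the IP into the low 24 bits of the MAC,
     * with a fixed (assumed) vendor prefix in the upper 3 bytes. */
    #include <stdint.h>
    #include <stdio.h>

    static void ip_to_mac(uint32_t ip, uint8_t mac[6])
    {
        mac[0] = 0x02;               /* assumed shared vendor prefix */
        mac[1] = 0x60;
        mac[2] = 0x8c;
        mac[3] = (ip >> 16) & 0xff;  /* low 24 bits of the IP address */
        mac[4] = (ip >> 8) & 0xff;   /* become the low 24 bits of the */
        mac[5] = ip & 0xff;          /* MAC address */
    }

    int main(void)
    {
        uint32_t ip = (10u << 24) | (1u << 16) | (2u << 8) | 3u; /* 10.1.2.3 */
        uint8_t mac[6];
        ip_to_mac(ip, mac);
        printf("%02x:%02x:%02x:%02x:%02x:%02x\n",
               mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
        return 0;
    }

No table, no protocol -- which is exactly why it stopped working the moment
you had to talk to hardware from anyone else.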
> One place where I _did_ manage to win was in adding subnetting support to
> hosts (in the Host Requirements WG); it was done the way I wanted, with the
> result that when CIDR came along, even though it hadn't been foreseen at
> the time we did subnetting, it required _no_ host changes of any kind.
Amen and thank you.
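As a small aside, here is a sketch of why that choice aged so well (my own
illustration, not anything from the Host Requirements text): a host that
answers "is this destination on my network?" with a full 32-bit mask works
for any prefix length, classful or not, so CIDR needed nothing new from it:

    /* Subnet/CIDR test: same code regardless of prefix length. */
    #include <stdint.h>
    #include <stdio.h>

    /* Is dst on the same (sub)network as our interface? */
    static int same_net(uint32_t src, uint32_t dst, uint32_t mask)
    {
        return (src & mask) == (dst & mask);
    }

    int main(void)
    {
        uint32_t ifaddr = 0xC0A80A05; /* 192.168.10.5 (example values) */
        uint32_t mask   = 0xFFFFFF00; /* a /24 -- but any length works */
        uint32_t dst    = 0xC0A80A63; /* 192.168.10.99 */

        printf(same_net(ifaddr, dst, mask) ? "deliver locally\n"
                                           : "send to gateway\n");
        return 0;
    }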
> But mostly I lost. :-(
I know the feeling. Too many battles that, in hindsight, you think: darn, if
they had only listened. FWIW: if you try to mess with the Intel OPA2 fabric
these days, there is a back story. A few years ago I had quite a battle
with the HW folks, but I won that one. The SW cannot tell the difference
between on-die and off-die memory, so the OS does not have to manage it. Huge
difference in OS performance and space efficiency. But I suspect that
there are some HW folks that spit on the floor when I come in the room.
We'll see if I am proven to be right in the long run; but at 1M cores I
don't want to think of the OS mess of managing two different types of memory
for the message system.
> So, is poor vision common? All too common.
Indeed. But to be fair, you can also end up being like DEC and often late to
market. My example is Alpha (and 64-bit vs 32-bit). No attempt to support
32-bit was really made, because 64-bit was the future. Senior folks
considered 32-bit mode wasteful. The argument was that adding it was not
only technically not a good idea, but it would suck up engineering resources
to implement it in both HW and SW. Plus, we were coming from the VAX, so
folks had to recompile all the code anyway (god forbid that the SW might not
be 64-bit clean, mind you). [VMS did a few hacks, but Tru64 stayed 'clean.']
Similarly (back to the UNIX theme for this mailing list), Tru64 was a rewrite
of OSF/1 -- hardly any OSF code was left in the DEC kernel by the time it
shipped. Each subsystem was rewritten/replaced to 'make it perfect' [always
with a good argument, mind you, but never looking at the long-term issues].
Those two choices cost 3 years in market acceptance. By the time Alphas hit
the street, it did not matter. I think in both cases Alpha would have been
better accepted if DEC had shipped earlier with a few hacks, but then
improved Tru64 as a better version was developed (*i.e.*, replacing the
memory system, the I/O system, the TTY handler, and the FS, just to name a
few that got rewritten from OSF/1 because folks thought they were 'weak').
The trick, in my mind, is to identify the real technical features you cannot
fix later and get those right at the beginning. Then place the bet on those
features, develop as fast as you can, and do the best with them that you are
able given your constraints. Then slowly, over time, improve the things that
mattered less at the beginning, as you have a revenue stream. If you wait
for perfection, you get something like Alpha, which was a great architecture
(particularly compared to INTEL*64) -- but in the end, it did not matter.
Clem