At 2024-09-21T01:07:11+1000, Dave Horsfall wrote:
> Unless I'm mistaken (quite possible at my age), the OP was referring
> to that in C, pointers and arrays are pretty much the same thing i.e.
> "foo[-2]" means "take the pointer 'foo' and go back two things"
> (whatever a "thing" is).
"in C, pointers and arrays are pretty much the same thing" is a common
utterance, but a misleading one that is, in my opinion, better
replaced.
We should instead say something more like:
In C, pointers and arrays have compatible dereference syntaxes.
They do _not_ have compatible _declaration_ syntaxes.
Chapter 4 of van der Linden's _Expert C Programming: Deep C Secrets_
(1994) tackles this issue head-on and at length.
Here's the salient point.
"Consider the case of an external declaration `extern char *p;` but a
definition of `char p[10];`. When we retrieve the contents of `p[i]`
using the extern, we get characters, but we treat it as a pointer.
Interpreting ASCII characters as an address is garbage, and if you're
lucky the program will coredump at that point. If you're not lucky it
will corrupt something in your address space, causing a mysterious
failure at some point later in the program."
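
To make that concrete, here is a minimal two-file sketch of the
mismatch he describes, shown as one listing with the (hypothetical)
file split marked in comments:

/* In def.c: the real definition -- ten bytes of char, not a pointer. */
char p[10] = "ABCDEFGHI";

/* In use.c: the lying declaration and its victim. */
#include <stdio.h>

extern char *p;                 /* wrong: p is defined as an array */

int main(void)
{
    /* The generated code loads sizeof(char *) bytes of character
       data from the array, treats them as an address, and indexes
       off it: undefined behavior, a coredump if you're lucky. */
    printf("%c\n", p[1]);
    return 0;
}

Neither file provokes a diagnostic on its own, and the traditional
Unix linker matches names, not types, so nothing complains until run
time.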
> C is just a high level assembly language;
I disagree with this common claim too. Assembly languages correspond to
well-defined machine models.[1] Those machine models have memory
models. C has no memory model--deliberately, because that would have
gotten in the way of performance. (In practice, C's machine model was
and remains the PDP-11,[2] with aspects thereof progressively sanded off
over the years in repeated efforts to salvage the language's reputation
for portability.)
> there is no such object as a "string" for example: it's just an
> "array of char" with the last element being "\0" (viz: "strlen" vs.
> "sizeof").
Yeah, it turns out we need a well-defined string type much more
powerfully than, it seems, anyone at the Bell Labs CSRC appreciated.
string.h was tacked on (by Nils-Peter Nelson, as I understand it) at
the end of the 1970s, and C aficionados have defended the language's
purported perfection with such vigor that they annexed the haphazardly
assembled standard library into the territory they guard with
rhetorical violence and overstatement. From useless or redundant return
values to const-carelessness to Schlemiel the Painter algorithms in
implementations, it seems we've collectively made every mistake that
could be made with Nelson's original, minimal API, and taught those
mistakes as best practices in tutorials and classrooms. A sorry affair.
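
To pick on just one of those: the Schlemiel the Painter problem shows
up whenever strcat() is called in a loop, because every call rescans
the destination from the beginning to find the terminating NUL. A
sketch of my own (no bounds checking; dst is assumed large enough):

#include <string.h>

/* Quadratic: strcat() walks dst from the start on every iteration. */
void join_slow(char *dst, const char *pieces[], size_t n)
{
    dst[0] = '\0';
    for (size_t i = 0; i < n; i++)
        strcat(dst, pieces[i]);
}

/* Linear: keep a pointer to the end instead of rediscovering it. */
void join_fast(char *dst, const char *pieces[], size_t n)
{
    char *end = dst;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(pieces[i]);
        memcpy(end, pieces[i], len);
        end += len;
    }
    *end = '\0';
}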
So deep was this disdain for the string as a well-defined data type,
and moreover one conceptually distinct from an array (or vector) of
integral types, that Stroustrup initially repeated the mistake in C++.
People can
easily roll their own, he seemed to have thought. Eventually he thought
again, but C++ took so long to get standardized that by then, damage was
done.
"A string is just an array of `char`s, and a `char` is just a
byte"--another hasty equivalence that surrendered a priceless hostage to
fortune. This is the sort of fallacy indulged in by people excessively
wedded to machine language programming, who apply its perspective to
every problem statement uncritically.
Again and again, with signed vs. unsigned bytes, "wide" vs. "narrow"
characters, and "base" vs. "combining" characters, the champions of
the
"portable assembly" paradigm charged like Lord Cardigan into the pike
and musket lines of the character type as one might envision it in a
machine register. (This insistence on visualizing register-level
representations has prompted numerous other stupidities, like the use of
an integral zero at the _language level_ to represent empty, null, or
false literals for as many different data types as possible. "If it
ends up as a zero in a register," the thinking appears to have gone, "it
should look like a zero in the source code." Generations of code--and
language--cowboys have screwed us all over repeatedly with this hasty
equivalence.)
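
For the avoidance of doubt, here is the overloaded zero in one breath,
in a trivial sketch of my own:

/* One token, three unrelated meanings, told apart only by context. */
int has_name(const char *name)
{
    if (name == 0)          /* 0 as a null pointer constant */
        return 0;           /* 0 as "false" */
    return name[0] != 0;    /* 0 as the NUL string terminator */
}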
Type theorists have known better for decades. But type theory is (1)
hard (it certainly is, to cowboys) and (2) has never enjoyed a trendy
day in the sun (for which we may be grateful), which means it is
seldom on the path one anticipates to a comfortable retirement from a
Silicon Valley tech company (or several) on a private yacht.
Why do I rant so splenetically about these issues? Because the result
of such confusion is _bugs in programs_. You want something concrete?
There it is. Data types protect you from screwing up. And the better
your data types are, the more care you give to specifying what sorts of
objects your program manipulates, the more thought you give to the
invariants that must be maintained for your program to remain in a
well-defined state, the fewer bugs you will have.
But, nah, better to slap together a prototype, ship it, talk it up to
the moon as your latest triumph while interviewing with a rival of the
company you just delivered that prototype to, and look on in amusement
when your brilliant achievement either proves disastrous in deployment
or soaks up the waking hours of an entire team of your former colleagues
cleaning up the steaming pile you voided from your rock star bowels.
We've paid a heavy price for C's slow and seemingly deeply grudging
embrace of the type concept. (The lack of controlled scope for
enumeration constants is one example; the horrifyingly ill-conceived
choice of "typedef" as a keyword indicating _type aliasing_ is another.)
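
Two small illustrations of what I mean (my own examples):

/* Enumeration constants leak into the enclosing scope, so two enums
   at the same scope cannot reuse a name: */
enum color  { RED, GREEN, BLUE };
enum signal { AMBER, FLASHING };   /* adding RED here would clash */

/* And typedef creates only an alias, not a new type, so the compiler
   cannot catch these being mixed up: */
typedef double meters;
typedef double seconds;

double speed(meters d, seconds t) { return d / t; }
/* speed(t, d) compiles without a peep, arguments swapped. */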
Kernighan did not help by trashing Pascal so hard in about 1980. He was
dead right that Pascal needed, essentially, subprograms polymorphic in
array types. Wirth not speccing the language to accommodate that back
in 1973 or so was a sad mistake. But Pascal got a lot of other stuff
right--stuff that the partisanship of C advocates refused to countenance
such that they ended up celebrating C's flaws as features. No amount of
Jonestown tea could quench their thirst. I suspect the truth was more
that they didn't want to bother having to learn any other languages.
(Or if they did, not any language that anyone else on their team at work
had any facility with.) A rock star plays only one instrument, no?
People didn't like it when Eddie Van Halen played keyboards instead of
guitar on stage, so he stopped doing that. The less your coworkers
understand your work, the more of a genius you must be.
Now, where was I?
> What's the length of "abc" vs. how many bytes are needed to store it?
Even what is meant by "length" has several different correct answers!
Quantity of code points in the sequence? Number of "grapheme clusters"
a.k.a. "user-perceived characters" as Unicode puts it? Width as
represented on the output device? On an ASCII device these usually had
the same answer (control characters excepted). But even at the Bell
Labs CSRC in the 1970s, thanks to troff, the staff knew that they didn't
necessarily have to. (How wide is an em dash? How many bytes represent
it, in the formatting language and in the output language?)
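
A concrete illustration, assuming a UTF-8 locale (the escaped bytes
below are the UTF-8 encoding of U+0301, COMBINING ACUTE ACCENT):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 'e' followed by a combining acute accent: displays as one
       user-perceived character. */
    const char *s = "e\xcc\x81";

    printf("bytes: %zu\n", strlen(s));   /* prints 3 */
    /* Code points: 2.  Grapheme clusters: 1.  Columns on a
       terminal: 1.  strlen() answers only the first question. */
    return 0;
}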
> Giggle... In a device driver I wrote for V6, I used the expression
>
>     "0123"[n]
>
> and the two programmers whom I thought were better than me had to
> ask me what it did...
>
> -- Dave, brought up on PDP-11 Unix[*]
I enjoy this application of that technique, courtesy of Alan Cox.
fsck-fuzix: blow 90 bytes on a progress indicator
static void progress(void)
{
        static uint8_t progct;
        progct++;
        progct&=3;
        printf("%c\010", "-\\|/"[progct]);
        fflush(stdout);
}
> I still remember the days of BOS/PICK/etc, and I staked my career on
> Unix.
Not a bad choice. Your exposure to and recollection of other ways of
doing things, I suspect, made you a more valuable contributor than those
who mazed themselves with thoughts of "the Unix way" to the point that
they never seriously considered any other.
It's fine to prefer "the C way" or "the Unix way", if you can
intelligibly define what that means as applied to the issue in dispute,
and coherently defend it. Demonstrating an understanding of the
alternatives, and being able to credibly explain why they are inferior
approaches, is how to do advocacy correctly.
But it is not the cowboy way. The rock star way.
Regards,
Branden
[1] Unfortunately I must concede that this claim is less true than it
used to be thanks to the relentless pursuit of trade-secret means of
optimizing hardware performance. Assembly languages now correspond,
particularly on x86, to a sort of macro language that imperfectly
masks a massive amount of microarchitectural state that the
implementors themselves don't completely understand, at least not in
time to get the product to market. Hence the field day of
speculative execution attacks and similar. It would not be fair to
say that CPUs of old had _no_ microarchitectural state--the Z80, for
example, had the not-completely-official `W` and `Z` registers--but
they did have much less of it, and correspondingly less attack
surface for screwing your programs. I do miss the days of
deterministic cycle counts for instruction execution. But I know
I'd be sad if all the caches on my workaday machine switched off.
[2] https://queue.acm.org/detail.cfm?id=3212479