On Wed, Feb 3, 2021 at 8:34 PM Larry McVoy <lm@mcvoy.com> wrote:
> I have to admit that I haven't looked at ARM assembler, the M1 is making
> me rethink that.  Anyone have an opinion on where ARM lies in the pleasant
> to unpleasant scale?

Redirecting to "COFF" as this is drifting away from Unix.

I have a soft spot for ARM, but I wonder if I should. At first blush, it's a pleasant RISC-ish design: loads and stores for dealing with memory, arithmetic and logic instructions that work on registers and/or immediate operands, etc. As others have mentioned, there's an inline barrel shifter in the ALU that a lot of instructions can take advantage of in their second operand; you can rotate, shift, etc., an immediate or register operand while executing an instruction. Here's code for setting up page table entries for an identity mapping of the low part of the physical address space (the root page table is at physical address 0x40000000):

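        @ r3 is assumed to already hold the section descriptor attribute bits
        MOV     r1, #0x0000             @ r1 = 0x40000000, the table base
        MOVT    r1, #0x4000
        MOV     r0, #0                  @ r0 = section index
.Lpti:  MOV     r2, r0, LSL #20         @ r2 = physical base of section r0
        ORR     r2, r2, r3              @ merge in the attribute bits
        STR     r2, [r1], #4            @ store the descriptor, post-increment r1
        ADD     r0, r0, #1
        CMP     r0, #2048               @ 2048 x 1MiB = the low 2GiB
        BNE     .Lpti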

(Note the `LSL #20` in the `MOV` instruction.)
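
For anyone who would rather read the loop in a high-level language, here is a rough Python model of what it computes. The `ATTRS` value is a placeholder assumption standing in for whatever `r3` holds (the snippet above never shows it), and `identity_sections` is a made-up name for illustration.

```python
# Model of the loop above: build first-level section descriptors for an
# identity mapping. ATTRS stands in for the contents of r3; its value here
# is a placeholder assumption, not something taken from the original code.
SECTION_SHIFT = 20   # each section covers 1 MiB
ATTRS = 0x2          # hypothetical: low bits 0b10 mark a section descriptor

def identity_sections(count, attrs=ATTRS):
    """Return `count` descriptors where entry i maps VA i MiB to PA i MiB."""
    return [((i << SECTION_SHIFT) | attrs) & 0xFFFFFFFF for i in range(count)]

table = identity_sections(2048)   # 2048 sections = the low 2 GiB
```

The shift and the OR are separate steps here; on ARM the shift rides along for free in the second operand of the `MOV`.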

32-bit ARM also has some niceness for conditionally executing instructions based on the condition codes currently set in the CPSR (ARM's name for the PSW), so you might see something like:

1:      CMP     r0, #0
        ADDNE   r1, r1, #1
        SUBNE   r0, r0, #1
        BNE     1b
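
A small Python model of that loop may make the predication easier to see. It is only a sketch: `run_loop` is a hypothetical name, and the model leans on the fact that `ADD` and `SUB` without the `S` suffix leave the flags alone, so all three `NE` instructions test the result of the same `CMP`.

```python
def run_loop(r0, r1):
    """Model the loop at label 1: move the count in r0 into r1, one at a time."""
    while True:
        ne = (r0 != 0)                    # CMP r0, #0 sets the flags
        if ne:
            r1 = (r1 + 1) & 0xFFFFFFFF    # ADDNE r1, r1, #1
            r0 = (r0 - 1) & 0xFFFFFFFF    # SUBNE r0, r0, #1
        if not ne:                        # BNE 1b falls through on "equal"
            return r0, r1
```

For example, `run_loop(5, 10)` returns `(0, 15)`: the loop body runs five times with no branch around the add and subtract.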

The architecture tends to map nicely to C and similar languages (e.g. Rust). There is a rich set of instructions for various kinds of arithmetic; for instance, there are saturating arithmetic instructions for DSP-style code. You can push multiple registers onto the stack at once, which is a little odd for a RISC ISA, but works OK in practice.
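
To illustrate what "saturating" means here, the following Python sketch contrasts a signed 32-bit saturating add (in the spirit of ARM's `QADD`) with an ordinary wrapping add. The function names are made up for illustration, and this models only the arithmetic result, not the sticky Q flag.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def qadd(a, b):
    """Signed 32-bit saturating add: clamp instead of wrapping on overflow."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

def wrapping_add(a, b):
    """Ordinary two's-complement 32-bit add, for contrast."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s & 0x80000000 else s
```

With these, `qadd(INT32_MAX, 1)` stays at `INT32_MAX`, while `wrapping_add(INT32_MAX, 1)` flips to `INT32_MIN`, which is exactly the glitch DSP code wants to avoid.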

The supervisor instruction set is pretty nice. IO is memory-mapped, etc. There's a co-processor interface for working with MMUs and things like it. Memory mapping is a little weird, in that the first-level page table isn't the same as the second-level tables: the first-level page table maps the 32-bit address space into 1MiB "sections", each of which is described by a 32-bit section descriptor; thus, to map the entire 4GiB space, you need 4096 of those in 16KiB of physically contiguous RAM. At the second level, tables occupying a 4KiB page frame map pages into the 1MiB section at different granularities; I think the smallest is 1KiB (thus, you need 1024 32-bit entries). To map a 4KiB virtual page to a 4KiB page frame, you repeat the relevant entry 4 times in the second-level table. It ends up being kind of annoying. I did a little toy kernel for ARM32 and ended up deciding to use 16KiB pages (basically, I map 4x4KiB contiguous pages) so I could allocate a single-sized structure for the page tables themselves.
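
The arithmetic in that paragraph can be checked with a few lines of Python. This is a sketch that takes the description above at face value (1MiB sections, 1KiB second-level granularity as recalled there); the helper names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
ENTRY_BYTES = 4                       # 32-bit descriptors

l1_entries = (4 * GIB) // MIB         # one descriptor per 1 MiB section
l1_bytes = l1_entries * ENTRY_BYTES   # size of the first-level table

l2_entries = MIB // KIB               # one entry per 1 KiB of the section
l2_bytes = l2_entries * ENTRY_BYTES   # size of one second-level table

def l1_index(va):
    """Top 12 bits of the virtual address pick the section descriptor."""
    return va >> 20

def l2_indices_4k(va):
    """Slots a 4 KiB page occupies in a 1 KiB-granular second-level table."""
    first = ((va >> 12) & 0xFF) * 4   # 4 KiB page number in the section, times 4
    return list(range(first, first + 4))
```

So the first-level table comes out at 4096 entries in 16KiB, a second-level table at 1024 entries in 4KiB, and a 4KiB page at virtual address 0x3000 lands in slots 12 through 15, which is the "repeat the entry 4 times" business.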

Starting with the ARMv8 architecture, it's been split into 32-bit AArch32 (basically the above) and 64-bit AArch64; the latter has expanded the number and width of general-purpose registers, and one register is a zero register in some contexts (and I think a stack pointer in others? I forget the details). I haven't played around with it too much, but looked at it when it came out and thought "this is reasonable, with some concessions for backwards compatibility." They cleaned up the paging weirdness mentioned above. The multiple push instruction has been retired and replaced with a "push a pair of registers" instruction; I viewed that as a compromise between code size and instruction set orthogonality.

So... overall quite pleasant, and far better than x86_64, but with some oddities.

        - Dan C.