Warren's 2012 Summary of the Systems Architecture Subject
1 Basics of Computer Architecture
Different levels: user mode, kernel mode, hardware. What is at each
level. What is the interface to each level.
Hardware: the ISA is the interface. Operating system: the system call API is
the interface.
von Neumann architecture, the instruction cycle. Internals of the CPU:
ALU, registers, control logic.
Busses connect memory and CPU, and also device controllers.
Before von Neumann, computers were hardwired. von Neumann means instructions
can be stored in memory.
Memory: seen as an array of bytes, each one with an address. However,
the bus size allows reads/writes in multiples of bytes.
Three busses: address bus, data bus, control/status bus. What each
one does, and what their sizes mean for the CPU design.
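A quick worked sketch of my own (not from the lectures) of what the address bus width means: an n-bit address bus can name 2^n distinct byte addresses.

    #include <stdio.h>

    /* Sketch: how many bytes an n-bit address bus can address. */
    int main(void) {
        for (int n = 16; n <= 32; n += 8) {
            unsigned long long bytes = 1ULL << n;     /* 2^n addresses */
            printf("%2d-bit address bus: %llu bytes (%llu KiB)\n",
                   n, bytes, bytes / 1024);
        }
        return 0;
    }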
2 Instruction Set Design
ISA design decisions:
Bus sizes.
How many registers, and why.
How many operations, and which ones.
How many operands per instruction, and literal values.
Instruction size, and how many instruction format(s).
Main differences between CISC and RISC: note that RISC != fewer instructions.
Basic MIPS ISA: number of registers, instruction formats & the fields in each
one, literals, the basic set of instructions.
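As a sketch of my own, pulling the standard 6/5/5/5/5/6 R-type fields out of a 32-bit MIPS instruction word:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: decode the fields of a 32-bit MIPS R-type instruction word. */
    int main(void) {
        uint32_t insn = 0x012A4020;               /* add $t0, $t1, $t2   */
        uint32_t op    = (insn >> 26) & 0x3F;     /* 6-bit opcode        */
        uint32_t rs    = (insn >> 21) & 0x1F;     /* 5-bit source reg    */
        uint32_t rt    = (insn >> 16) & 0x1F;     /* 5-bit source reg    */
        uint32_t rd    = (insn >> 11) & 0x1F;     /* 5-bit dest reg      */
        uint32_t shamt = (insn >>  6) & 0x1F;     /* 5-bit shift amount  */
        uint32_t funct =  insn        & 0x3F;     /* 6-bit function code */
        printf("op=%u rs=%u rt=%u rd=%u shamt=%u funct=%u\n",
               op, rs, rt, rd, shamt, funct);
        return 0;
    }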
3 Addressing Modes & Control Logic
RISC is generally a load/store architecture. CISC instructions can
work directly on RAM.
Addressing modes make things easier/harder for the programmer/compiler
writer. Each one helps to implement certain things: branches for loops/IFs,
jumps for function calls, base/index/offset addressing for structures
like arrays, objects, linked lists.
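A small sketch of my own showing why base + index*size addressing suits arrays: it is exactly the effective-address calculation for a[i].

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: base + index*size addressing, done "by hand". */
    int main(void) {
        int a[4] = { 10, 20, 30, 40 };
        uintptr_t base = (uintptr_t)a;                 /* base register     */
        int i = 2;                                     /* index register    */
        int *elem = (int *)(base + i * sizeof(int));   /* effective address */
        printf("a[%d] = %d\n", i, *elem);              /* prints a[2] = 30  */
        return 0;
    }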
How the CPU works: some components are sequential (memory elements),
i.e. stateful; some components are combinational (e.g. the ALU).
Instruction phases: fetch, decode, execute etc. The PC controls where
to get the next instruction. The IR holds the fetched instruction, which
is decoded by the control logic.
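A toy sketch of my own of the instruction cycle, for an invented one-byte ISA (not MIPS): the PC picks the next instruction, the IR holds it while it is decoded and executed.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy fetch-decode-execute loop for an invented 1-byte ISA:
     * 0 = halt, 1 = increment accumulator, 2 = print accumulator. */
    int main(void) {
        uint8_t mem[] = { 1, 1, 2, 0 };   /* tiny "program" in memory  */
        int pc = 0, acc = 0, running = 1;
        while (running) {
            uint8_t ir = mem[pc++];       /* fetch into IR, advance PC */
            switch (ir) {                 /* decode + execute          */
            case 0: running = 0;             break;
            case 1: acc++;                   break;
            case 2: printf("acc=%d\n", acc); break;
            }
        }
        return 0;
    }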
Multiplexers take multiple inputs and give only one output; a control line
chooses which input is passed through.
Basic difference between hardwired and microcoded control logic. RISC
is generally hardwired, CISC generally microcoded.
4 Control/Flow of Execution
Branches are used to implement IF and all the loops. Know how to translate
high-level IF/FOR/WHILE/SWITCH into assembly code.
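As a worked example of my own (register choices are arbitrary), a high-level IF with one possible MIPS-style translation in the comments:

    /* Sketch: a high-level IF and one possible MIPS-style translation.
     *
     *   if (a < b)           #     slt  $t0, $s0, $s1   # t0 = (a < b)
     *       x = 1;           #     beq  $t0, $zero, ELSE
     *   else                 #     li   $s2, 1
     *       x = 2;           #     j    DONE
     *                        # ELSE:
     *                        #     li   $s2, 2
     *                        # DONE:
     */
    int branch_demo(int a, int b) {
        int x;
        if (a < b)
            x = 1;
        else
            x = 2;
        return x;
    }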
Function calls: need to know the address of the 1st instruction in the function. Need
to store our regs somewhere. Need to store the return address somewhere.
Therefore, we need a stack to hold all this, especially for nested function
calls (think recursion).
The structure is known as a stack frame. Know the basic structure
of a stack frame: return address, stored regs, local vars, args to
next function.
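A sketch of my own: each call to this recursive function gets its own frame holding (roughly) the return address, saved registers, the local r and the argument for the next call. Exact layouts differ between ISAs and compilers.

    #include <stdio.h>

    /* Sketch: each call to fact() pushes a fresh stack frame, which is
     * why recursion works at all. */
    int fact(int n) {
        if (n <= 1)
            return 1;
        int r = n * fact(n - 1);   /* argument n-1 and local r live in this frame */
        return r;
    }

    int main(void) {
        printf("5! = %d\n", fact(5));   /* prints 5! = 120 */
        return 0;
    }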
Know the basic memory layout of a process: code at the bottom, global data
above the code (can grow), stack at the top (can grow).
Different ISAs have different instructions to do function calls/returns.
Some (e.g. MIPS) just jump you to the top of the function, and you have to build
the frame yourself. Other CPUs (e.g. VAX) build the frame and do everything for you.
5 User and Kernel Mode, System Calls, I/O, Exceptions
User/Kernel Mode: why both, why do we need at least 2? Some architectures
have more than 2 modes (e.g. Intel has 4 rings).
What each one permits/denies. How to switch from each mode to the other:
syscalls, exceptions, interrupts. A special instruction gets us back
to umode from kmode. Know how syscalls work.
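A small sketch assuming a Linux system: the write() library wrapper and the raw syscall() interface both end up trapping into kmode and returning to umode.

    #define _DEFAULT_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Sketch (Linux-specific): two ways to ask the kernel to do I/O.
     * Both trap into kernel mode and return to user mode afterwards. */
    int main(void) {
        const char msg[] = "hello from user mode\n";
        write(1, msg, sizeof(msg) - 1);                 /* libc wrapper       */
        syscall(SYS_write, 1, msg, sizeof(msg) - 1);    /* raw syscall number */
        return 0;
    }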
I/O devices: mapped on the address bus, so they appear as memory locations.
Each device has a controller with its own logic and registers. The CPU reads/writes
these registers to get/send data, and also to get status and send commands.
Know the basic idea of how to send commands to a device to get it to do something,
and how it reports status back in a status register.
Polling the status register is a waste of CPU (busy waiting), so instead we get
the device to send an interrupt to the CPU.
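A sketch of my own where two ordinary variables stand in for a controller's memory-mapped status and data registers; the loop is the busy wait we want to avoid.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch only: these two variables simulate a device controller's
     * memory-mapped status and data registers. */
    volatile uint8_t status_reg = 0;
    volatile uint8_t data_reg   = 0;
    #define READY_BIT 0x01

    uint8_t read_byte_polled(void) {
        while ((status_reg & READY_BIT) == 0)
            ;                        /* busy wait: wastes CPU until the device is ready */
        return data_reg;             /* device has put a byte in the data register      */
    }

    int main(void) {
        data_reg = 42;               /* pretend the device produced a byte ... */
        status_reg = READY_BIT;      /* ... and raised its READY bit           */
        printf("got %u\n", read_byte_polled());
        return 0;
    }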
Know what an interrupt is, how they work, how the CPU gets sent to the interrupt
handler, how it gets back to what it was doing, the priority levels for
interrupts, and which devices get the highest priority.
Know how exceptions work, what causes them and why. The OS can field the
exception and decide what to do about it. Examples: an attempt to do kmode stuff
from umode, a page fault.
How we can use syscalls/interrupts/exceptions to make a virtual computer.
6 CPU Memory Management, Context Switching
Memory management at the hardware level.
Idea of pages: fixed-size lumps of logical memory which are mapped to fixed-size
frames of physical memory.
Idea of address space: each process sees a different reality of the
layout of memory. Page mappings: an offset within a page maps to
the same offset within the frame.
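A sketch of my own, assuming 4 KiB pages and a made-up page table, of splitting a logical address into page number and offset and rebuilding the physical address:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE   4096u                  /* assume 4 KiB pages */
    #define OFFSET_MASK (PAGE_SIZE - 1)

    /* Toy page table: logical page number -> physical frame number. */
    static const uint32_t page_table[4] = { 7, 3, 12, 5 };

    uint32_t translate(uint32_t vaddr) {
        uint32_t page   = vaddr / PAGE_SIZE;       /* which logical page       */
        uint32_t offset = vaddr & OFFSET_MASK;     /* same offset in the frame */
        uint32_t frame  = page_table[page];
        return frame * PAGE_SIZE + offset;
    }

    int main(void) {
        uint32_t v = 2 * PAGE_SIZE + 10;           /* page 2, offset 10 */
        printf("vaddr %u -> paddr %u\n", v, translate(v));   /* frame 12, offset 10 */
        return 0;
    }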
Permissions on each page: what are they, what are they used for.
Context switching: how is it achieved? We get regular interrupts
from some clock device (ticks), typically 100 or 1000 times a second.
On each tick, choose a new process, save the old process' registers and memory
map into its PCB, and reload the map & registers for the new one from its
PCB. A bigger map for a process => slower context switches.
7 Introduction to Operating Systems
Main tasks for an OS: hide the hardware details from the user mode
programs, provide a substitute in the way of abstract services, manage
the hardware resources of the computer.
Uses kmode and umode to do all this: stop umode programs from getting
access to all the resources.
The interface is the set of system calls the OS provides: its API.
Types of OS: mainframe, server, desktop, real-time, embedded. Know
the basic characteristics of each.
Basic abstractions that OS provides: process, logical memory, files,
windows, networking.
The syscall API is the interface: can build several OS that provide
the same API.
Different internal architectures: monolithic, client-server, microkernels.
Know the basic characteristics of each.
Understand the difference between the top-half and bottom-half of
a monolithic kernel.
8 Processes
Process: a sequence of computer instructions executing within a restricted
environment, i.e. logical address space with registers and access
to syscalls.
We call the environment the context of the process.
Know the logical memory layout of a process.
Know the main state diagrams for the different process models: embedded,
run to completion, run to block, run to block with pre-emption.
Know why we need pre-emption for interaction with the user (or clients
on a server).
Timeslice: how long a CPU-bound process can run before it loses the
CPU. Usually 1/1000 to 1/100 of a second.
Process Control Block (PCB): know what it stores.
Know what is a CPU-bound and an I/O-bound process.
Know the mechanics of a context switch.
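A minimal sketch of my own of a cut-down PCB and the guts of a context switch; a real PCB also records state, priority, open files and the full memory map.

    #include <stdint.h>
    #include <string.h>

    /* Sketch: a cut-down Process Control Block. */
    struct cpu_state {
        uint32_t pc, sp;
        uint32_t regs[16];
    };
    struct pcb {
        struct cpu_state ctx;   /* saved registers               */
        uint32_t page_table;    /* stand-in for the memory map   */
    };

    static struct cpu_state the_cpu;    /* pretend this is the real CPU */

    /* On each timer tick the OS does roughly this: */
    void context_switch(struct pcb *old, const struct pcb *new) {
        memcpy(&old->ctx, &the_cpu, sizeof the_cpu);  /* save old process */
        memcpy(&the_cpu, &new->ctx, sizeof the_cpu);  /* load new process */
        /* a real switch also reloads the memory map, which is why a
           bigger map makes the switch slower */
    }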
Process scheduling: the main goals are fairness, efficiency,
response time, throughput, and enforcing policy.
Know how all of these work: FCFS, static priority scheduling, dynamic
priority scheduling.
Know why starvation can occur, especially with static priority scheduling.
Know about the idle process.
9 Memory Management
Historical memory management schemes: bare machine, OS in ROM, partitions.
Partitions: base & limit registers, how they work to map a process'
address space to a part of physical memory.
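A small sketch of my own of base & limit translation: every logical address is checked against the limit and relocated by the base.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: partition-style address translation with base & limit registers. */
    struct partition { uint32_t base, limit; };

    /* Returns 0 and the physical address, or -1 if the access is out of range. */
    int translate(struct partition p, uint32_t logical, uint32_t *physical) {
        if (logical >= p.limit)
            return -1;                 /* would trap: address outside the partition */
        *physical = p.base + logical;  /* every access is relocated by the base     */
        return 0;
    }

    int main(void) {
        struct partition p = { 0x40000, 0x10000 };   /* 64 KiB partition at 256 KiB */
        uint32_t phys;
        if (translate(p, 0x1234, &phys) == 0)
            printf("logical 0x1234 -> physical 0x%x\n", phys);  /* 0x41234 */
        return 0;
    }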
Warren's concepts of logical and physical memory, and how they are
different.
Strategies to allocate partitions: first fit, best fit, worst fit,
why best fit is bad and worst fit is better.
Partitions suffer from external fragmentation; memory must be copied (compacted)
to remove the gaps.
Idea of pages: fixed-size lumps of logical memory which are mapped to fixed-size
frames of physical memory.
Idea of address space: each process sees a different reality of the
layout of memory. Page mappings: an offset within a page maps to
the same offset within the frame.
Permissions on each page: what are they, what are they used for.
Page advantages: we can share read-only pages, and we can share read-write pages
copy-on-write, but that costs a page fault exception when one process
writes to the shared page, plus the cost of the copy.
Page disadvantages: internal fragmentation, which can only be reduced with smaller
pages; but smaller pages mean more pages per process, hence bigger page tables
and slower context switches.
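A quick worked example of my own, assuming 4 KiB pages, of internal fragmentation in the last page of an allocation:

    #include <stdio.h>

    /* Sketch: bytes wasted in the last page of an allocation. */
    int main(void) {
        unsigned page = 4096, size = 10000;          /* assume 4 KiB pages */
        unsigned pages  = (size + page - 1) / page;  /* pages needed: 3    */
        unsigned wasted = pages * page - size;       /* 12288 - 10000      */
        printf("%u pages, %u bytes of internal fragmentation\n", pages, wasted);
        return 0;
    }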
10 Virtual Memory, Disks
Paging: copying pages in/out of memory to disk. This is different
to a paged architecture.
We page out pages when we are running short of available memory to
allocate to process' memory requests.
Pages on the disk only have to be brought in when they are required,
so we can do lazy paging.
Idea of the working set: those pages that a process needs right now
to execute, this can be less than all the memory allocated to a process.
This is related to the locality of reference idea: over any short
interval, a process only uses a small fraction of its entire memory.
We need to keep the working sets of all processes in memory; we can
choose to evict the pages not currently required, but we only do this when
we run short of memory.
Page replacement strategies: FIFO, NRU, Second Chance, LRU.
FIFO: choose the page allocated for the longest time to evict. Not
considered good, as this might be a heavily-used page.
NRU: classify all pages into 4 groups by their Accessed (A) and Modified (M)
bits: A=0, M=0; A=0, M=1; A=1, M=0; A=1, M=1. Choose a page from the lowest
non-empty group.
Second Chance: a variation on FIFO. When a page is needed, look at
the oldest page. If its A bit is 0 (i.e. it hasn't been accessed recently),
use it. Otherwise, set A to 0 and put it on the tail (as if it had
just been allocated). What this is doing is looking for a page that
has not been accessed in a long time.
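A sketch of my own of Second Chance over a circular FIFO of frames, using only the A bit:

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch: Second Chance on a circular queue of frames.  We evict the
     * first frame whose A bit is 0, clearing A bits as we pass over them. */
    #define NFRAMES 4

    struct frame { int page; bool accessed; };
    static struct frame frames[NFRAMES] = {
        { 10, true }, { 11, false }, { 12, true }, { 13, true }
    };
    static int oldest = 0;    /* head of the FIFO */

    int pick_victim(void) {
        for (;;) {
            struct frame *f = &frames[oldest];
            if (!f->accessed)                     /* not used recently: evict it */
                return oldest;
            f->accessed = false;                  /* give it a second chance     */
            oldest = (oldest + 1) % NFRAMES;      /* treat it as newly allocated */
        }
    }

    int main(void) {
        int v = pick_victim();
        printf("evict frame %d (page %d)\n", v, frames[v].page);  /* frame 1 */
        return 0;
    }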
LRU: considered the best strategy, but can be expensive to implement.
When a page is needed, pick the page unused for the longest time.
Impossible to do accurately, so it is always approximated.
Paging can make a computer seem to have more physical memory than
it actually has, as allocated memory = RAM + pages on the disk. This
only works when we can fit the working set of all the processes in
physical memory.
If the working sets are bigger than physical memory, we end up thrashing: moving
pages in and out to satisfy the current processes' memory accesses. Disks
are very slow, so thrashing slows down a computer terribly. Better
to buy more RAM so there is enough to hold all the working sets.
Disk: several platters with magnetic materials on each side. Several
arms move in/out together to reach a specific track.
All tracks at the same position on the platters are known as a cylinder.
Disks are physical (mechanical) devices. Seek time is much bigger than rotational
delay, which is much bigger than transfer time.
We need to schedule the arm to avoid seeking as much as possible.
Several strategies:
FCFS: service the disk requests in the order that they are made. Leads
to lots of seeking, so not good.
SSF: shortest seek first. After each seek, reorder the pending queue
to make the next seek the shortest. Better, but can lead to starvation
of requests on the outermost/innermost tracks.
SCAN: also known as the elevator algorithm. When going from low to
high tracks, reorder the pending queue from low to high and do the
requests in order. When going from high to low, reorder the other
way. Improves seek times, and a good property is that any request only
has to wait at most 2 complete disk sweeps before it is serviced.
C-SCAN: SCAN but only service requests in ascending order. Makes the
algorithm simpler, and is also good because long seeks allow the arm
to accelerate, so the long sweep back from high to low also improves
seek times.
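A sketch of my own of C-SCAN ordering: sort the pending requests, service everything at or above the head position in ascending order, then wrap to the lowest track.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: order pending track requests the C-SCAN way. */
    static int cmp(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        int head = 50;
        int req[] = { 82, 17, 43, 140, 24, 90 };
        int n = sizeof(req) / sizeof(req[0]);
        qsort(req, n, sizeof(int), cmp);           /* 17 24 43 82 90 140 */

        printf("service order:");
        for (int i = 0; i < n; i++)                /* first: tracks >= head, ascending */
            if (req[i] >= head) printf(" %d", req[i]);
        for (int i = 0; i < n; i++)                /* then wrap and take the rest      */
            if (req[i] < head) printf(" %d", req[i]);
        printf("\n");                              /* 82 90 140 17 24 43 */
        return 0;
    }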
DMA: nothing to do with little green men. Tell the device controller
where to write data into memory, or where to read data from memory.
The device accesses memory across the busses like the CPU, which means
the CPU doesn't have to do any copying from the device's buffer for
I/O. This saves CPU time which can be used to do more user-mode instructions.
Device does the copy, and sends an interrupt when the I/O + copy is
completed.
11 File Systems and their Performance
A file: abstract idea, persistent storage for some data.
Files have contents and attributes: name, type, owner, permissions,
location in a directory, timestamps etc.
OS has to provide syscalls to access files: open, close, read, write,
create, remove, change attributes etc.
Directories are used to help group files logically. They only exist
to make life easier for humans.
A filesystem holds the data of the files, plus the metadata (attributes),
plus the directory structure.
We need to keep: a list of free blocks, the list of blocks per file, file
metadata, and the directories. So a filesystem has an overhead: not all of
the disk can be used just to store file contents.
How to join attributes and file blocks:
Linux: the name points to the file's attributes in an i-node, and the i-node points
to a tree of disk blocks. We can have hard links: a file with multiple
names, where no name has preference over the others.
Windows: name and attributes are stored together in a directory entry, which
points to the file's first block in the FAT.
Internal fragmentation: the last block of any file is not completely
full.
We want all the blocks of a file near each other, track-wise. We can't
use contiguous layout as this stops files from growing if they are
adjacent to other files. But non-contiguous layout can cause blocks
to be spread over the disk, requiring a defragmentation to bring them
back together.
How to store the free list of blocks: we could use a list of free block numbers,
which starts big but gets smaller as the disk is used up; the list itself
takes a lot of space. An alternative is to use a bitmap, where each block
is represented by a bit, 0=free, 1=used. The bitmap is constant in size,
usually much smaller than a list.
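A sketch of my own of the bitmap approach, one bit per block with helpers to mark and test:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: free-block bitmap, one bit per block (0 = free, 1 = used). */
    #define NBLOCKS 1024
    static uint8_t bitmap[NBLOCKS / 8];     /* 1024 blocks need only 128 bytes */

    void mark_used(int b) { bitmap[b / 8] |=  (1u << (b % 8)); }
    void mark_free(int b) { bitmap[b / 8] &= ~(1u << (b % 8)); }
    int  is_used(int b)   { return (bitmap[b / 8] >> (b % 8)) & 1; }

    int main(void) {
        mark_used(42);
        printf("block 42 used? %d, block 43 used? %d\n", is_used(42), is_used(43));
        return 0;
    }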
FAT filesystem: a FAT is a table (array) with as many entries as blocks
on the disk. Each entry holds the number of the next block in a file,
or a special value meaning "no more blocks in the file", or
special values meaning free or bad.
A file is therefore represented as a singly-linked list of block numbers
in the FAT table. Good for going forward in the file, not good for
random access or going backward.
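A sketch of my own of following a file's chain through a toy FAT, with -1 standing in for the end-of-file marker:

    #include <stdio.h>

    /* Sketch: a toy FAT where fat[b] is the next block of the file that
     * occupies block b, and -1 means "no more blocks". */
    #define EOF_BLOCK (-1)

    static int fat[8] = { -1, 4, -1, -1, 6, -1, EOF_BLOCK, -1 };
    /* A file whose first block is 1 occupies blocks 1 -> 4 -> 6. */

    int main(void) {
        printf("file blocks:");
        for (int b = 1; b != EOF_BLOCK; b = fat[b])  /* forward traversal is easy...   */
            printf(" %d", b);
        printf("\n");               /* ...random access means walking from the start   */
        return 0;
    }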
Historically, each entry was 12 bits long (max 4,096 blocks, good
for floppies), or 16 bits (65,536 blocks, OK for 30M disk drives);
now FAT-32 allows around 4 billion blocks, which suits filesystems up to 2 terabytes.
System V Unix Filesystem: the name is separated from the i-node, which holds the
attributes. The i-node holds the list of the first 10 blocks (good for small files).
After that, a block is allocated to hold the next 256 block numbers
in an array: a single-indirect block. After that, a double-indirect
block is allocated to point to several single-indirect blocks. Ditto,
triple-indirect blocks. So Unix has a tree structure to quickly find any disk
block used by the file, much better than a singly-linked list.
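A quick worked calculation of my own, assuming 1 KiB blocks and 256 block numbers per indirect block, of how many blocks the whole i-node tree can reach:

    #include <stdio.h>

    /* Sketch: maximum file size reachable by 10 direct + single +
     * double + triple indirect blocks, assuming 1 KiB blocks and
     * 256 block numbers per indirect block. */
    int main(void) {
        long long block = 1024, ptrs = 256;
        long long direct = 10;
        long long single = ptrs;                  /* 256 blocks        */
        long long dbl    = ptrs * ptrs;           /* 65,536 blocks     */
        long long triple = ptrs * ptrs * ptrs;    /* 16,777,216 blocks */
        long long blocks = direct + single + dbl + triple;
        printf("%lld blocks ~= %lld MiB maximum file size\n",
               blocks, blocks * block / (1024 * 1024));
        return 0;
    }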
Each i-node keeps a link count of the file names that point to it, so the file
is deleted only when all the filenames go away.
Problem of early Unix: all i-nodes placed on low track numbers, so
lots of seeking to get file attributes plus their contents.
Berkeley Fast Filesystem: place i-nodes across the disk, when starting
a new file choose an i-node close to the first disk block. Reduces
seek between them.
Lots of other advantages for the FFS: read the lecture notes!
We didn't spend time looking at file system reliability & performance.
Holey files: don't allocate blocks when their contents are all zeroes.
12 Synchronisation and Threads
Interprocess communication: we need this as processes often don't
act independently, so they need to be able to communicate.
Files can be used for IPC, but generally only when communication can
be time separated: write now, read later. Getting synchronised access
to a file at the same time with multiple processes is problematic.
Pipes: a buffer in the OS where one process writes into the buffer,
the other reads from it. Suitable for passing a stream of data between
processes. The OS can block each one if the pipe is empty or completely
full. Internet connections are effectively bi-directional pipes. Pipe
issues: a stream of data has no structure, so you need to insert markers
in the stream to mark separate sections.
Shared memory is the fastest IPC. Map pages shared read-write between
2 or more processes. A write by one process is seen immediately by
another process. Now we have to synchronise access to the shared data.
Race condition: a flaw in a system whereby the output or result is
unexpectedly and critically dependent on the sequence or timing of
other events (Wikipedia).
Critical section: a piece of code that accesses a shared resource
(data structure or device) that must not be concurrently accessed
by more than one thread of execution, otherwise it will lead to a
race condition (Wikipedia).
Make sure you understand why race conditions occur, and why a critical
section has to be atomic to prevent race conditions.
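A classic sketch of my own using POSIX threads of why counter++ is a critical section: without locking, the final total usually comes up short.

    #include <pthread.h>
    #include <stdio.h>

    /* Sketch: two threads incrementing a shared counter without any locking.
     * counter++ is really load/add/store, so updates can be lost. */
    static long counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            counter++;               /* critical section, unprotected */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }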
To protect a critical section (and so avoid race conditions), any solution must
meet the following conditions:
1. No two processes can be inside the critical section simultaneously.
2. No assumptions are made about the relative speeds of each process
or the number of CPUs.
3. No process stopped outside its critical section should block another
process.
4. No process should wait arbitrarily long to enter the critical section.
Infinite timeslices: violates conditions 1 and 4.
Strict alternation: violates conditions 2 and 3, and requires busy
waiting.
Test and set lock instruction: good, but still requires busy waiting.
Semaphores: Dijkstra in 1965. Block a process trying to enter a critical
section already locked by another process. When the process leaves
the critical section, it unlocks it and the OS wakes up 1 blocked
process and allows it into the critical section.
Two semaphore primitives: acquire(semaphore) and release(semaphore).
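A sketch assuming POSIX semaphores, where sem_wait/sem_post play the roles of acquire/release around a critical section:

    #include <semaphore.h>
    #include <stdio.h>

    /* Sketch: a binary semaphore guarding a critical section.
     * sem_wait() blocks if someone is already inside;
     * sem_post() lets one blocked waiter in when we leave. */
    static sem_t mutex;
    static long shared = 0;

    void update_shared(void) {
        sem_wait(&mutex);        /* acquire: enter the critical section */
        shared++;                /* only one thread runs this at a time */
        sem_post(&mutex);        /* release: wake one blocked waiter    */
    }

    int main(void) {
        sem_init(&mutex, 0, 1);  /* initial value 1: section starts unlocked */
        update_shared();
        printf("shared = %ld\n", shared);
        sem_destroy(&mutex);
        return 0;
    }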
Semaphores require programmers to remember to use them. Languages
can implicitly use them (monitors), so programmers don't have to remember.
The OS itself cannot use semaphores, as no section of the OS is a
process. The OS disables interrupts to ensure the CPU doesn't go off
elsewhere when it is inside a critical section. We have to trust the
OS programmers to get it right.
Threads: multiple execution units living inside an address space.
Each execution unit can be scheduled.
Understand why threads are useful: servers with multiple clients,
break up execution over several CPUs. One I forgot to mention is a
GUI process which has to keep the widgets on the window updated while
it also does some CPU-intensive work.
Each thread needs its own stack for its own stack frames. One problem
is how to keep the stacks separate in the single address space. We
have to be careful not to call too many functions in a single thread,
or that thread's stack will collide with another thread's stack.
Also, now all the global data is shared, so we need semaphores etc.
to synchronise access.
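A sketch of my own with POSIX threads: each thread's loop counter lives on its own stack, while the global total is shared and so needs the mutex.

    #include <pthread.h>
    #include <stdio.h>

    /* Sketch: two threads in one address space.  The local i lives on each
     * thread's own stack; total is global and shared, so it is protected. */
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        long id = (long)arg;
        for (int i = 0; i < 5; i++) {      /* i: private, on this thread's stack */
            pthread_mutex_lock(&lock);
            total += id;                   /* total: shared global data          */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, (void *)1L);
        pthread_create(&t2, NULL, worker, (void *)100L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("total = %ld\n", total);    /* 5*1 + 5*100 = 505 */
        return 0;
    }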
Different thread models: user threads, kernel threads, lightweight
processes, mediumweight processes.
User threads: OS doesn't know about threads, just processes. A library
does the thread scheduling at user level: each thread has to co-operatively
give up the CPU so another thread in the process can run.
Kernel threads: the kernel does a lot of abstract work which isn't
directly related to the hardware, e.g. the filesystem. Running this code
as threads with kernel-mode permissions means we can schedule it.
I didn't spend much time on this.
Lightweight processes: here the OS is thread-aware, and all threads
in a process are scheduled by the OS, and have full access to the
shared address space. Makes context switching between them faster.
Mediumweight processes: OS is thread-aware, but only the code and
global data pages are shared, each thread has its own stack page mappings.
This removes the potential stack collision problem, but makes context
switching a bit slower.
Heavyweight processes: these are normal processes, only the code is
shared read-only, and no read-write memory is shared.