Warren Toomey
wkt@cs.adfa.edu.au
Supercomputers are computers whose hardware is designed to solve particular classes of computational problems. Historically, these were the most expensive machines available: the CPUs were massively superscalar and had the best floating-point performance of their day, the I/O bandwidth was enormous, and the peripherals were over-engineered to make use of that bandwidth. Due to their cost, few units of any supercomputer model were produced, but companies such as Cray Research, Convex and Fujitsu were able to carve out a business in this niche market.
These machines were not general-purpose computers. In most supercomputers, both the hardware and software were tailored to best solve the desired problem set. A good example is the provision of vector processing. Here, the machine provides a set of basic operations that work on vectors: arrays of integer or floating-point numbers. An example of a vector operation might be to add two 64-element vectors (arrays) to produce a new 64-element vector. In a traditional computer, this could only be achieved with a 64-iteration loop. In a supercomputer, this is done in a single instruction.
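To make the vector idea concrete, the following C sketch performs that 64-element addition. On a conventional scalar CPU the loop body executes 64 times; on a vector supercomputer a vectorising compiler can collapse the whole loop into a single vector-add instruction. (The function and data here are purely illustrative.)

    #include <stdio.h>

    #define N 64

    /* Element-wise addition of two 64-element vectors.  A scalar CPU
     * iterates 64 times; a vectorising compiler for a vector machine
     * can turn the entire loop into one vector-add instruction. */
    void vector_add(const double a[N], const double b[N], double c[N])
    {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {
            a[i] = i;           /* a = 0, 1, 2, ... */
            b[i] = 2.0 * i;     /* b = 0, 2, 4, ... */
        }
        vector_add(a, b, c);
        printf("c[63] = %g\n", c[63]);   /* prints 189 */
        return 0;
    }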
Supercomputers and vector processing can only achieve high speeds when the problem can make use of the specialised functions provided in hardware and software []. If the solution to a problem cannot be written in a vectorised fashion, then a vector supercomputer will not be of any use. Fortunately, most large scientific and engineering problems have solutions which can make use of supercomputers' special features.
Despite their overall power, supercomputers do have several shortcomings. The first is their extreme cost, on the order of millions to hundreds of millions of dollars. Not all programs can be written to run effectively on supercomputers: simulations of large numbers of independent or interdependent objects cannot be vectorised, for example. And the limitations of a supercomputer's features may severely inhibit its use: you cannot multiply two 80-element vectors in a single operation on a 64-element vector processor.
In the 1950s, researchers and engineers took a new tack, that of creating computers with multiple CPUs that could intercommunicate via shared memory and/or some form of bus architecture. These multiprocessor computers were designed to solve those types of problems for which traditional uniprocessor supercomputers were not suited. Until recently, this form of supercomputing had failed to live up to its promise due to design difficulties.
The biggest bottleneck for any multiprocessor computer is the communications infrastructure, be it a shared bus or shared main memory. All inter-CPU communication must traverse this communications infrastructure, and a single CPU's processing will be delayed while it contends for access to this resource. Increasing the number of CPUs will increase the contention for the shared bus or memory in a non-linear fashion, and hence restrict the scalability of any multiprocessor design. In the 1960s and 1970s, traditional supercomputer designers despaired of the multiprocessor promise:
``For over a decade prophets have voiced the contention that the organisation of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit co-operative solution... Demonstration is made of the continued validity of the single processor approach.''
Gene Amdahl [].
``If you were plowing a field, what would you rather use: 2 strong oxen or 1,024 chickens?''
Seymour Cray, 1970s.
Since then, strange and radical approaches have been taken to try to overcome the communications bottleneck in multiprocessor design. For example, the Connection Machine built by Thinking Machines Corporation has up to 65,536 processors with no shared memory, arranged instead in a communications hypercube with eight connections per CPU, so that a data exchange between any two CPUs must pass through no more than four intermediate CPUs. This sort of multiprocessor design forces programmers to use a completely new programming paradigm, and it also restricts the types of problems that can be solved, just as uniprocessor supercomputers do. And because of its unique design, such a machine is still very expensive.
Multiprocessor computers like the Connection Machine have, like their supercomputer cousins, found a niche market, but their price has kept them well out of reach of many organisations.
In the last 10 years, the consumer and business interest in the Internet and in desktop computers has seen an enormous increase in the computing power available to a single user. Current CPUs such as the Pentium III, the Sparc and the Alpha have the same processing performance (if not the overall I/O performance) as mainframes from only a decade ago. Similarly, short-distance communications speeds between computers have gone from 10Mbps to 100Mbps, with switched (not shared) 1Gbps now becoming available.
The plummeting prices of commodity computing hardware and networking equipment led several people to consider whether a new form of multiprocessor design was possible. Rather than attempting to couple a large number of specialised CPUs together with a shared bus or memory (called close-coupling or tight-coupling), it might be cheaper to connect a number of off-the-shelf desktop computers together using standard switched LAN technology (called loose-coupling). Such loose-coupled systems are known as clusters.
In this design, CPUs have much more private memory than in normal multiprocessor designs: megabytes rather than kilobytes. This allows each CPU to host a fully-fledged operating system, rather than a simple monitor. This operating system can provide services such as virtual memory, file systems, protected threads and processes, interprocess communication and synchronisation.
The other difference between close-coupled and loose-coupled multiprocessor systems is the communications channels. There is a much larger latency with a loose-coupled system, as the use of networking peripherals, a LAN infrastructure and possibly a full networking stack in software adds delays in the inter-CPU communication.
Loose-coupled multiprocessor systems are more suited to running programs which can be divided up into separate tasks that can run in parallel on many CPUs, and where inter-task communication does not occur too frequently.
The term `Beowulf' does not define a single type of clustering technology. It is now used as a label for a group of clustering technologies that use off-the-shelf commodity hardware (PCs, for example), LAN technologies, freely available operating systems (Linux, FreeBSD etc.) and programming libraries to build cheap loose-coupled multiprocessor systems. In fact, `Beowulf' was the name of the first such cheap multiprocessor system:
``In the summer of 1994 Thomas Sterling and Don Becker, working at the Center of Excellence in Space Data and Information Sciences (CESDIS) under the sponsorship of the ESS project, built a cluster computer consisting of 16 Intel 486 processors connected by 10Mbps Ethernet. They called their machine Beowulf. The machine was an instant success and their idea of providing COTS (commodity off the shelf) base systems to satisfy specific computational requirements quickly spread through NASA and into the academic and research communities. The development effort for this first machine quickly grew into what we now call the Beowulf Project. At Supercomputing '96 both NASA and DOE demonstrated clusters costing less than US$50,000 that achieved greater than a gigaflop/s sustained performance. A year later,... on the floor of Supercomputing '97, Caltech's 140 node cluster ran an N-body problem at a rate of 10.9 Gflop/s.''
Phil Merkey [].
Beowulf clusters are similar to a `network of workstations' (NoW), but have a different purpose. A NoW such as www.distributed.net uses the spare CPU cycles on thousands of users' computers around the world to solve problems. The computers in a Beowulf cluster are dedicated to the cluster's work. The cluster nodes are also only connected to each other, and not to a larger network. The operating system and software libraries on a Beowulf cluster are tuned to provide maximum CPU usage and inter-node communication. The cluster thus provides a well-defined, isolated, tuned multiprocessor environment which can be managed and used to its best potential.
The original Beowulf cluster used a shared half-duplex 10Mbps Ethernet link between its 16 CPUs. At present, it is common to find switched full-duplex 100Mbps Fast Ethernet links between nodes in a cluster. This allows 100 or more CPUs to have a full-duplex 100Mbps link to any other CPU in the cluster. And as Gigabit Ethernet at 1Gbps becomes available, many clusters will change over to this technology.
Ethernet as a cluster networking technology does have some shortcomings. Data transmissions from a CPU are not guaranteed to reach their destination, and may be lost. Each Ethernet frame is a minimum of 64 bytes in length: transmitting a 20-byte message actually takes 84 bytes on the wire, so short messages carry high overheads. This is exacerbated if Internet protocols are used on top of Ethernet: an IP header adds a further 20 bytes, and a UDP header another 8 bytes, to every transmission.
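To see where the 84-byte figure comes from: a 20-byte message is padded up to Ethernet's 46-byte minimum payload, giving a 64-byte frame (14-byte header, 46-byte payload, 4-byte frame check sequence); once the 8-byte preamble and 12-byte inter-frame gap are also counted, the message occupies 84 byte-times on the wire, of which less than a quarter is useful data.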
Despite these limitations, 100Mbps Fast Ethernet technology is widely available and extremely cheap, and is the preferred cluster networking technology at the moment.
The feasibility of Beowulf and other off-the-shelf clusters has only really been possible because of the availability of free Open Source operating systems such as Linux and FreeBSD, and the free development tools that come with them. Even an operating system licence costing $100 per CPU would add $5,000 to the price of a 50-CPU cluster.
Both Linux and FreeBSD are clones of the Unix operating system, a very popular software development platform for researchers. Unix provides a programmer with a pre-emptive protected multitasking environment, virtual memory, interprocess communication and synchronisation, compilers, debuggers and other useful tools.
More importantly, as Open Source systems both Linux and FreeBSD come with full source code. This allows the developers of Beowulf clusters to examine and modify the code, to remove unneeded functions and to tune the performance of the underlying operating system. This can provide a dramatic increase in the speed of the cluster. One of the main Beowulf developers, Scyld Computing Corporation, has written replacement network device drivers to reduce each node's network latency. These drivers are, of course, Open Source and freely available.
One thing that neither Linux nor FreeBSD does is hide the fact that a Beowulf cluster is composed of multiple, independent systems. A process running on one node cannot directly see processes running on other nodes; in other words, there is no global, transparent process environment. There is also no ability for processes to migrate between nodes to balance the load on individual CPUs. Some of these limitations are overcome through the use of appropriate cluster programming libraries.
With only Linux or FreeBSD machines installed on a cluster, the cluster is just a collection of Unix machines on a LAN. A set of programming libraries is required to add extra functionality: to allow N instances of a task to be spawned, one per CPU; to provide for transparent process synchronisation across disparate machines; and to provide process communication across disparate machines.
The two common Beowulf clustering libraries in use are the Parallel Virtual Machine (PVM) library [] and the Message Passing Interface (MPI) library. Both are freely available in Open Source form. PVM is the older and more popular of the two, but MPI is now a standard for portable message-passing between parallel programs. MPI is available on all massively-parallel supercomputers (Beowulf or non-Beowulf), and is set to eventually replace PVM.
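To give a flavour of the message-passing style, the sketch below is a minimal MPI program in C: every process learns its rank, the non-zero ranks each send one integer to rank 0, and rank 0 prints what it receives. The message tag and contents are arbitrary choices for illustration.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        if (rank == 0) {
            int value;
            MPI_Status status;

            /* Rank 0 collects one integer from every other process. */
            for (int i = 1; i < size; i++) {
                MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                printf("received %d\n", value);
            }
        } else {
            /* Every other process sends its rank to rank 0 (tag 0). */
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

The same source compiles and runs unchanged on a Beowulf cluster or on a massively-parallel supercomputer, which is the main attraction of the MPI standard.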
As an example, let us look at the features and concepts of PVM. A PVM program runs as a collection of tasks spread across the virtual machine, i.e. the set of CPUs and resources enrolled in the cluster. Tasks can call library functions to pass messages, wait for other tasks, spawn new tasks and modify the virtual machine itself. PVM currently supports several programming languages: C, C++, Java and FORTRAN.
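The sketch below shows roughly how these PVM calls fit together in C. A single program acts as both master and worker: the master spawns copies of itself across the virtual machine and collects one integer from each. The program name `pvm_demo', the worker count and the message tag are illustrative assumptions, not part of any particular application.

    #include <stdio.h>
    #include <pvm3.h>

    #define NWORKERS 4                   /* illustrative worker count */

    int main(void)
    {
        int mytid  = pvm_mytid();        /* enrol this task in PVM       */
        int parent = pvm_parent();       /* PvmNoParent => we are master */

        if (parent == PvmNoParent) {
            int tids[NWORKERS], result;

            /* Spawn NWORKERS copies of this program; PVM chooses hosts. */
            pvm_spawn("pvm_demo", NULL, PvmTaskDefault, "", NWORKERS, tids);

            /* Collect one integer from each worker (any sender, tag 1). */
            for (int i = 0; i < NWORKERS; i++) {
                pvm_recv(-1, 1);
                pvm_upkint(&result, 1, 1);
                printf("worker result: %d\n", result);
            }
        } else {
            /* Worker: do some (trivial) work and send it to the master. */
            int result = mytid;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&result, 1, 1);
            pvm_send(parent, 1);
        }

        pvm_exit();                      /* leave the virtual machine */
        return 0;
    }

Assuming the compiled binary is installed where the PVM daemons can find it, the master and its workers may end up on different nodes of the cluster, with PVM hiding which node runs which task.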
Just a few months ago, the School of Computer Science at ADFA deployed its own Beowulf cluster. This cluster is composed of 32 dual-processor computers, each with 256 Megabytes of main memory, two 650MHz Pentium III processors and some local disk space. All machines are set up to run either standard RedHat Linux 6.2 or FreeBSD 4.0 as their operating system; the default is RedHat Linux. These machines are connected together over a dedicated 100Mbps LAN, using a Cisco Catalyst 3548 48-port Ethernet switch.
The school conducts research in a number of areas which are highly computation intensive: agent systems, data mining, evolutionary computation, machine learning, optimisation, speech processing and visualisation. We thus have a need for large amounts of computation power. In recent times, our existing research computers have been virtually 100% utilised. Depending on the application, the Beowulf cluster should have a raw computation power of 5 to 10 times the existing systems in the school. So our cluster will provide a welcome, and highly cost-effective, increment to the available resources. One of its prime advantages is extensibility: the cluster can readily be expanded in one- or two-CPU increments as required.
The cluster will also be used to conduct research on clustering algorithms. Much existing research in this area focuses on the fine granularity, low latency parallelism typical of scientific computing. The school's computing requirements are highly heterogeneous in granularity, so the research will focus on scheduling for such heterogeneous environments.
Single-processor supercomputers are specially designed to achieve good performance on certain classes of problems, but they are extremely expensive. The low cost of commodity hardware and the rise of Open Source software have allowed an alternative to the supercomputer to be created: the Beowulf cluster. Like the supercomputer, this technology provides great computing performance for a certain class of problems, but at a fraction of the cost.
The School of Computer Science has recently built its own Beowulf cluster in response to the lack of suitable uniprocessor resources to meet our current research computing demands. Fortunately, the problems we are trying to solve are well suited to such a clustering system.
As we approach the end of 2000, the cluster is deployed and working: it provides 58 of its 64 CPUs, with the PVM programming environment running on top of RedHat Linux 6.2. The school now needs to do several things. We must begin to write research applications, or modify existing uniprocessor ones, to run on the cluster and exploit its raw power. We also need to begin the task of tuning the cluster, to find the bottlenecks in communication, process scheduling and process synchronisation, so that its overall performance is near optimal.