I’d like to challenge the “Big Iron” hypothesis, having worked with IBM/370 systems
early on: DOS/VS, VM/CMS and some OS/MVS.
The system design and standard tools forced considerable complexity & waste in CPU
time & storage compared to the Unix I’d used at UNSW.
Probably the harshest criticism is the lack of O/S & tool development forced by IBM’s
“backwards compatibility” model - at least while I had to battle it.
[ Ken Robinson at UNSW had used OS/360 since ~1965. In 1975 he warned me about a
pernicious batch job error message, ]
[ “No space” - except it didn’t say on _which_ ‘DD’ (data definition == file). The O/S
_knew_ exactly what was wrong, but didn’t say. ]
[ I hit this problem at work ~1985, costing me a week or two of time, plus considerable
‘chargeback’ expenses for wasted CPU & disk usage. ]
[ The problem was a trivial one if I’d had Unix pipelines available. ]
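[ For contrast, the whole scan / sort / print job Paul describes below collapses to a
pipeline under Unix, with no DD statements and no scratch datasets to pre-allocate or
mis-size. Purely illustrative - the file name and field positions are invented: ]

    awk -F'|' '$3 > 0' customers.dat |   # scan: keep customers who owe money (field 3)
        sort -t'|' -k3,3nr |             # sort: largest amount owed first
        pr -h 'Customers owing money'    # print: paginate with a heading
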
Just because mainframes are still used for the majority of business-critical online
“transaction” systems doesn’t mean they are great, or even good, solutions.
It only means the “cost of exit” is more than the owners wish to pay: it’s cheaper to
keep old designs running than to change.
Achieving the perceived ‘high performance’ of mainframes required considerable SysProg,
programmer/analyst & Operations work and time.
Simple things such as the optimum ‘block size’ for a particular disk drive caused months
of work for our operations team when we changed drives
(2314 removable packs to 3350 sealed HDAs).
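[ A back-of-envelope sketch of why reblocking mattered - the track capacities are
approximate, and it ignores the device-specific inter-block gap and key-field overheads
the real calculations had to include: ]

    awk 'BEGIN {
        lrecl = 80                    # card-image logical record length
        capacity["2314"] =  7294      # approx. usable bytes per track
        capacity["3350"] = 19069
        for (d in capacity) {
            blksize = int(capacity[d] / 2 / lrecl) * lrecl   # ~half-track blocking
            printf "%s: BLKSIZE %5d, %3d records/track\n", d, blksize, 2 * blksize / lrecl
        }
    }'
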
Andrew Hume’s “Project Gecko” is worth reading for those who don’t know it.
I’m sure that if Andrew & team had tried to build a similar system a decade before,
they’d have figured out a way to stream data between tape drives -
the initial use-case discussed below for SyncSort.
Andrew used the standard Unix tools, a small amount of C, flat files and intelligent
‘stream processing’ from one disk to another, then back,
to push a Sun system to its limits, and handsomely beat Oracle.
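[ The shape of that approach, very roughly - a sketch only, with invented file names,
delimiter and field numbers, not Gecko’s actual scripts: ]

    # stream a sorted flat-file partition off one set of spindles, apply the
    # cycle's updates (assumed pre-sorted on the same key), and write the
    # result onto another set of spindles
    sort -t'|' -k1,1 -T /disk2/tmp /disk1/store/part.0001 |
        join -t'|' - /disk1/updates/part.0001 |
        awk -F'|' '$5 != "CLOSED"' > /disk2/store.new/part.0001
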
We’ve already had the Knuth/McIlroy ‘literate programming’ vs ‘shell one-liner’ example
in this thread.
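(For anyone who missed it, McIlroy’s counter to Knuth’s word-frequency program was, from
memory, this six-stage pipeline, with the number of words wanted passed in as $1:)

    tr -cs A-Za-z '\n' |
        tr A-Z a-z |
        sort |
        uniq -c |
        sort -rn |
        sed ${1}q
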
It comes down to the same thing:
Unix’s philosophy is good design and “Tools to Build Tools”,
allowing everyone to Stand On the Shoulders of Giants,
not _have_ to endlessly reinvent the wheel for themselves,
which the mainframe world forces on everyone.
============
Gecko: tracking a very large billing system
Andrew Hume, Scott Daniels, Angus MacLellan
2000
<https://www.usenix.org/legacy/event/usenix2000/general/full_papers/hume/hume.pdf>
============
On 19 Jan 2025, at 02:40, Paul Winalski
<paul.winalski(a)gmail.com> wrote:
Another consideration: the smaller System/360 mainframes ran DOS (Disk Operating System)
or TOS (Tape Operating System, for shops that didn't have disks). These were both
single-process operating systems. There is no way that the Unix method of chaining
programs together could have been done.
OS MFT (Multiprogramming with a Fixed number of Tasks) and MVT (Multiprogramming with a
Variable number of Tasks) were multiprocess systems, but they lacked any interprocess
communication system (such as Unix pipes).
True databases in those days were rare, expensive, slow, and of limited capacity. The
usual way to, say, produce a list of customers who owed money, sorted by how much they
owed, would be:
[1] scan the data set for customers who owed money and write that out to tape(s)
[2] use sort/merge to sort the data on tape(s) in the desired order
[3] run a program to print the sorted data in the desired format
It is important in step [2] to keep the tapes moving. Start/stop operations waste a ton
of time. Most of the complexity of the mainframe sort/merge programs was in I/O
management to keep the devices busy to the maximum extent. The gold standard for
sort/merge in the IBM world was a third-party program called SyncSort. It cost a fortune
but was well worth it for the big shops.
So the short, bottom line answer is that the Unix way wasn't even possible on the
smaller mainframes and was too inefficient for the large ones.
-Paul W.
============
Gecko: tracking a very large billing system
Andrew Hume, Scott Daniels, Angus MacLellan
2000
<https://www.usenix.org/legacy/event/usenix2000/general/full_papers/hume/hume.pdf>
This paper describes Gecko, a system for tracking the state of every call in a very large
billing system,
which uses sorted flat files to implement a database of about 60G records occupying
2.6TB.
After a team at Research, including two interns from Consumer Billing, built a successful
prototype in 1996,
the decision was made to build a production version.
A team of six people (within Consumer Billing) started in March 1997 and the system went
live in December 1997.
The design we implemented to solve the database problem does not use conventional
database technology;
as described in [Hum99], we experimented with an Oracle-based implementation, but it was
unsatisfactory.
Instead, we used sorted flat files and relied on the speed and I/O capacity of modern
high-end Unix systems, such as large SGI and Sun systems.
The system supporting the datastore is a Sun E10000, with 32 processors and 6GB of
memory, running Solaris 2.6.
The datastore disk storage is provided by 16 A3000 (formerly RSM2000) RAID cabinets,
which provides about 3.6TB of RAID-5 disk storage.
For backup purposes, we have a StorageTek 9310 Powderhorn tape silo with 8 Redwood tape
drives.
The datastore is organised as 93 filesystems, each with 52 directories; each directory
contains a partition of the datastore…
We can characterise Gecko’s performance by two measures.
The first is how long it takes to achieve the report and cycle end gates.
The second is how fast we can scan the datastore performing an ad hoc search/extract.
Over the last 12 cycles, the report gate ranged between 6.1 and 9.9 wall clock hours,
with an average time of 7.6 hours.
The cycle end gate is reached after the updated datastore has been backed up and any
other housekeeping chores have been completed.
Over the last 12 cycles, the cycle end gate ranged between 11.1 and 15.1 wall clock
hours,
with an average time of 11.5 hours.
Both these averages comfortably beat the original requirements.
The implementation of Gecko relies heavily on a modest number of tools in the
implementation of its processing and the management of that processing.
Nearly all of these have application beyond Gecko and so we describe them here.
Most of the code is written in C and ksh; the remainder is in awk.
The Gecko scripts make extensive use of grep, and in particular, fgrep for searching for
many fixed strings in a file.
Solaris’s fgrep has an unacceptably low limit on the number of strings (we routinely
search for 5-6000 strings, and sometimes 20000 or so).
The XPG4 version has much higher limits, but runs unacceptably slowly with large lists.
We finally switched to gre, developed by Andrew Hume in 1986.
For our larger lists, it runs about 200 times faster, cutting run times from 45 minutes
down to 15 seconds or so.
============
--
Steve Jenkin, IT Systems and Design
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA
mailto:sjenkin@canb.auug.org.au
http://members.tip.net.au/~sjenkin