On Tue, Mar 10, 2020 at 2:43 PM Doug McIlroy <doug(a)cs.dartmouth.edu> wrote:
This begs
questions of stability
Astute question. I had that in my original draft, but eliminited
it for what I thought was clarity. Anyway, depending on implementation
of sort, you may need sort -s. Of course it doesn't matter which copy
among several equal lines uniq produces, nor does it matter in sort
when there are no comparison options--they're all the same.
Thanks. That's interesting.
Did `sort -s` come later? The idea that you preferred clarity over
stability for `sort -u` would indicate so, otherwise one might imagine that
`-u` would just imply `-s` and that would be that.
I don't know enough about the
internals of sed to know even what algorithm it
uses
(... a disk-based merge sort?)
sed is not a sorting program--basically it copies input to
output, making line-by-line editing changes. That's the
way I meant to use it in sed s/nonkeys//|sort -keys|uniq.
(I have added options to sort, hopefully for clarity).
The argument to sed here means substitute the empty
string for the nonkey fields (specified by a regular expression).
`sed` in my email was a typo, as you speculated below.
Interestingly, this `sed` construction prior to `sort` loses information,
which perhaps doesn't matter in any given specific case, but is
insufficient in general, which I gathered to be the entire reason you
implemented `sort -u`.
If "sed" was a typo for "sort",
It was.
all versions of sort that
I know of use an internal sorting algorithm for big
chunks
of the file, then combines the chunks by merge. But internal
sorting varies all over the map--variations on quicksort,
radix sort, merge sort, ...
It's the details of the internal sorts that are most interesting in some
sense, as the merges are probably fairly straight forward but the internal
sorts will affect stability and have other interesting characteristics.
As an aside, one must imagine that, in this day and age, a "big chunk" is
probably big enough to hold the vast majority of files entirely in RAM, and
only exceptionally large files actually require merging multiple blocks.
- Dan C.