On Tue, Mar 10, 2020 at 12:16 PM Doug McIlroy <doug(a)cs.dartmouth.edu> wrote:
The idea of a
simple rule is great, but the suggested rule fails on sort
-u
which afaik came after sort | uniq for
performance reasons.
As the guilty party for most of sort's comparison options, I can
attest that efficiency was not an objective of -u. It was invented
precisely because uniq had proved useful, but not when one was
interested in uniqueness only of some key aspect of the data.
-u differs from uniq in that -u selects samples based on
equality of keys, not equality of lines. In the default
case of whole-line keys, sort -u of course does exactly
what sort|uniq does.
For many applications of -u with keys, the non-key fields
are not of interest. Then sed s/nonkeys//|sort|uniq may
suffice. But sed did not exist when -u was invented.
And not all sort key specs are easily imitated in sed.
This begs questions of stability: in the event of non-unique keys and
non-key fields in the sortable data, which "records" (lines) are kept and
which are discarded? Surely the "first" is kept and subsequent entries with
the same key suppressed, but I confess I don't know enough about the
internals of sed to know even what algorithm it uses (I assume a disk-based
merge sort?), but I would imagine these details have changed over time.
- Dan C.