Re: [TUHS] Command line options and complexity

10 Mar 2020

On Tue, Mar 10, 2020 at 12:16 PM Doug McIlroy <doug(a)cs.dartmouth.edu> wrote:
...
   The idea of a simple rule is great, but the suggested rule fails on
   sort -u, which afaik came after sort | uniq for performance reasons.
 As the guilty party for most of sort's comparison options, I can
 attest that efficiency was not an objective of -u. It was invented
 precisely because uniq had proved useful, but not when one was
 interested in uniqueness only of some key aspect of the data.
 -u differs from uniq in that -u selects samples based on
 equality of keys, not equality of lines. In the default
 case of whole-line keys, sort -u of course does exactly
 what sort|uniq does.
 For many applications of -u with keys, the non-key fields
 are not of interest. Then sed s/nonkeys//|sort|uniq may
 suffice. But sed did not exist when -u was invented.
 And not all sort key specs are easily imitated in sed.
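
 To make the distinction concrete, here is a small Python model of the
 behaviours described above. It is only a sketch: the function names
 (sort_uniq, sort_u, first_field) and the sample records are made up for
 illustration, the key is assumed to be the first whitespace-separated
 field, and keeping the first line seen for each key is an assumption,
 not a statement about what any real sort does.

```python
# Hypothetical model of the behaviours discussed above.
# This is NOT how sort(1) or uniq(1) are implemented.

def sort_uniq(lines):
    """Model of `sort | uniq`: uniqueness is judged on whole lines."""
    return sorted(set(lines))

def sort_u(lines, key):
    """Model of `sort -u` with a key: one line survives per distinct key."""
    kept = {}
    for line in lines:
        # Assumption: the first line seen for a key is the one kept.
        kept.setdefault(key(line), line)
    return [kept[k] for k in sorted(kept)]

def first_field(line):
    """Illustrative key: the first whitespace-separated field."""
    return line.split()[0]

records = ["bob 3", "alice 1", "bob 7", "alice 1"]

by_line = sort_uniq(records)                  # whole-line uniqueness
by_key = sort_u(records, first_field)         # key uniqueness
whole_line_key = sort_u(records, lambda l: l) # default case: key is the line

# Model of `sed s/nonkeys// | sort | uniq`: strip non-key fields first.
keys_only = sort_uniq([first_field(r) for r in records])
```

 In this model by_line keeps both "bob" records because the lines differ,
 by_key keeps only one of them, whole_line_key matches by_line, and
 keys_only yields just the bare keys with the non-key fields gone.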
 
This begs questions of stability: in the event of non-unique keys and
non-key fields in the sortable data, which "records" (lines) are kept and
which are discarded? Surely the "first" is kept and subsequent entries with
the same key suppressed, but I confess I don't know enough about the
internals of sort to know even what algorithm it uses (I assume a disk-based
merge sort?), and I would imagine these details have changed over time.
        - Dan C.
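
The stability question can be illustrated with a short Python sketch. It
relies on the fact that Python's sorted() is stable, so lines with equal
keys stay in input order; whether any particular sort implementation
behaves this way under -u is not settled by this thread. The function
name unique_by_key, its keep parameter, and the sample data are
hypothetical.

```python
from itertools import groupby

def unique_by_key(lines, key, keep="first"):
    """Sort by key, then keep one line from each run of equal keys."""
    ordered = sorted(lines, key=key)  # stable: ties stay in input order
    out = []
    for _, group in groupby(ordered, key=key):
        run = list(group)
        # Which member of the run survives is a policy choice.
        out.append(run[0] if keep == "first" else run[-1])
    return out

def first_field(line):
    """Illustrative key: the first whitespace-separated field."""
    return line.split()[0]

data = ["k2 late", "k1 x", "k2 early", "k1 y"]

keep_first = unique_by_key(data, first_field, keep="first")
keep_last = unique_by_key(data, first_field, keep="last")
```

With a stable sort and a keep-first policy, the surviving record for each
key is the one that appeared earliest in the input ("k1 x" and "k2 late"
here); a keep-last policy, or an unstable sort, would give different
non-key fields for the same keys.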
