Re: [TUHS] Discuss of style and design of computer programs from a user stand point

7 May 2017

On Sat, May 6, 2017 at 7:42 PM, Noel Hunt <noel.hunt(a)gmail.com> wrote:
...
 I was about to suggest using the Plan9 port utilities of the same
 name but it seems 'uniq' is not coded to handle Runes (aka utf-8). I
 don't imagine it would be hard to re-write it to handle utf-8.
I guess I should have been clearer on what wouldn't work. It can't
possibly work for Japanese and Chinese, where words aren't separated by
whitespace. It would cause problems in hybrid languages where words can
be composed of logograms and phonograms (say Japanese, which often uses a
few Kanji with hiragana endings that then run into hiragana particles
or other grammar elements). It can't work without modification (using
class names) for Cyrillic because there's no A or Z in words there.
It won't work in any language that has a discontiguous set of letters,
which includes many Western European languages, since all the accented
or otherwise decorated letters aren't in the range A-Z.
So whether or not the underlying tools can handle UTF-8 encoding,
there are problems with the original.
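
A small sketch of the A-Z problem described above, for illustration only.
The function name split_ascii is made up here, and the locale is pinned to
C as an assumption so that the range A-Za-z means exactly the 52 ASCII
letters. The input is "café naïve" written with octal escapes for the
UTF-8 bytes, so the result does not depend on the terminal encoding.

```shell
# split_ascii (hypothetical name): the first stage of the original pipeline.
# LC_ALL=C is an assumption made here to get byte-oriented, repeatable behaviour.
split_ascii() {
    LC_ALL=C tr -cs A-Za-z '\n'
}

# "café naïve": \303\251 is é and \303\257 is ï in UTF-8.
# Every byte outside A-Za-z becomes a newline, so the accented
# letters act as word separators and the words are cut apart:
#   caf
#   na
#   ve
printf 'caf\303\251 na\303\257ve\n' | split_ascii
```

Neither "café" nor "naïve" survives as a word, so the counts further down
the pipeline would be for fragments.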
If you used:
 tr -cs "[:alpha:]" '\n' | tr "[:upper:]" "[:lower:]" | sort | uniq -c | sort -rn | sed ${1}q
you'd still have issues with languages that don't use word separators,
or write non-alphabetically.
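
For reference, the class-based pipeline can be wrapped as a shell function.
This is only a sketch: the name wordfreq is hypothetical, and the sample
text is plain ASCII so the result does not depend on the locale. Whether
[:alpha:] matches non-ASCII letters at all depends on the current locale
and on the tr implementation in use, so this does not by itself solve the
problems listed above.

```shell
# wordfreq N (hypothetical name): print the N most frequent words on stdin.
wordfreq() {
    tr -cs '[:alpha:]' '\n' |     # every run of non-letters becomes one newline
    tr '[:upper:]' '[:lower:]' |  # fold case so "The" and "the" count together
    sort |                        # group identical words on adjacent lines
    uniq -c |                     # collapse each group to "count word"
    sort -rn |                    # highest count first
    sed "${1}q"                   # quit after N lines
}

# "the" occurs three times and "and" twice; the other words occur once.
# awk normalises the count padding, which differs between uniq versions.
printf 'The cat and the hat and THE bat\n' | wordfreq 2 | awk '{print $1, $2}'
```

The first stage does the word splitting, so that is the stage that would
need replacing for text without word separators.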
Warner
...
  On Sun, May 7, 2017 at 11:15 AM, Warner Losh <imp(a)bsdimp.com> wrote:

 On Sat, May 6, 2017 at 1:50 PM, Bakul Shah <bakul(a)bitblocks.com> wrote:
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
 The cool thing about this thread is that I learned two things: what tr
 -s does, and what Nq does for sed...
 Sadly, this doesn't work so well for text that isn't 7-bit ASCII English...
 Warner 
