Tomasz Rola writes on Thu, 19 Mar 2020 21:01:20 +0100 about awk:
> One task I would be afraid to use awk for, is html processing. Most
> of html sources I look at nowadays seems discouraging. Extracting
> anything of value from the mess requires something more potent, I
> think.
If you want to tackle raw HTML from arbitrary sources, then I agree with
you: most HTML on the Web is not grammar-conformant, there are
numerous vendor extensions, and the HTML is hideously idiosyncratic
and irregularly formatted.
The solution that I adopted 25 years ago was to write a
grammar-recognizing, but violation-lenient, prettyprinter for HTML. It has
served well and I use it many times daily for my work in the BibNet
Project and TeX User Group bibliography archives, now approaching 1.55
million entries. The latest public release is available here:
http://www.math.utah.edu/pub/sgml/
I notice that the last version there is 1.01; I'll get that updated in
a couple of days to the latest 1.03 [subject to delays from the major
work dislocations caused by the virus]. The code should install anywhere
in the Unix family without problems: I build and validate it on more
than 300 O/Ses in our test farm.
With standardized HTML, applying awk is easy: I have more than 450
awk programs, comprising 380,000 lines of code, that process publisher
metadata to produce rough BibTeX entries that numerous other tools,
and some manual editing, turn into clean data for free access on the
Web.
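The details differ from publisher to publisher, but the flavor is easy
to show. Here is a minimal sketch, assuming that the prettyprinter has
already put one tag per line and that the publisher page carries the
common citation_* <meta> tags; the script and helper names are invented,
and this is an illustration only, not one of my production programs:

    # rough-bib.awk: hypothetical sketch of HTML-metadata-to-BibTeX conversion.
    # Assumes one tag per line, e.g.
    #   <meta name="citation_title" content="...">
    /<meta name="citation_author"/           { authors = authors sep getval($0); sep = " and " }
    /<meta name="citation_title"/            { title   = getval($0) }
    /<meta name="citation_journal_title"/    { journal = getval($0) }
    /<meta name="citation_volume"/           { volume  = getval($0) }
    /<meta name="citation_firstpage"/        { pages   = getval($0) }
    /<meta name="citation_publication_date"/ { year    = substr(getval($0), 1, 4) }
    END {
        printf "@Article{temporary-key,\n"
        printf "  author =  \"%s\",\n", authors
        printf "  title =   \"%s\",\n", title
        printf "  journal = \"%s\",\n", journal
        printf "  volume =  \"%s\",\n", volume
        printf "  pages =   \"%s\",\n", pages
        printf "  year =    \"%s\",\n", year
        printf "}\n"
    }
    # Extract the content="..." attribute value from a one-line tag.
    function getval(s) {
        sub(/.*content="/, "", s)
        sub(/".*/, "", s)
        return s
    }

A run is simply awk -f rough-bib.awk issue-page.html >> rough.bib,
after which the other tools and the manual cleanup take over.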
For some journals, I run a single command of fewer than 15 characters
to download the Web pages for journal issues for which I do not yet
have data, and then a single journal-specific command with no arguments
that runs a large shell script with a long pipeline. That pipeline
outputs relatively clean BibTeX, which normally takes me only a couple
of minutes to validate visually in an editor session. The major work
there is bracing of proper nouns in titles that my software did not
already handle, thereby preventing downcasing of those words by the
many bibliography styles that do so.
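Part of that bracing can be mechanized with a list of protected words.
The following is only a rough sketch of the idea (the script name and
word list are invented, and the \< and \> word anchors are a GNU awk
extension), not my actual tool:

    # brace-nouns.awk: hypothetical sketch of proper-noun protection in titles.
    # Usage: gawk -v words="Chebyshev Fourier Hilbert" -f brace-nouns.awk rough.bib
    BEGIN { n = split(words, protect, " ") }
    /^ *title *=/ {
        # Wrap each protected word in braces (assumes it is not already braced).
        for (i = 1; i <= n; i++)
            gsub("\\<" protect[i] "\\>", "{" protect[i] "}")
    }
    { print }

Whatever the word list misses is exactly the residue that the couple of
minutes of visual validation catches.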
I'm on journal announcement lists for many publishers, so I often have
new data released to the Web just 5 to 10 minutes after receiving
e-mail about new issues.
The above-mentioned archives are at
http://www.math.utah.edu/pub/bibnet
http://www.math.utah.edu/pub/tex/bib
http://www.math.utah.edu/pub/tex/bib/index-table.html
http://www.math.utah.edu/pub/tex/bib/idx
http://www.math.utah.edu/pub/tex/bib/toc
They are mirrored at Universität Karlsruhe, Oak Ridge National
Laboratory, Sandia National Laboratory, and elsewhere.
Like Al Aho, Doug McIlroy, and Arnold Robbins, I'm a huge fan of awk;
I believe that I was the first to port it to PDP-10 TOPS-20 and VAX
VMS in the mid-1980s, and it is one of the first mandatory tools that
I install on any new computer.
-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe@math.utah.edu  -
- 155 S 1400 E RM 233                       beebe@acm.org  beebe@computer.org -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------