On Thu, Mar 19, 2020 at 02:57:59PM -0600, Nelson H. F. Beebe wrote:
[...]
If you want to tackle raw HTML from abitrary source, then I agree with
you: most HTML on the Web is not grammar conformant, there are
numerous vendor extensions, and the HTML is hideously idiosynchratic
and irregularly formatted.
The solution that I adopted 25 years ago was to write a grammar
recognizing, but violation lenient, prettyprinter for HTML. It has
served well and I use it many times daily for my work in the BibNet
Project and TeX User Group bibliography archives, now approaching 1.55
million entries. The latest public release is available here:
http://www.math.utah.edu/pub/sgml/
Thank you, I will have a longer look at those archives. My plan so far
was to explore html files with CL and Slime (interactive mode for CL
inside Emacs), which would allow me to actually find out what I want
to be looking for - well, hopefully :-).
--
Regards,
Tomasz Rola
--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:tomasz_rola@bigfoot.com **