Hi Dan,
If you want to understand:
- the maths of regular expressions,
- the syntax of regexps, which these days expresses more than REs,
- the regexp engines in programs, the differences in how they work
  and what they match, and
- how to efficiently steer an engine's internals,
then I recommend Jeffrey Friedl's Mastering Regular Expressions.
http://regex.info/book.html
I'm afraid I must sound a note of caution about Friedl's book. Russ
Cox alludes to some of the problems in the "History and References"
section of his page (https://swtch.com/~rsc/regexp/regexp1.html) that
was linked earlier.
Russ says:
1 ‘Finally, any discussion of regular expressions would be incomplete
without mentioning Jeffrey Friedl's book Mastering Regular Expressions,
perhaps the most popular reference among today's programmers.
2 Friedl's book teaches programmers how best to use today's regular
expression implementations, but not how best to implement them.
3 What little text it devotes to implementation issues perpetuates the
widespread belief that recursive backtracking is the only way to
simulate an NFA.
4 Friedl makes it clear that he [neither understands nor respects] the
underlying theory.’
http://regex.info/blog/2006-09-15/248
I think Grant is after what Russ addresses in sentence 2. :-)
The impression is that Friedl shows wonderfully how to _use_ regular
expressions, but does not understand the theory behind their
implementation.
Yes, Friedl does show that wonderfully. From long-ago memory, Friedl
understands enough to have diagrams of NFAs and DFAs clocking through
their inputs, showing the differences in number of states, etc.
Yes, Friedl says an NFA must recursively backtrack. As Russ says in #3,
it was a ‘widespread belief’. Friedl didn't originate it; I ‘knew’ it
before reading his book. Friedl was at the sharp end of regexps,
needing to process large amounts of text, at Yahoo! IIRC. He
investigated how the programs available behaved; he didn't start at the
theory and come up with a new program best suited to his needs.
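An aside on what Russ is refuting: simulating an NFA doesn't need
recursion or backtracking at all.  Carry a set of ‘live’ states and
advance the whole set one input character at a time; the work per
character is bounded by the number of states, so the match runs in time
linear in the input.  The toy below is entirely mine, not code from
Russ or Friedl: the NFA is hand-built for (a|b)*abb (IIRC the Dragon
Book's running example), epsilon transitions are left out, and it tests
whether the whole string matches rather than searching inside it.

    # Sketch: track the set of live NFA states, advancing them in lockstep.
    def run_nfa(transitions, start, accepting, text):
        """Return True if the NFA accepts the whole of `text`.

        `transitions` maps (state, character) -> set of next states.
        Epsilon transitions are omitted to keep the sketch short; the
        usual extension is an epsilon-closure step after each move.
        """
        current = {start}
        for ch in text:
            # Advance every live state at once; no recursion, no retrying.
            current = {nxt
                       for state in current
                       for nxt in transitions.get((state, ch), set())}
            if not current:
                return False        # every thread died, so fail early
        return bool(current & accepting)

    # Hand-built NFA for (a|b)*abb: state 0 loops on a and b, and on an
    # 'a' also guesses that the final "abb" has started; state 3 accepts.
    nfa = {
        (0, 'a'): {0, 1},
        (0, 'b'): {0},
        (1, 'b'): {2},
        (2, 'b'): {3},
    }

    print(run_nfa(nfa, start=0, accepting={3}, text="abaabb"))  # True
    print(run_nfa(nfa, start=0, accepting={3}, text="abab"))    # False

Python only because it's short; Russ's articles do the same job
properly in C.  A DFA is effectively this with the state sets computed
ahead of time, which is why egrep-style matchers don't blow up the way
backtracking ones can.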
Personally, I'd stick with Russ's stuff, especially as `egrep` is the
target here.
Russ's stuff is great. He refuted that widespread belief, for one
thing. But Russ isn't trying to teach a programmer how to best use the
regexp engine in sed, grep, egrep, Perl, PCRE, ... whereas Friedl takes
the many pages needed to do this.
It depends what one wants to learn first.
As Friedl says in the post Russ linked to:
‘As a user, you don't care if it's regular, nonregular, unregular,
irregular, or incontinent. So long as you know what you can expect
from it (something this chapter will show you), you know all you need
to care about.
‘For those wishing to learn more about the theory of regular expressions,
the classic computer-science text is chapter 3 of Aho, Sethi, and
Ullman's Compilers — Principles, Techniques, and Tools (Addison-Wesley,
1986), commonly called “The Dragon Book” due to the cover design.
More specifically, this is the “red dragon”. The “green dragon”
is its predecessor, Aho and Ullman's Principles of Compiler Design.’
In addition to the Dragon Book, Hopcroft and Ullman's ‘Introduction to
Automata Theory, Languages, and Computation’ goes further into the
subject. Chapter two has DFAs, NFAs, and epsilon transitions, and uses
searching text as an example. Chapter three is regular expressions,
four is regular languages. Pushdown automata are covered in chapter
six.
Too many books, not enough time to read. :-)
--
Cheers, Ralph.