Introduction to Library Functions PCREMATCHING(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE MATCHING ALGORITHMS
This document describes the two different algorithms that
are available in PCRE for matching a compiled regular
expression against a given subject string. The "standard"
algorithm is the one provided by the pcreexec() function.
This works in the same was as Perl's matching function, and
provides a Perl-compatible matching operation.
An alternative algorithm is provided by the pcredfaexec()
function; this operates in a different way, and is not
Perl-compatible. It has advantages and disadvantages com-
pared with the standard algorithm, and these are described
below.
When there is only one possible way in which a given subject
string can match a pattern, the two algorithms give the same
answer. A difference arises, however, when there are multi-
ple possibilities. For example, if the pattern
^<.*>
is matched against the string
there are three possible answers. The standard algorithm
finds only one of them, whereas the alternative algorithm
finds all three.
REGULAR EXPRESIONS AS TRES
The set of strings that are matched by a regular expression
can be represented as a tree structure. An unlimited repeti-
tion in the pattern makes the tree of infinite size, but it
is still a tree. Matching the pattern to a given subject
string (from a given starting point) can be thought of as a
search of the tree. There are two ways to search a tree:
depth-first and breadth-first, and these correspond to the
two matching algorithms provided by PCRE.
THE STANDARD MATCHING ALGORITHM
In the terminology of Jeffrey Friedl's book "Mastering Regu-
lar Expressions", the standard algorithm is an "NFA algo-
rithm". It conducts a depth-first search of the pattern
tree. That is, it proceeds along a single path through the
tree, checking that the subject matches what is required.
When there is a mismatch, the algorithm tries any
SunOS 5.10 Last change: 1
Introduction to Library Functions PCREMATCHING(3)
alternatives at the current point, and if they all fail, it
backs up to the previous branch point in the tree, and tries
the next alternative branch at that level. This often
involves backing up (moving to the left) in the subject
string as well. The order in which repetition branches are
tried is controlled by the greedy or ungreedy nature of the
quantifier.
If a leaf node is reached, a matching string has been found,
and at that point the algorithm stops. Thus, if there is
more than one possible match, this algorithm returns the
first one that it finds. Whether this is the shortest, the
longest, or some intermediate length depends on the way the
greedy and ungreedy repetition quantifiers are specified in
the pattern.
Because it ends up with a single path through the tree, it
is relatively straightforward for this algorithm to keep
track of the substrings that are matched by portions of the
pattern in parentheses. This provides support for capturing
parentheses and back references.
THE ALTERNATIVE MATCHING ALGORITHM
This algorithm conducts a breadth-first search of the tree.
Starting from the first matching point in the subject, it
scans the subject string from left to right, once, character
by character, and as it does this, it remembers all the
paths through the tree that represent valid matches. In
Friedl's terminology, this is a kind of "DFA algorithm",
though it is not implemented as a traditional finite state
machine (it keeps multiple states active simultaneously).
The scan continues until either the end of the subject is
reached, or there are no more unterminated paths. At this
point, terminated paths represent the different matching
possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algo-
rithm finds all of them, and in particular, it finds the
longest. In PCRE, there is an option to stop the algorithm
after the first match (which is necessarily the shortest)
has been found.
Note that all the matches that are found start at the same
point in the subject. If the pattern
cat(er(pillar)?)
is matched against the string "the caterpillar catchment",
the result will be the three strings "cat", "cater", and
"caterpillar" that start at the fourth character of the sub-
ject. The algorithm does not automatically move on to find
SunOS 5.10 Last change: 2
Introduction to Library Functions PCREMATCHING(3)
matches that start at later positions.
There are a number of features of PCRE regular expressions
that are not supported by the alternative matching algo-
rithm. They are as follows:
1. Because the algorithm finds all possible matches, the
greedy or ungreedy nature of repetition quantifiers is not
relevant. Greedy and ungreedy quantifiers are treated in
exactly the same way. However, possessive quantifiers can
make a difference when what follows could also match what is
quantified, for example in a pattern like this:
^a]\w!
This pattern matches "aaab!" but not "aaa!", which would be
matched by a non-possessive quantifier. Similarly, if an
atomic group is present, it is matched as if it were a stan-
dalone pattern at the current point, and the longest match
is then "locked in" for the rest of the overall pattern.
2. When dealing with multiple paths through the tree simul-
taneously, it is not straightforward to keep track of cap-
tured substrings for the different matching possibilities,
and PCRE's implementation of this algorithm does not attempt
to do this. This means that no captured substrings are
available.
3. Because no substrings are captured, back references
within the pattern are not supported, and cause errors if
encountered.
4. For the same reason, conditional expressions that use a
backreference as the condition or test for a specific group
recursion are not supported.
5. Because many paths through the tree may be active, the \K
escape sequence, which resets the start of the match when
encountered (but may be on some paths and not on others), is
not supported. It causes an error if encountered.
6. Callouts are supported, but the value of the capturetop
field is always 1, and the value of the capturelast field
is always -1.
7. The \C escape sequence, which (in the standard algorithm)
matches a single byte, even in UTF-8 mode, is not supported
because the alternative algorithm moves through the subject
string one character at a time, for all active paths through
the tree.
SunOS 5.10 Last change: 3
Introduction to Library Functions PCREMATCHING(3)
8. Except for (*FAIL), the backtracking control verbs such
as (*PRUNE) are not supported. (*FAIL) is supported, and
behaves like a failing negative assertion.
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
Using the alternative matching algorithm provides the fol-
lowing advantages:
1. All possible matches (at a single point in the subject)
are automatically found, and in particular, the longest
match is found. To find more than one match using the stan-
dard algorithm, you have to do kludgy things with callouts.
2. There is much better support for partial matching. The
restrictions on the content of the pattern that apply when
using the standard algorithm for partial matching do not
apply to the alternative algorithm. For non-anchored pat-
terns, the starting position of a partial match is avail-
able.
3. Because the alternative algorithm scans the subject
string just once, and never needs to backtrack, it is possi-
ble to pass very long subject strings to the matching func-
tion in several pieces, checking for partial matching each
time.
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
The alternative algorithm suffers from a number of disadvan-
tages:
1. It is substantially slower than the standard algorithm.
This is partly because it has to search for all possible
matches, but is also because it is less susceptible to
optimization.
2. Capturing parentheses and back references are not sup-
ported.
3. Although atomic groups are supported, their use does not
provide the performance advantage that it does for the stan-
dard algorithm.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
SunOS 5.10 Last change: 4
Introduction to Library Functions PCREMATCHING(3)
Last updated: 19 April 2008
Copyright (c) 1997-2008 University of Cambridge.
ATRIBUTES
See attributes(5) for descriptions of the following attri-
butes:
ATRIBUTE TYPE ATRIBUTE VALUE
Availability SUNWpcre
Interface Stability Uncommitted
NOTES
Source for PCRE is available on http:/opensolaris.org.
SunOS 5.10 Last change: 5
|