Introduction to Library Functions PCREPERFORM(3)
NAME
PCRE - Perl-compatible regular expressions
PCRE PERFORMANCE
Two aspects of performance are discussed below: memory usage
and processing time. The way you express your pattern as a
regular expression can affect both of them.
MEMORY USAGE
Patterns are compiled by PCRE into a reasonably efficient
byte code, so that most simple patterns do not use much
memory. However, there is one case where memory usage can be
unexpectedly large. When a parenthesized subpattern has a
quantifier with a minimum greater than 1 and/or a limited
maximum, the whole subpattern is repeated in the compiled
code. For example, the pattern
(abcdef){2,4}
is compiled as if it were
(abcdef)(abcdef)((abcdef)(abcdef)?)?
(Technical aside: It is done this way so that backtrack
points within each of the repetitions can be independently
maintained.)
For regular expressions whose quantifiers use only small
numbers, this is not usually a problem. However, if the
numbers are large, and particularly if such repetitions are
nested, the memory usage can become an embarrassment. For
example, the very simple pattern
((ab){1,1000}c){1,3}
uses 51K bytes when compiled. When PCRE is compiled with its
default internal pointer size of two bytes, the size limit
on a compiled pattern is 64K, and this is reached with the
above pattern if the outer repetition is increased from 3 to
4. PCRE can be compiled to use larger internal pointers and
thus handle larger compiled patterns, but it is better to
try to rewrite your pattern to use less memory if you can.
One way of reducing the memory usage for such patterns is to
make use of PCRE's "subroutine" facility. Re-writing the
above pattern as
((ab)(?2){0,999}c)(?1){0,2}
reduces the memory requirements to 18K, and indeed it
SunOS 5.10 Last change: 1
Introduction to Library Functions PCREPERFORM(3)
remains under 20K even with the outer repetition increased
to 100. However, this pattern is not exactly equivalent,
because the "subroutine" calls are treated as atomic groups
into which there can be no backtracking if there is a subse-
quent matching failure. Therefore, PCRE cannot do this kind
of rewriting automatically. Furthermore, there is a notice-
able loss of speed when executing the modified pattern.
Nevertheless, if the atomic grouping is not a problem and
the loss of speed is acceptable, this kind of rewriting will
allow you to process patterns that PCRE cannot otherwise
handle.
PROCESING TIME
Certain items in regular expression patterns are processed
more efficiently than others. It is more efficient to use a
character class like [aeiou] than a set of single-character
alternatives such as (aeiou). In general, the simplest
construction that provides the required behaviour is usually
the most efficient. Jeffrey Friedl's book contains a lot of
useful general discussion about optimizing regular expres-
sions for efficient performance. This document contains a
few observations about PCRE.
Using Unicode character properties (the \p, \P, and \X
escapes) is slow, because PCRE has to scan a structure that
contains data for over fifteen thousand characters whenever
it needs a character's property. If you can find an alterna-
tive pattern that does not use character properties, it will
probably be faster.
When a pattern begins with .* not in parentheses, or in
parentheses that are not the subject of a backreference, and
the PCREDOTAL option is set, the pattern is implicitly
anchored by PCRE, since it can match only at the start of a
subject string. However, if PCREDOTAL is not set, PCRE
cannot make this optimization, because the . metacharacter
does not then match a newline, and if the subject string
contains newlines, the pattern may match from the character
immediately following one of them instead of from the very
start. For example, the pattern
.*second
matches the subject "first\nand second" (where \n stands for
a newline character), with the match starting at the seventh
character. In order to do this, PCRE has to retry the match
starting after every newline in the subject.
If you are using such a pattern with subject strings that do
not contain newlines, the best performance is obtained by
setting PCREDOTAL, or starting the pattern with ^.* or
SunOS 5.10 Last change: 2
Introduction to Library Functions PCREPERFORM(3)
^.*? to indicate explicit anchoring. That saves PCRE from
having to scan along the subject looking for a newline to
restart at.
Beware of patterns that contain nested indefinite repeats.
These can take a long time to run when applied to a string
that does not match. Consider the pattern fragment
^(a])*
This can match "aaaa" in 16 different ways, and this number
increases very rapidly as the string gets longer. (The *
repeat can match 0, 1, 2, 3, or 4 times, and for each of
those cases other than 0 or 4, the ] repeats can match dif-
ferent numbers of times.) When the remainder of the pattern
is such that the entire match is going to fail, PCRE has in
principle to try every possible variation, and this can take
an extremely long time, even for relatively short strings.
An optimization catches some of the more simple cases such
as
(a])*b
where a literal character follows. Before embarking on the
standard matching procedure, PCRE checks that there is a "b"
later in the subject string, and if there is not, it fails
the match immediately. However, when there is no following
literal this optimization cannot be used. You can see the
difference by comparing the behaviour of
(a])*\d
with the pattern above. The former gives a failure almost
instantly when applied to a whole line of "a" characters,
whereas the latter takes an appreciable time with strings
longer than about 20 characters.
In many cases, the solution to this kind of performance
issue is to use an atomic group or a possessive quantifier.
AUTHOR
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
REVISION
Last updated: 06 March 2007
Copyright (c) 1997-2007 University of Cambridge.
SunOS 5.10 Last change: 3
Introduction to Library Functions PCREPERFORM(3)
ATRIBUTES
See attributes(5) for descriptions of the following attri-
butes:
ATRIBUTE TYPE ATRIBUTE VALUE
Availability SUNWpcre
Interface Stability Uncommitted
NOTES
Source for PCRE is available on http:/opensolaris.org.
SunOS 5.10 Last change: 4
|