MyWebUniversity.com Home Page
 



OpenSolaris man pages main menu


Introduction to Library Functions                  PCREPERFORM(3)



NAME
     PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

     Two aspects of performance are discussed below: memory usage
     and  processing  time. The way you express your pattern as a
     regular expression can affect both of them.

MEMORY USAGE

     Patterns are compiled by PCRE into  a  reasonably  efficient
     byte  code,  so  that  most  simple patterns do not use much
     memory. However, there is one case where memory usage can be
     unexpectedly  large.  When  a parenthesized subpattern has a
     quantifier with a minimum greater than 1  and/or  a  limited
     maximum,  the  whole  subpattern is repeated in the compiled
     code. For example, the pattern

       (abcdef){2,4}

     is compiled as if it were

       (abcdef)(abcdef)((abcdef)(abcdef)?)?

     (Technical aside: It is done  this  way  so  that  backtrack
     points  within  each of the repetitions can be independently
     maintained.)

     For regular expressions whose  quantifiers  use  only  small
     numbers,  this  is  not  usually  a problem. However, if the
     numbers are large, and particularly if such repetitions  are
     nested,  the  memory  usage can become an embarrassment. For
     example, the very simple pattern

       ((ab){1,1000}c){1,3}

     uses 51K bytes when compiled. When PCRE is compiled with its
     default  internal  pointer size of two bytes, the size limit
     on a compiled pattern is 64K, and this is reached  with  the
     above pattern if the outer repetition is increased from 3 to
     4. PCRE can be compiled to use larger internal pointers  and
     thus  handle  larger  compiled patterns, but it is better to
     try to rewrite your pattern to use less memory if you can.

     One way of reducing the memory usage for such patterns is to
     make  use  of  PCRE's  "subroutine" facility. Re-writing the
     above pattern as

       ((ab)(?2){0,999}c)(?1){0,2}

     reduces the  memory  requirements  to  18K,  and  indeed  it



SunOS 5.10                Last change:                          1






Introduction to Library Functions                  PCREPERFORM(3)



     remains  under  20K even with the outer repetition increased
     to 100. However, this pattern  is  not  exactly  equivalent,
     because  the "subroutine" calls are treated as atomic groups
     into which there can be no backtracking if there is a subse-
     quent  matching failure. Therefore, PCRE cannot do this kind
     of rewriting automatically.  Furthermore, there is a notice-
     able  loss  of  speed  when  executing the modified pattern.
     Nevertheless, if the atomic grouping is not  a  problem  and
     the loss of speed is acceptable, this kind of rewriting will
     allow you to process patterns  that  PCRE  cannot  otherwise
     handle.

PROCESING TIME

     Certain items in regular expression patterns  are  processed
     more  efficiently than others. It is more efficient to use a
     character class like [aeiou] than a set of  single-character
     alternatives  such  as (aeiou). In general, the simplest
     construction that provides the required behaviour is usually
     the  most efficient. Jeffrey Friedl's book contains a lot of
     useful general discussion about optimizing  regular  expres-
     sions  for  efficient  performance. This document contains a
     few observations about PCRE.

     Using Unicode character  properties  (the  \p,  \P,  and  \X
     escapes)  is slow, because PCRE has to scan a structure that
     contains data for over fifteen thousand characters  whenever
     it needs a character's property. If you can find an alterna-
     tive pattern that does not use character properties, it will
     probably be faster.

     When a pattern begins with .*  not  in  parentheses,  or  in
     parentheses that are not the subject of a backreference, and
     the PCREDOTAL option is set,  the  pattern  is  implicitly
     anchored  by PCRE, since it can match only at the start of a
     subject string. However, if PCREDOTAL  is  not  set,  PCRE
     cannot  make  this optimization, because the . metacharacter
     does not then match a newline, and  if  the  subject  string
     contains  newlines, the pattern may match from the character
     immediately following one of them instead of from  the  very
     start. For example, the pattern

       .*second

     matches the subject "first\nand second" (where \n stands for
     a newline character), with the match starting at the seventh
     character. In order to do this, PCRE has to retry the  match
     starting after every newline in the subject.

     If you are using such a pattern with subject strings that do
     not  contain  newlines,  the best performance is obtained by
     setting PCREDOTAL, or starting the  pattern  with  ^.*  or



SunOS 5.10                Last change:                          2






Introduction to Library Functions                  PCREPERFORM(3)



     ^.*?  to  indicate  explicit anchoring. That saves PCRE from
     having to scan along the subject looking for  a  newline  to
     restart at.

     Beware of patterns that contain nested  indefinite  repeats.
     These  can  take a long time to run when applied to a string
     that does not match. Consider the pattern fragment

       ^(a])*

     This can match "aaaa" in 16 different ways, and this  number
     increases  very  rapidly  as  the string gets longer. (The *
     repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
     those  cases other than 0 or 4, the ] repeats can match dif-
     ferent numbers of times.) When the remainder of the  pattern
     is  such that the entire match is going to fail, PCRE has in
     principle to try every possible variation, and this can take
     an extremely long time, even for relatively short strings.

     An optimization catches some of the more simple  cases  such
     as

       (a])*b

     where a literal character follows. Before embarking  on  the
     standard matching procedure, PCRE checks that there is a "b"
     later in the subject string, and if there is not,  it  fails
     the  match  immediately. However, when there is no following
     literal this optimization cannot be used. You  can  see  the
     difference by comparing the behaviour of

       (a])*\d

     with the pattern above. The former gives  a  failure  almost
     instantly  when  applied  to a whole line of "a" characters,
     whereas the latter takes an appreciable  time  with  strings
     longer than about 20 characters.

     In many cases, the solution  to  this  kind  of  performance
     issue is to use an atomic group or a possessive quantifier.

AUTHOR

     Philip Hazel
     University Computing Service
     Cambridge CB2 3QH, England.

REVISION

     Last updated: 06 March 2007
     Copyright (c) 1997-2007 University of Cambridge.




SunOS 5.10                Last change:                          3






Introduction to Library Functions                  PCREPERFORM(3)



ATRIBUTES
     See attributes(5) for descriptions of the  following  attri-
     butes:

     
       ATRIBUTE TYPE     ATRIBUTE VALUE
    
     Availability         SUNWpcre       
    
     Interface Stability  Uncommitted    
    

NOTES
     Source for PCRE is available on http:/opensolaris.org.









































SunOS 5.10                Last change:                          4



OpenSolaris man pages main menu

Contact us      |       About us      |       Term of use      |       Copyright © 2000-2010 MyWebUniversity.com ™