Introduction to Library Functions                         PCRE(3)



NAME
     PCRE - Perl-compatible regular expressions

INTRODUCTION

     The PCRE library is a set of functions that implement  regu-
     lar  expression  pattern  matching using the same syntax and
     semantics as Perl, with  just  a  few  differences.  Certain
     features  that  appeared  in  Python  and  PCRE  before they
     appeared in Perl are also available using the Python syntax.
     There  is  also  some support for certain .NET and Oniguruma
     syntax items, and there is an  option  for  requesting  some
     minor changes that give better JavaScript compatibility.

     The current implementation of PCRE (release 7.x) corresponds
     approximately  with  Perl  5.10, including support for UTF-8
     encoded strings and  Unicode  general  category  properties.
     However,  UTF-8  and  Unicode  support  has to be explicitly
     enabled;  it  is  not  the  default.  The   Unicode   tables
     correspond to Unicode release 5.0.0.

     In addition to the Perl-compatible matching  function,  PCRE
     contains  an  alternative matching function that matches the
     same compiled patterns in a different way. In  certain  cir-
     cumstances,  the  alternative  function has some advantages.
     For a discussion of the two  matching  algorithms,  see  the
     pcrematching page.

     PCRE is written in C and released as a C library.  A  number
     of  people  have  written wrappers and interfaces of various
     kinds.  In  particular,  Google  Inc.    have   provided   a
     comprehensive  C++  wrapper. This is now included as part of
     the PCRE distribution. The pcrecpp page has details of  this
     interface.  Other people's contributions can be found in the
     Contrib directory at the primary FTP site, which is:

     ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

     Details of exactly which Perl  regular  expression  features
     are  and  are  not  supported  by PCRE are given in separate
     documents. See the pcrepattern and pcrecompat  pages.  There
     is a syntax summary in the pcresyntax page.

     Some features of PCRE can be included, excluded, or  changed
     when  the library is built. The pcre_config() function makes
     it possible for a client  to  discover  which  features  are
     available.  The  features  themselves  are  described in the
     pcrebuild page. Documentation about building PCRE for  vari-
     ous operating systems can be found in the README file in the
     source distribution.





SunOS 5.10                Last change:                          1






     The library contains a number of undocumented internal func-
     tions  and data tables that are used by more than one of the
     exported external functions, but which are not intended  for
     use   by  external  callers.  Their  names  all  begin  with
     "_pcre_", which hopefully will not provoke any name clashes.
     In some environments, it is possible to control which exter-
     nal symbols are exported when a shared library is built, and
     in these cases the undocumented symbols are not exported.

USER DOCUMENTATION

     The user documentation for PCRE comprises a number  of  dif-
     ferent  sections.  In  the  "man" format, each of these is a
     separate "man page". In the HTML format, each is a  separate
     page,  linked from the index page. In the plain text format,
     all the sections are concatenated, for  ease  of  searching.
     The sections are as follows:

       pcre              this document
       pcre-config       show PCRE installation configuration
                           information
       pcreapi           details of PCRE's native C API
       pcrebuild         options for building PCRE
       pcrecallout       details of the callout feature
       pcrecompat        discussion of Perl compatibility
       pcrecpp           details of the C++ wrapper
       pcregrep          description of the pcregrep command
       pcrematching      discussion of the two matching algorithms
       pcrepartial       details of the partial matching facility
       pcrepattern       syntax and semantics of supported
                           regular expressions
       pcresyntax        quick syntax reference
       pcreperform       discussion of performance issues
       pcreposix         the POSIX-compatible C API
       pcreprecompile    details of saving and re-using precompiled
                           patterns
       pcresample        discussion of the sample program
       pcrestack         discussion of stack usage
       pcretest          description of the pcretest testing command

     In addition, in the "man" and HTML formats, there is a short
     page  for each C library function, listing its arguments and
     results.

LIMITATIONS

     There are some size limitations in PCRE but it is hoped that
     they will never in practice be relevant.








     The maximum length of a  compiled  pattern  is  65539  (sic)
     bytes  if PCRE is compiled with the default internal linkage
     size of 2. If you want to process regular  expressions  that
     are  truly  enormous,  you can compile PCRE with an internal
     linkage size of 3 or 4 (see the README file  in  the  source
     distribution  and  the pcrebuild documentation for details).
     In these cases the limit is substantially larger.   However,
     the speed of execution is slower.

     All values in repeating quantifiers must be less than 65536.

     There is no limit to the  number  of  parenthesized  subpat-
     terns, but there can be no more than 65535 capturing subpat-
     terns.

     The maximum length of name for  a  named  subpattern  is  32
     characters,  and  the maximum number of named subpatterns is
     10000.

     The maximum length of a subject string is the largest  posi-
     tive number that an integer variable can hold. However, when
     using the traditional matching function, PCRE uses recursion
     to handle subpatterns and indefinite repetition.  This means
     that the available stack space may limit the size of a  sub-
     ject string that can be processed by certain patterns. For a
     discussion of stack issues, see the pcrestack documentation.

UTF-8 AND UNICODE PROPERTY SUPPORT

     From release 3.3, PCRE has had some  support  for  character
     strings  encoded  in  the UTF-8 format. For release 4.0 this
     was greatly extended to cover most common requirements,  and
     in  release  5.0  additional  support  for  Unicode  general
     category properties was added.

     In order to process UTF-8 strings, you must  build  PCRE  to
     include  UTF-8  support  in  the code, and, in addition, you
     must call pcre_compile() with the  PCRE_UTF8  option  flag.
     When  you  do this, both the pattern and any subject strings
     that are matched against it are  treated  as  UTF-8  strings
     instead of just strings of bytes.

     If you compile PCRE with UTF-8 support, but do not use it at
     run  time,  the  library will be a bit bigger, but the addi-
     tional run time overhead is limited to testing the PCRE_UTF8
     flag occasionally, so should not be very big.

     If PCRE is built with  Unicode  character  property  support
     (which  implies UTF-8 support), the escape sequences \p{..},
     \P{..}, and \X are supported.  The available properties that
     can be tested are limited to the general category properties
     such as Lu for an upper case letter  or  Nd  for  a  decimal
     number,  the Unicode script names such as Arabic or Han, and
     the derived properties Any and L&. A full list is  given  in
     the pcrepattern documentation. Only the short names for pro-
     perties are supported. For example, \p{L} matches a  letter.
     Its  Perl  synonym,  \p{Letter}, is not supported.  Further-
     more, in Perl, many properties may optionally be prefixed by
     "Is", for compatibility with Perl 5.6. PCRE does not support
     this.

  Validity of UTF-8 strings

     When you set the PCRE_UTF8 flag, the strings passed as  pat-
     terns  and subjects are (by default) checked for validity on
     entry to the relevant functions. From release 7.3  of  PCRE,
     the  check  is according to the rules of RFC 3629, which are
     themselves derived from the Unicode  specification.  Earlier
     releases  of  PCRE  followed  the  rules  of RFC 2279, which
     allows the full range of 31-bit values  (0  to  0x7FFFFFFF).
     The  current  check  allows only values in the range U+0 to
     U+10FFFF, excluding U+D800 to U+DFFF.
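
     The RFC 3629 rules just described can be sketched as a small
     standalone checker. This is only an illustration and  is  not
     PCRE's internal code: the function name is hypothetical, and
     the rejection of overlong encodings is taken from  RFC  3629
     itself rather than from anything stated on this page.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if s[0..len-1] is valid UTF-8 under RFC 3629, else 0. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        uint32_t cp;      /* code point being assembled */
        uint32_t min;     /* smallest value allowed for this length */
        size_t extra;     /* continuation bytes expected */

        if (c < 0x80)                { cp = c;        extra = 0; min = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; min = 0x10000; }
        else return 0;    /* stray continuation byte, or a 5/6-byte lead */

        if (extra > len - i - 1) return 0;          /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;      /* not a continuation */
            cp = (cp << 6) | (uint32_t)(cc & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* above U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */

        i += extra + 1;
    }
    return 1;
}
```

     For example, the three bytes ED A0 80 (which  would  encode
     U+D800) are rejected by this sketch, while F4 8F BF BF (the
     encoding of U+10FFFF) is accepted.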

     The excluded code points are the  "Low  Surrogate  Area"  of
     Unicode,  of  which the Unicode Standard says this: "The Low
     Surrogate Area does not contain any  character  assignments,
     consequently  no character code charts or namelists are pro-
     vided for this area. Surrogates are reserved  for  use  with
     UTF-16 and then must be used in pairs." The code points that
     are encoded by UTF-16 pairs  are  available  as  independent
     code  points  in  the  UTF-8  encoding. (In other words, the
     whole surrogate thing is a fudge  for  UTF-16  which  unfor-
     tunately messes up UTF-8.)

     If an invalid UTF-8 string  is  passed  to  PCRE,  an  error
     return  (PCRE_ERROR_BADUTF8)  is  given. In some situations,
     you may already know that your strings are valid, and there-
     fore  want  to skip these checks in order to improve perfor-
     mance. If you set the PCRE_NO_UTF8_CHECK  flag  at  compile
     time  or  at run time, PCRE assumes that the pattern or sub-
     ject it is given (respectively) contains  only  valid  UTF-8
     codes.  In  this case, it does not diagnose an invalid UTF-8
     string.

     If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK
     is  set,  what happens depends on why the string is invalid.
     If the string conforms to the "old" definition of UTF-8 (RFC
     2279),  it  is  processed  as  a string of characters in the
     range 0 to 0x7FFFFFFF. In other words, apart from the  ini-
     tial  validity  test,  PCRE  (when  in  UTF-8  mode) handles
     strings according to the more liberal  rules  of  RFC  2279.
     However,  if  the  string does not even conform to RFC 2279,
     the result is undefined. Your program may crash.







     If you want to process strings of values in the full range 0
     to 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old
     RFC, you can set PCRE_NO_UTF8_CHECK to bypass the more  res-
     trictive  test. However, in this situation, you will have to
     apply your own validity check.
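
     As a rough guide to what such a check has to cope with,  the
     sketch  below  gives  the  number of bytes that the old RFC
     2279 scheme uses for a value. The function name is hypothet-
     ical  and  the  byte  boundaries are those of RFC 2279, not
     something defined by PCRE. Any value that  needs  five  or
     six  bytes,  or  a  four-byte  value above 0x10FFFF, is one
     that the default RFC 3629 check would reject.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
   Number of bytes RFC 2279 uses to encode `value`,
   or 0 if the value does not fit in 31 bits. */
int utf8_len_rfc2279(uint32_t value)
{
    if (value < 0x80)        return 1;   /* 7 bits  */
    if (value < 0x800)       return 2;   /* 11 bits */
    if (value < 0x10000)     return 3;   /* 16 bits */
    if (value < 0x200000)    return 4;   /* 21 bits */
    if (value < 0x4000000)   return 5;   /* 26 bits */
    if (value <= 0x7FFFFFFF) return 6;   /* 31 bits */
    return 0;                            /* outside the RFC 2279 range */
}
```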

  General comments about UTF-8 mode

     1. An unbraced hexadecimal escape sequence  (such  as  \xb3)
     matches  a  two-byte UTF-8 character if the value is greater
     than 127.

     2. Octal numbers up to \777 are recognized, and  match  two-
     byte UTF-8 characters for values greater than \177.

     3. Repeat quantifiers apply to  complete  UTF-8  characters,
     not to individual bytes, for example: \x{100}{3}.

     4. The dot metacharacter matches one UTF-8 character instead
     of a single byte.

     5. The escape sequence \C can be used to match a single byte
     in UTF-8 mode, but its use can lead to some strange effects.
     This facility is not available in the  alternative  matching
     function, pcre_dfa_exec().

     6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W
     correctly test characters of any code value, but the charac-
     ters that PCRE recognizes as digits, spaces, or word charac-
     ters  remain  the  same  set as before, all with values less
     than 256. This remains true even when PCRE includes  Unicode
     property  support,  because  to do otherwise would slow down
     PCRE in many common cases. If you really want to test for  a
     wider  sense of, say, "digit", you must use Unicode property
     tests such as \p{Nd}.

     7. Similarly, characters that match the POSIX named  charac-
     ter classes are all low-valued characters.

     8. However, the Perl 5.10 horizontal and vertical whitespace
     matching  escapes  (\h,  \H,  \v,  and  \V) do match all the
     appropriate Unicode characters.

     9. Case-insensitive  matching  applies  only  to  characters
     whose  values  are  less than 128, unless PCRE is built with
     Unicode property support. Even when Unicode property support
     is  available, PCRE still uses its own character tables when
     checking the case of low-valued characters,  so  as  not  to
     degrade  performance.   The  Unicode property information is
     used only for  characters  with  higher  values.  Even  when
     Unicode  property  support is available, PCRE supports case-
     insensitive matching only when there is a one-to-one mapping
     between  a letter's cases. There are a small number of many-
     to-one mappings in Unicode; these are not supported by PCRE.
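
     Points 1 to 3 above follow from how  UTF-8  lays  out  code
     points  in bytes. The sketch below is illustrative only and
     is not PCRE code; both function  names  are  hypothetical.
     The  first  function  encodes  a code point, showing why a
     value such as 0xb3 or octal 777 occupies  two  bytes.  The
     second  counts  characters  rather  than  bytes,  which is
     the unit that a repeat quantifier applies to in UTF-8 mode.
     It assumes its input is already valid UTF-8, and the encoder
     does not reject surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode `cp` as UTF-8 into `out`.
   Returns the number of bytes written, or 0 if cp > 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: count characters in valid UTF-8 by counting
   every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count_chars(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

     With this encoder, 0xb3 becomes the two bytes C2 B3 and octal
     777 becomes C7 BF. A subject that  matches  \x{100}{3}  is
     three characters long but six bytes long (C4 80 repeated).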

AUTHOR

     Philip Hazel
     University Computing Service
     Cambridge CB2 3QH, England.

     Putting an actual email address here seems to  have  been  a
     spam magnet, so I've taken it away. If you want to email me,
     use my two initials, followed by the two digits 10,  at  the
     domain cam.ac.uk.

REVISION

     Last updated: 12 April 2008
     Copyright (c) 1997-2008 University of Cambridge.

ATTRIBUTES
     See attributes(5) for descriptions of the following attributes:

       ATTRIBUTE TYPE         ATTRIBUTE VALUE

       Availability           SUNWpcre

       Interface Stability    Uncommitted

NOTES
     Source for PCRE is available on http://opensolaris.org.





















