B-21
User Guide for Cisco Security MARS Local Controller
78-17020-01
Appendix B Regular Expression Reference
We have put the pattern into parentheses, and caused the recursion to refer to them instead of the whole
pattern. In a larger pattern, keeping track of parenthesis numbers can be tricky. It may be more
convenient to use named parentheses instead. For this, PCRE uses (?P>name), which is an extension to
the Python syntax that PCRE uses for named parentheses (Perl does not provide named parentheses). We
could rewrite the above example as follows:
(?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
This particular example pattern contains nested unlimited repeats, and so the use of atomic grouping for
matching strings of non-parentheses is important when applying the pattern to strings that do not match.
For example, when this pattern is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if atomic grouping is not used, the match runs for a very long
time indeed because there are so many different ways the + and * repeats can carve up the subject, and
all have to be tested before failure can be reported.
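The cost described above can be made concrete with a small Python sketch. It is not PCRE and does not use a regex engine: count_splits and match_parens are hypothetical helper names introduced here for illustration. The first counts the ways a run of non-parenthesis characters can be cut into non-empty pieces, which is roughly what a backtracking matcher may have to explore without atomic grouping. The second hand-codes the same grammar as the pattern and never gives back characters from a run, which is the behaviour the atomic group asks for.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1
    # Choose the length of the first piece, then split the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


def match_parens(text, pos=0):
    """Return the index just past a balanced group starting at pos, or None.

    Hand-written analogue (an assumption, not PCRE itself) of
    \\( ( (?>[^()]+) | (?P>pn) )* \\)
    """
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1
        if ch == "(":
            # Recursive call, like (?P>pn) in the pattern.
            end = match_parens(text, pos)
            if end is None:
                return None
            pos = end
        else:
            # Consume the whole run at once and never give any of it back,
            # mirroring the atomic group (?>[^()]+).
            while pos < len(text) and text[pos] not in "()":
                pos += 1
    return None  # ran off the end without a closing parenthesis


if __name__ == "__main__":
    print(count_splits(10))                      # equals 2 ** 9
    print(match_parens("(ab(cd)ef)"))            # index past the closing ")"
    print(match_parens("(" + "a" * 50 + "()"))   # unbalanced, so None
```

A run of n characters can be split in 2**(n-1) ways, so the count doubles with every extra character, while the scan in match_parens visits each character only once before reporting failure.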
At the end of a match, the values set for any capturing subpatterns are those from the outermost level of
the recursion at which the subpattern value is set. If you want to obtain intermediate values, a callout
function can be used (see Subpatterns as Subroutines, page B-21, and the pcrecallout documentation).
If the pattern above is matched against
(ab(cd)ef)
the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If
additional parentheses are added, giving
\( ( ( (?>[^()]+) | (?R) )* ) \)
   ^                        ^
   ^                        ^
the string they capture is "ab(cd)ef", the contents of the top level parentheses. If there are more than 15
capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion,
which it does by using pcre_malloc, freeing it via pcre_free afterwards. If no memory can be obtained,
the match fails with the PCRE_ERROR_NOMEMORY error.
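The difference between the two captured values, "ef" and "ab(cd)ef", can be illustrated with a short Python sketch. It is a hand-written approximation, not PCRE: group_end and top_level_pieces are hypothetical names introduced here. The sketch lists what each iteration of the repeated group consumes at the top level of the recursion. A repeated capturing group keeps only its last iteration, whereas a group wrapped around the whole repeat keeps everything.

```python
def group_end(text, pos):
    """Return the index just past a balanced group starting at pos, or None."""
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    while pos < len(text):
        if text[pos] == ")":
            return pos + 1
        if text[pos] == "(":
            end = group_end(text, pos)
            if end is None:
                return None
            pos = end
        else:
            while pos < len(text) and text[pos] not in "()":
                pos += 1
    return None


def top_level_pieces(text):
    """List what each top-level iteration of the repeated group consumes.

    Returns None unless the whole text is a single balanced group.
    """
    if group_end(text, 0) != len(text):
        return None
    pieces = []
    pos = 1  # skip the opening parenthesis
    while text[pos] != ")":
        if text[pos] == "(":
            end = group_end(text, pos)   # a nested group, matched recursively
        else:
            end = pos
            while text[end] not in "()":
                end += 1                 # a run of non-parenthesis characters
        pieces.append(text[pos:end])
        pos = end
    return pieces


if __name__ == "__main__":
    pieces = top_level_pieces("(ab(cd)ef)")
    print(pieces)            # one entry per top-level iteration
    print(pieces[-1])        # what the repeated group holds at the end
    print("".join(pieces))   # what the added outer parentheses hold
```

The value "cd" never appears in the list because it is set inside the recursion, not at the top level.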
Do not confuse the (?R) item with the condition (R), which tests for recursion. Consider this pattern,
which matches text in angle brackets, allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are permitted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
In this pattern, (?(R) is the start of a conditional subpattern, with two different alternatives for the
recursive and non-recursive cases. The (?R) item is the actual recursive call.
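The roles of the (R) condition and the (?R) call can be separated in a hand-written Python sketch. It approximates the pattern and is not PCRE: match_angle and its recursing flag are hypothetical names introduced here. The flag plays the part of the (R) test, and the recursive function call plays the part of (?R). Digits are assumed to be the ASCII digits 0 to 9.

```python
ASCII_DIGITS = "0123456789"


def match_angle(text, pos=0, recursing=False):
    """Return the index just past a bracketed group starting at pos, or None.

    At the outer level any character other than < or > is accepted.
    Inside a nested group (recursing is True) only digits are accepted.
    """
    if pos >= len(text) or text[pos] != "<":
        return None
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # The recursive call, corresponding to (?R).
            end = match_angle(text, pos, recursing=True)
            if end is None:
                return None
            pos = end
        elif recursing and ch not in ASCII_DIGITS:
            # The test on the flag, corresponding to the condition (R).
            return None
        else:
            pos += 1
    return None  # no closing bracket found


if __name__ == "__main__":
    print(match_angle("<abc<123>def>"))   # nested part is all digits
    print(match_angle("<abc<12x>def>"))   # nested part contains a letter
```

The first call returns the index past the final bracket, and the second returns None because a letter appears inside a nested pair of brackets.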
Subpatterns as Subroutines
If the syntax for a recursive subpattern reference (either by number or by name) is used outside the
parentheses to which it refers, it operates like a subroutine in a programming language. An earlier
example pointed out that the pattern
(sens|respons)e and \1ibility