User Guide for Cisco Security MARS Local Controller
78-17020-01
Appendix B Regular Expression Reference
(?>\d+)foo
This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure
further into the pattern is prevented from backtracking into it. Backtracking past it to previous items,
however, works as normal.
An alternative description is that a subpattern of this type matches the string of characters that an
identical standalone pattern would match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as the above example can
be thought of as a maximizing repeat that must swallow everything it can. So, while both \d+ and \d+?
are prepared to adjust the number of digits they match in order to make the rest of the pattern match,
(?>\d+) can only match an entire sequence of digits.
Atomic groups in general can of course contain arbitrarily complicated subpatterns, and can be nested.
However, when the subpattern for an atomic group is just a single repeated item, as in the example above,
a simpler notation, called a "possessive quantifier", can be used. This consists of an additional + character
following a quantifier. Using this notation, the previous example can be rewritten as
\d++foo
Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY option is ignored. They
are a convenient notation for the simpler forms of atomic group. However, there is no difference in the
meaning or processing of a possessive quantifier and the equivalent atomic group.
The possessive quantifier syntax is an extension to the Perl syntax. It originates in Sun's Java package.
When a pattern contains an unlimited repeat inside a subpattern that can itself be repeated an unlimited
number of times, the use of an atomic group is the only way to avoid some failing matches taking a very
long time indeed. The pattern
(\D+|<\d+>)*[!?]
matches an unlimited number of substrings that either consist of non-digits, or digits enclosed in <>,
followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
it takes a long time before reporting failure. This is because the string can be divided between the internal
\D+ repeat and the external * repeat in a large number of ways, and all have to be tried. (The example
uses [!?] rather than a single character at the end, because both PCRE and Perl have an optimization that
allows for fast failure when a single character is used. They remember the last single character that is
required for a match, and fail early if it is not present in the string.) If the pattern is changed so that it
uses an atomic group, like this:
((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens quickly.
Back References
Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is
a back reference to a capturing subpattern earlier (that is, to its left) in the pattern, provided there have
been that many previous capturing left parentheses.
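As a short illustration of the idea, the sketch below uses Python's re module, whose handling of numbered back references is similar to PCRE in these simple cases; the patterns and subject strings are made up for the example.

```python
import re

# \1 must match the same text that capturing group 1 actually matched,
# not merely something that fits the subpattern (\w+).
doubled_word = re.compile(r"\b(\w+) \1\b")
found = doubled_word.search("it was the the best of times")
print(found.group(), "|", found.group(1))

# Groups are numbered by their opening parentheses, left to right, so
# \2 refers to the second (\d) and \1 to the first.
mirrored = re.compile(r"(\d)(\d)\2\1")
is_1221 = mirrored.fullmatch("1221") is not None
is_1212 = mirrored.fullmatch("1212") is not None
print(is_1221, is_1212)
```

The first pattern finds the repeated word; the second accepts 1221 but rejects 1212, because the back references must repeat the captured digits in reverse order.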