User Guide for Cisco Security MARS Local Controller
78-17020-01
Appendix B Regular Expression Reference
Assertions
(?<=ab(c|de))
is not permitted, because its single top-level branch can match two different lengths, but it is acceptable
if rewritten to use two top-level branches:
(?<=abc|abde)
The implementation of lookbehind assertions is, for each alternative, to temporarily move the current
position back by the fixed width and then try to match. If there are insufficient characters before the
current position, the match is deemed to fail.
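The step-back procedure described above can be sketched in a few lines of Python. This is only an illustrative model, not PCRE's actual code: the function name lookbehind_at is hypothetical, and it assumes each alternative is a plain literal string, whereas PCRE accepts any fixed-width subpattern.

```python
def lookbehind_at(subject, pos, alternatives):
    """Model of a positive lookbehind such as (?<=abc|abde) tested at pos.

    Assumption: every alternative is a literal string, so its fixed
    width is simply its length.
    """
    for alt in alternatives:
        # Temporarily move back by this alternative's fixed width.
        start = pos - len(alt)
        if start < 0:
            # Insufficient characters before pos: this alternative fails.
            continue
        if subject[start:pos] == alt:
            return True
    return False


if __name__ == "__main__":
    alts = ["abc", "abde"]
    print(lookbehind_at("xabdeZ", 5, alts))  # "abde" ends at position 5
    print(lookbehind_at("bc", 2, alts))      # too few characters for either branch
```

Each alternative is tried with its own width, which is why two top-level branches of different lengths are acceptable while a single branch of variable length is not.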
PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode) to appear in
lookbehind assertions, because it makes it impossible to calculate the length of the lookbehind. The \X
escape, which can match different numbers of bytes, is also not permitted.
Atomic groups can be used in conjunction with lookbehind assertions to specify efficient matching at
the end of the subject string. Consider a simple pattern such as
abcd$
when applied to a long string that does not match. Because matching proceeds from left to right, PCRE
will look for each "a" in the subject and then see if what follows matches the rest of the pattern. If the
pattern is specified as
^.*abcd$
the initial .* matches the entire string at first, but when this fails (because there is no following "a"), it
backtracks to match all but the last character, then all but the last two characters, and so on. Once again
the search for "a" covers the entire string, from right to left, so we are no better off. However, if the
pattern is written as
^(?>.*)(?<=abcd)
or, equivalently, using the possessive quantifier syntax,
^.*+(?<=abcd)
there can be no backtracking for the .* item; it can match only the entire string. The subsequent
lookbehind assertion does a single test on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the processing time.
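The difference in work can be illustrated with a rough Python model that counts match attempts. This is a sketch under simplifying assumptions, not a measurement of PCRE itself: the function names scan_attempts and atomic_attempts are hypothetical, and the subject is assumed to be a single line with no newline characters.

```python
def scan_attempts(subject, tail="abcd"):
    """Model of abcd$ : try a match at every "a" found from left to right.

    Returns (matched, number_of_attempts).
    """
    attempts = 0
    for i, ch in enumerate(subject):
        if ch == tail[0]:
            attempts += 1
            # The rest of the pattern must match up to the end of the subject.
            if subject[i:] == tail:
                return True, attempts
    return False, attempts


def atomic_attempts(subject, tail="abcd"):
    """Model of ^(?>.*)(?<=abcd) : consume the whole string, then do a
    single lookbehind test on the last len(tail) characters.

    Returns (matched, number_of_attempts).
    """
    end = len(subject)
    start = end - len(tail)
    matched = start >= 0 and subject[start:end] == tail
    return matched, 1


if __name__ == "__main__":
    long_subject = "a" * 1000 + "x"     # contains many "a"s but does not end in "abcd"
    print(scan_attempts(long_subject))   # one attempt per "a" in the subject
    print(atomic_attempts(long_subject)) # a single test, whatever the length
```

In this model the number of attempts for the plain pattern grows with the number of "a" characters in the subject, while the atomic-group form always performs one test.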
Using Multiple Assertions
Several assertions (of any sort) may occur in succession. For example,
(?<=\d{3})(?<!999)foo
matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied
independently at the same point in the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same three characters are not "999". This
pattern does not match "foo" preceded by six characters, the first three of which are digits and the last three
of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is