User Guide for Cisco Security MARS Local Controller
78-17020-01
Appendix B Regular Expression Reference
Full Stop (Period, Dot)
Outside a character class, a dot in the pattern matches any one character in the subject, including a
non-printing character, but not (by default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
which might be more than one byte long, except (by default) newline. If the PCRE_DOTALL option is
set, dots match newlines as well. The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no
special meaning in a character class.
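The dot rules above can be tried out with a short sketch. It uses Python's re module, which is not PCRE; the assumption here is that re.DOTALL and re.MULTILINE behave like PCRE_DOTALL and PCRE_MULTILINE for these simple cases. The subject string and variable names are illustrative only.

```python
import re

subject = "ab\ncd"

# By default, a dot matches any single character except newline.
default_matches = re.findall(r".", subject)

# With DOTALL set, a dot matches newline as well.
dotall_matches = re.findall(r".", subject, flags=re.DOTALL)

# MULTILINE changes circumflex and dollar only; dot is unaffected.
multiline_matches = re.findall(r".", subject, flags=re.MULTILINE)

# Inside a character class, a dot is an ordinary literal character.
literal_dots = re.findall(r"[.]", "a.b")

print(default_matches)    # ['a', 'b', 'c', 'd']
print(dotall_matches)     # ['a', 'b', '\n', 'c', 'd']
print(multiline_matches)  # ['a', 'b', 'c', 'd']
print(literal_dots)       # ['.']
```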
Matching a Single Byte
Outside a character class, the escape sequence \C matches any one byte, both in and out of UTF-8 mode.
Unlike a dot, it can match a newline. The feature is provided in Perl in order to match individual bytes
in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, what remains in the string
may be a malformed UTF-8 string. For this reason, the \C escape sequence is best avoided.
PCRE does not allow \C to appear in lookbehind assertions (described below), because in UTF-8 mode
this would make it impossible to calculate the length of the lookbehind.
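Python's re module has no \C escape, so the following sketch does not use it. Instead it illustrates, under that stated limitation, why consuming a single byte of a multi-byte UTF-8 character can leave a malformed string behind. The helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One character that occupies two bytes in UTF-8 (0xC3 0xA9).
encoded = "é".encode("utf-8")

# Simulate a single-byte match consuming only the first byte.
consumed, remainder = encoded[:1], encoded[1:]

print(len(encoded))              # 2
print(is_valid_utf8(encoded))    # True
print(is_valid_utf8(consumed))   # False: lead byte without its continuation
print(is_valid_utf8(remainder))  # False: continuation byte on its own
```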
Square Brackets and Character Classes
An opening square bracket introduces a character class, terminated by a closing square bracket. A
closing square bracket on its own is not special. If a closing square bracket is required as a member of
the class, it should be the first data character in the class (after an initial circumflex, if present) or
escaped with a backslash.
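As a minimal sketch of the bracket rules above, the following uses Python's re module, which is assumed to treat a closing square bracket in these positions the same way PCRE does. The subject strings and variable names are illustrative.

```python
import re

subject = "a]b["

# A closing bracket placed first in the class is a member of the class.
first_position = re.findall(r"[]a]", subject)

# Escaping the closing bracket with a backslash has the same effect.
escaped = re.findall(r"[\]a]", subject)

# After an initial circumflex, a leading closing bracket is still a member.
negated = re.findall(r"[^]a]", subject)

# Outside a class, a closing bracket on its own is an ordinary character.
outside = re.findall(r"]", subject)

print(first_position)  # ['a', ']']
print(escaped)         # ['a', ']']
print(negated)         # ['b', '[']
print(outside)         # [']']
```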
A character class matches a single character in the subject. In UTF-8 mode, the character may occupy
more than one byte. A matched character must be in the set of characters defined by the class, unless the
first character in the class definition is a circumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not
the first character, or escape it with a backslash.
For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any
character that is not a lower case vowel. Note that a circumflex is just a convenient notation for
specifying the characters that are in the class by enumerating those that are not. A class that starts with
a circumflex is not an assertion: it still consumes a character from the subject string, and therefore it fails
if the current pointer is at the end of the string.
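The difference between a negated class and an assertion can be sketched as follows, again using Python's re module on the assumption that it matches PCRE for these constructs. The negative lookahead (?!c) is added here only for contrast and is not part of the text above.

```python
import re

# A class and its negation each consume exactly one character.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")

# A negated class must consume a character, so it fails at end of string.
negated_at_end = re.fullmatch(r"ab[^c]", "ab")

# A negative lookahead is an assertion: it consumes nothing and succeeds here.
lookahead_at_end = re.fullmatch(r"ab(?!c)", "ab")

print(vowels)                        # ['e', 'e']
print(non_vowels)                    # ['r', 'g', 'x']
print(negated_at_end is None)        # True
print(lookahead_at_end is not None)  # True
```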
In UTF-8 mode, characters with values greater than 255 can be included in a class as a literal string of
bytes, or by using the \x{...} escaping mechanism.
When caseless matching is set, any letters in a class represent both their upper case and lower case
versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. When running in UTF-8 mode, PCRE supports the concept
of case for characters with values greater than 128 only when it is compiled with Unicode property
support.
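The caseless behavior described above can be checked with a short sketch. It assumes that Python's re.IGNORECASE flag corresponds to PCRE caseless matching for ASCII letters; the subject string is illustrative.

```python
import re

subject = "AaB"

# Caseless: letters in the class stand for both cases.
caseless_positive = re.findall(r"[aeiou]", subject, flags=re.IGNORECASE)

# Caseless negated class: "A" is excluded along with "a".
caseless_negated = re.findall(r"[^aeiou]", subject, flags=re.IGNORECASE)

# Caseful negated class: "A" is not in the set, so it matches.
caseful_negated = re.findall(r"[^aeiou]", subject)

print(caseless_positive)  # ['A', 'a']
print(caseless_negated)   # ['B']
print(caseful_negated)    # ['A', 'B']
```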
The newline character is never treated in any special way in character classes, regardless of how the
PCRE_DOTALL and PCRE_MULTILINE options are set. A class such as [^a] always matches a newline.
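A quick way to confirm that a negated class matches newline under every option setting is sketched below. It uses Python's re module, with re.DOTALL and re.MULTILINE assumed to stand in for PCRE_DOTALL and PCRE_MULTILINE.

```python
import re

subject = "a\nb"

# Run the same negated class with no flags, each flag, and both together.
flag_settings = [0, re.DOTALL, re.MULTILINE, re.DOTALL | re.MULTILINE]
results = [re.findall(r"[^a]", subject, flags=f) for f in flag_settings]

# Every setting yields the same matches, and newline is among them.
for matches in results:
    print(matches)  # ['\n', 'b']
```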