Revision 7.10
2/28/2012
The regular expression . matches any character except a line terminator unless the DOTALL flag
is specified.
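The effect of the flag can be checked with a minimal sketch like the one below. It assumes the flag meant here is Pattern.DOTALL from java.util.regex; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.regex.Pattern;

class DotAllSketch {
    // Returns true if the expression "a.b" matches the entire input under the given flags.
    static boolean dotMatches(String input, int flags) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Default: '.' does not match the line terminator between 'a' and 'b'.
        System.out.println(dotMatches("a\nb", 0));              // false
        // Assumed flag: with DOTALL, '.' matches the line terminator as well.
        System.out.println(dotMatches("a\nb", Pattern.DOTALL)); // true
    }
}
```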
By default, the regular expressions ^ and $ ignore line terminators and only match at the
beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated
then ^ matches at the beginning of input and after any line terminator except at the end of
input. When in MULTILINE mode, $ matches just before a line terminator or the end of the input
sequence.
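The anchor behavior described above can be compared side by side with a short sketch. It assumes the mode in question is Pattern.MULTILINE from java.util.regex; the helper findAll and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineSketch {
    // Collects every match of the expression in the input under the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> hits = new ArrayList<String>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "one\ntwo\nthree";
        // Default: ^ matches only at the beginning of the entire input.
        System.out.println(findAll("^\\w+", 0, input));                 // [one]
        // Default: $ matches only at the end of the entire input.
        System.out.println(findAll("\\w+$", 0, input));                 // [three]
        // Assumed mode: with MULTILINE, both anchors also match at line boundaries.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, input)); // [one, two, three]
        System.out.println(findAll("\\w+$", Pattern.MULTILINE, input)); // [one, two, three]
    }
}
```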
Groups and capturing
Capturing groups are numbered by counting their opening parentheses from left to right. In the
expression ((A)(B(C))), for example, there are four such groups:
1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)
Group zero always stands for the entire expression.
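The numbering above can be confirmed by matching the sample expression against the input "ABC" and listing each group. This is a minimal sketch using java.util.regex; the class name and the helper groups are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingSketch {
    // Returns groups 0..groupCount() for a full match, or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + "    " + g[i]);
        }
        // Prints:
        // 0    ABC
        // 1    ABC
        // 2    A
        // 3    BC
        // 4    C
    }
}
```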
Capturing groups are so named because, during a match, each subsequence of the input sequence
that matches such a group is saved. The captured subsequence may be used later in the
expression, via a back reference, and may also be retrieved from the matcher once the match
operation is complete.
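Both uses of a captured subsequence can be seen in one small sketch: the back reference \1 reuses group one inside the expression, and group(1) retrieves it from the matcher afterwards. The expression, the sample sentence, and the method name are hypothetical illustrations, not part of the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceSketch {
    // Finds the first word that is immediately repeated after a single space.
    // \1 is a back reference to whatever group one captured.
    static String firstRepeatedWord(String input) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(input);
        // After a successful find(), the capture is retrieved from the matcher.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // is
        System.out.println(firstRepeatedWord("no repeats here"));   // null
    }
}
```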
The captured input associated with a group is always the subsequence that the group most
recently matched. If a group is evaluated a second time because of quantification then its
previously captured value, if any, will be retained if the second evaluation fails. Matching the
string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All
captured input is discarded at the beginning of each match.
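The "aba" example, and the discarding of captures between matches, can be reproduced with the sketch below. It uses java.util.regex; the class name, the helper methods, and the second input "ab a" are hypothetical additions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RetainedCaptureSketch {
    static final Pattern P = Pattern.compile("(a(b)?)+");

    // Group two after matching the entire input; null if the input does not match
    // or if group two did not participate in the match.
    static String groupTwoOnFullMatch(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    // Group two as seen after each successive find() on the same matcher.
    static List<String> groupTwoPerFind(String input) {
        List<String> values = new ArrayList<String>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        // The second iteration of + fails to match (b)?, so the earlier "b" is retained.
        System.out.println(groupTwoOnFullMatch("aba")); // b
        // Two separate matches, "ab" and "a": the capture from the first is discarded
        // before the second begins, so group two is unset (null) the second time.
        System.out.println(groupTwoPerFind("ab a"));    // [b, null]
    }
}
```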
Groups beginning with (? are pure, non-capturing groups that do not capture text and do not
count towards the group total.
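A quick way to see that such groups are excluded from the total is to compare groupCount() for two otherwise identical expressions. This is a minimal sketch using the (?: ... ) form in java.util.regex; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingSketch {
    // Number of capturing groups in the expression (group zero is not included in this count).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Group one after a full match, or null if the input does not match.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(A)(B)"));     // 2
        System.out.println(groupCount("(?:A)(B)"));   // 1
        // Because (?:A) is not counted, group one is now the (B) group.
        System.out.println(groupOne("(?:A)(B)", "AB")); // B
    }
}
```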
Unicode support
Regular expression support follows Unicode Technical Report #18: Unicode Regular Expression
Guidelines, implementing its second level of support, though with a slightly different concrete
syntax.