Provides an implementation of regular expressions, which is useful for
matching, searching, and replacing strings based on patterns. The two
fundamental classes are Pattern and Matcher. The former
takes a pattern described by means of a regular expression and compiles it
into a special internal representation. The latter matches the compiled
pattern against a given input.
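As a minimal sketch of that division of labor, the following compiles a pattern once and then creates a matcher per input. The class and method names (PatternMatcherSketch, findAll, mask) are illustrative and not part of the package.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternMatcherSketch {
    // Compile once and reuse; a compiled Pattern can be shared freely.
    static final Pattern NUMBER = Pattern.compile("[0-9]+");

    // Searching: collect every substring of the input that the pattern matches.
    static List<String> findAll(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Replacing: substitute a fixed placeholder for every match.
    static String mask(String input) {
        return NUMBER.matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        System.out.println(findAll("10 green bottles, 9 green bottles"));
        System.out.println(mask("10 green bottles"));
        // Matching: matches() succeeds only if the whole input fits the pattern.
        System.out.println(NUMBER.matcher("1234").matches());
    }
}
```

A Matcher holds per-input state, so a new one is obtained from the Pattern for each input string.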
The following table gives some basic examples of regular expressions and input strings that match them:
Regular expression | Matched string(s) |
---|---|
"Hello, World!" | "Hello, World!" |
"Hello, World." | "Hello, World!", "Hello, World?" |
"Hello, .*d!" | "Hello, World!", "Hello, Android!", "Hello, Dad!" |
"[0-9]+ green bottles" | "0 green bottles", "25 green bottles", "1234 green bottles" |
The following sections describe the various features in detail. There are also some implementation notes at the end.
Meta character | Description |
---|---|
\a | Match a BELL, \u0007. |
\A | Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input. |
\b, outside of a character set | Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. |
\b, within a character set | Match a BACKSPACE, \u0008. |
\B | Match if the current position is not a word boundary. |
\cX | Match a control-X character (replace X with actual character). |
\e | Match an ESCAPE, \u001B. |
\E | Ends quoting started by \Q. Meta characters, character classes, and operators become active again. |
\f | Match a FORM FEED, \u000C. |
\G | Match if the current position is at the end of the previous match. |
\n | Match a LINE FEED, \u000A. |
\N{UNICODE CHARACTER NAME} | Match the named Unicode character. |
\Q | Quotes all following characters until \E. The following text is treated as literal. |
\r | Match a CARRIAGE RETURN, \u000D. |
\t | Match a HORIZONTAL TABULATION, \u0009. |
\uhhhh | Match the character with the hex value hhhh. |
\Uhhhhhhhh | Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff. |
\x{hhhh} | Match the character with the hex value hhhh. From one to six hex digits may be supplied. |
\xhh | Match the character with the hex value hh. |
\Z | Match if the current position is at the end of input, but before the final line terminator, if one exists. |
\z | Match if the current position is at the end of input. |
\0n, \0nn, \0nnn | Match the character with the octal value n, nn, or nnn. Maximum value is 0377. |
\n | Back reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported in ICU regular expressions. |
[character set] | Match any one character from the character set. See character sets for a full description of what may appear between the square brackets. |
. | Match any character. |
^ | Match at the beginning of a line. |
$ | Match at the end of a line. |
\ | Quotes the following character, so that it loses any special meaning it might have. |
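A few of these meta characters are easier to see in use. The sketch below exercises word boundaries, back references, \Q...\E quoting, and the difference between ^ and \A. The class and method names are illustrative, and the MULTILINE flag is passed explicitly so that ^ matches at the start of each line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharacterSketch {
    // A word boundary on both sides restricts the search to whole words.
    // Pattern.quote wraps the word in \Q...\E so it is taken literally.
    static boolean containsWord(String input, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        return p.matcher(input).find();
    }

    // The back reference \1 must repeat whatever group 1 matched,
    // so this finds the first doubled word, or returns null.
    static String firstRepeatedWord(String input) {
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Between \Q and \E the '+' is a literal plus sign, not a quantifier.
    static boolean containsLiteralSum(String input) {
        return Pattern.compile("\\Q1+1=2\\E").matcher(input).find();
    }

    // With MULTILINE, ^ matches after a line terminator inside the input.
    static boolean someLineStartsWithB(String input) {
        return Pattern.compile("^b", Pattern.MULTILINE).matcher(input).find();
    }

    // \A matches only at the very beginning of the input, even with MULTILINE.
    static boolean inputStartsWithB(String input) {
        return Pattern.compile("\\Ab", Pattern.MULTILINE).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("a cat sat", "cat"));
        System.out.println(containsWord("concatenate", "cat"));
        System.out.println(firstRepeatedWord("it was the the best"));
        System.out.println(containsLiteralSum("so 1+1=2 holds"));
        System.out.println(someLineStartsWithB("a\nb"));
        System.out.println(inputStartsWithB("a\nb"));
    }
}
```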
Element | Description |
---|---|
[a] | The character set consisting of the letter 'a' only. |
[xyz] | The character set consisting of the letters 'x', 'y', and 'z', described by explicit enumeration. |
[x-z] | The character set consisting of the letters 'x', 'y', and 'z', described by means of a range. |
[^xyz] | The character set consisting of everything but the letters 'x', 'y', and 'z'. |
[[a-f][0-9]] | The character set formed by building the union of the two character sets [a-f] and [0-9]. |
[[a-z]&&[jkl]] | The character set formed by building the intersection of the two character sets [a-z] and [jkl]. You can also use a single '&', but this regular expression might not be portable. |
[[a-z]--[jkl]] | The character set formed by building the difference of the two character sets [a-z] and [jkl]. You can also use a single '-'. This operator is generally not portable. |
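The set operations can be tried one character at a time. This sketch uses only union, intersection, and negation; because the table notes that the difference operator is generally not portable, it expresses a difference as an intersection with a negated set instead. Names such as CharacterSetSketch and inSet are hypothetical.

```java
import java.util.regex.Pattern;

class CharacterSetSketch {
    // Union of two sets: any of a-f or 0-9.
    static final Pattern HEX_DIGIT = Pattern.compile("[[a-f][0-9]]");
    // Intersection: characters that are in both a-z and jkl.
    static final Pattern ONLY_JKL = Pattern.compile("[[a-z]&&[jkl]]");
    // Difference written as "a-z AND NOT jkl".
    static final Pattern A_TO_Z_WITHOUT_JKL = Pattern.compile("[[a-z]&&[^jkl]]");
    // Negation: everything except x, y, and z.
    static final Pattern NOT_XYZ = Pattern.compile("[^xyz]");

    // True if the single character c belongs to the set.
    static boolean inSet(Pattern set, char c) {
        return set.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(inSet(HEX_DIGIT, 'c'));
        System.out.println(inSet(HEX_DIGIT, 'g'));
        System.out.println(inSet(ONLY_JKL, 'k'));
        System.out.println(inSet(A_TO_Z_WITHOUT_JKL, 'k'));
        System.out.println(inSet(NOT_XYZ, 'a'));
    }
}
```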
Several frequently used character sets are predefined and named. They can be referenced by name, but otherwise behave like explicit character sets. The following table lists them:
Character set | Description |
---|---|
\d, \D | The set consisting of all digit characters (\d) or the opposite of it (\D). |
\s, \S | The set consisting of all space characters (\s) or the opposite of it (\S). |
\w, \W | The set consisting of all word characters (\w) or the opposite of it (\W). |
\X | The set of all grapheme clusters. |
\p{NAME}, \P{NAME} | The POSIX set with the specified NAME (\p{}) or the opposite of it (\P{}). Legal values for NAME are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit', 'Graph', 'Lower', 'Print', 'Punct', 'Upper', 'XDigit'. |
\p{inBLOCK}, \P{inBLOCK} | The character set equivalent to the given Unicode BLOCK (\p{}) or the opposite of it (\P{}). An example of a legal BLOCK name is 'Hebrew', meaning all Hebrew characters. |
\p{CATEGORY}, \P{CATEGORY} | The character set equivalent to the Unicode CATEGORY (\p{}) or the opposite of it (\P{}). An example of a legal CATEGORY name is 'Lu', meaning all uppercase letters. |
\p{javaMETHOD}, \P{javaMETHOD} | The character set equivalent to the isMETHOD() operation of the Character class (\p{}) or the opposite of it (\P{}). |
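The named sets can be used anywhere an explicit set can. The sketch below checks whole strings against a few of them; the helper allIn and the class name are hypothetical, and the block name is written as InHebrew (capital I), which is an assumption about the spelling most implementations accept.

```java
import java.util.regex.Pattern;

class PredefinedSetSketch {
    // True if the input is non-empty and every character belongs to the set.
    static boolean allIn(String set, String input) {
        return Pattern.matches("(?:" + set + ")+", input);
    }

    public static void main(String[] args) {
        System.out.println(allIn("\\d", "2024"));             // digits
        System.out.println(allIn("\\D", "2024"));             // non-digits
        System.out.println(allIn("\\s", " \t\n"));            // space characters
        System.out.println(allIn("\\w", "snake_case1"));      // word characters
        System.out.println(allIn("\\p{Alpha}", "abc"));       // POSIX set
        System.out.println(allIn("\\p{Lu}", "ABC"));          // Unicode category
        System.out.println(allIn("\\p{InHebrew}", "\u05D0\u05D1")); // Unicode block
        System.out.println(allIn("\\p{javaUpperCase}", "ABC"));     // Character.isUpperCase
    }
}
```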
Operator | Description |
---|---|
| | Alternation. A|B matches either A or B. |
* | Match 0 or more times. Match as many times as possible. |
+ | Match 1 or more times. Match as many times as possible. |
? | Match zero or one times. Prefer one. |
{n} | Match exactly n times. |
{n,} | Match at least n times. Match as many times as possible. |
{n,m} | Match between n and m times. Match as many times as possible, but not more than m. |
*? | Match 0 or more times. Match as few times as possible. |
+? | Match 1 or more times. Match as few times as possible. |
?? | Match zero or one times. Prefer zero. |
{n}? | Match exactly n times. |
{n,}? | Match at least n times, but no more than required for an overall pattern match. |
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n. |
*+ | Match 0 or more times. Match as many times as possible when first encountered; do not retry with fewer even if the overall match fails (possessive match). |
++ | Match 1 or more times. Possessive match. |
?+ | Match zero or one times. Possessive match. |
{n}+ | Match exactly n times. |
{n,}+ | Match at least n times. Possessive match. |
{n,m}+ | Match between n and m times. Possessive match. |
( ... ) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. |
(?: ... ) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. |
(?> ... ) | Atomic-match parentheses. The first match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, the search backs up to a position before the "(?>". |
(?# ... ) | Free-format comment (?# comment ). |
(?= ... ) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
(?! ... ) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
(?<= ... ) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
(?<! ... ) | Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
(?ismwx-ismwx: ... ) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled, or disabled if they follow the '-'. |
(?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. |
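The difference between greedy, reluctant, and possessive quantifiers, and the effect of groups and look-around assertions, is easiest to see on a short input. The following is a sketch with hypothetical helper names (firstMatch, yearAndMonth); the comments describe what each pattern is meant to show.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OperatorSketch {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Capturing groups: splits "YYYY-MM" into year and month, or returns null.
    static String[] yearAndMonth(String input) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})").matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String tags = "<a><b>";
        // Greedy: takes as much as possible, then backs off to find the last '>'.
        System.out.println(firstMatch("<.+>", tags));
        // Reluctant: takes as little as possible, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", tags));
        // Possessive: consumes the rest of the input and never gives any back,
        // so the closing '>' cannot be matched and the result is null.
        System.out.println(firstMatch("<.++>", tags));
        // Atomic group: a+ keeps every 'a' it took, so the following "ab" fails.
        System.out.println(firstMatch("(?>a+)ab", "aaab"));
        // Look-ahead: the digits must be followed by " green", which is not consumed.
        System.out.println(firstMatch("\\d+(?= green)", "25 green bottles"));
        // Look-behind: the digits must be preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42"));
        // Inline flag: (?i) switches to case-insensitive matching.
        System.out.println(firstMatch("(?i)hello", "Say HELLO"));
        System.out.println(String.join("/", yearAndMonth("2024-05")));
    }
}
```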
The CASE_INSENSITIVE flag silently assumes Unicode case-insensitivity. That is, the UNICODE_CASE flag is effectively a no-op.
The CANON_EQ flag is not supported at all (throws an exception).
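Code that must behave the same on this implementation and on others can be written defensively around these two notes. The sketch below is one possible approach, not a prescribed one: it always passes UNICODE_CASE together with CASE_INSENSITIVE, and it falls back to a plain pattern if CANON_EQ is rejected. The class and method names are hypothetical, and catching RuntimeException is an assumption, since the notes do not say which exception type is thrown.

```java
import java.util.regex.Pattern;

class FlagPortabilitySketch {
    // Passing both flags requests Unicode-aware case folding everywhere;
    // where CASE_INSENSITIVE already implies it, the extra flag changes nothing.
    static Pattern caseInsensitive(String regex) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Tries CANON_EQ first and falls back to a plain pattern if the
    // platform rejects the flag (exception type assumed to be unchecked).
    static Pattern canonEqOrPlain(String regex) {
        try {
            return Pattern.compile(regex, Pattern.CANON_EQ);
        } catch (RuntimeException e) {
            return Pattern.compile(regex);
        }
    }

    public static void main(String[] args) {
        // Lowercase and uppercase e-acute differ only in case.
        System.out.println(caseInsensitive("caf\u00e9").matcher("CAF\u00c9").matches());
        System.out.println(canonEqOrPlain("abc").matcher("abc").matches());
    }
}
```

With the fallback, canonical-equivalence matching is silently lost on platforms that reject the flag, so callers that depend on it would need to normalize their input themselves.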