java.lang.Object | |
↳ | java.util.regex.Pattern |
Patterns are compiled regular expressions. In many cases, convenience methods such as String.matches, String.replaceAll and String.split will be preferable, but if you need to do a lot of work with the same regular expression, it may be more efficient to compile it once and reuse it. The Pattern class and its companion, Matcher, are also a lot more powerful than the small amount of functionality exposed by String.
    // String convenience methods:
    boolean sawFailures = s.matches("Failures: \\d+");
    String farewell = s.replaceAll("Hello, (\\S+)", "Goodbye, $1");
    String[] fields = s.split(":");

    // Direct use of Pattern:
    Pattern p = Pattern.compile("Hello, (\\S+)");
    Matcher m = p.matcher(inputString);
    while (m.find()) { // Find each match in turn; String can't do this.
        String name = m.group(1); // Access a submatch group; String can't do this.
    }
Java supports a subset of Perl 5 regular expression syntax. An important gotcha is that Java has no regular expression literals, and uses plain old string literals instead. This means that you need an extra level of escaping. For example, the regular expression \s+ has to be represented as the string "\\s+".
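The extra level of escaping can be illustrated with a small sketch. The class name EscapingDemo and the sample input are made up for this example; only Pattern.compile and split come from the API described here.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // The regular expression \s+ is written as the string literal "\\s+".
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits the input on runs of whitespace.
    static String[] words(String input) {
        return WHITESPACE.split(input);
    }

    public static void main(String[] args) {
        for (String word : words("one  two\tthree")) {
            System.out.println(word);
        }
    }
}
```

Writing "\s+" with a single backslash would be read by the Java compiler as a string escape, not passed through to the regular expression engine.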
\ | Quote the following metacharacter (so \. matches a literal . ). |
\Q | Quote all following metacharacters until \E . |
\E | Stop quoting metacharacters (started by \Q ). |
\\ | A literal backslash. |
\uhhhh | The Unicode character U+hhhh (in hex). |
\xhh | The Unicode character U+00hh (in hex). |
\cx | The ASCII control character ^x (so \cH would be ^H, U+0008). |
\a | The ASCII bell character (U+0007). |
\e | The ASCII ESC character (U+001b). |
\f | The ASCII form feed character (U+000c). |
\n | The ASCII newline character (U+000a). |
\r | The ASCII carriage return character (U+000d). |
\t | The ASCII tab character (U+0009). |
It's possible to construct arbitrary character classes using set operations:
[abc] | Any one of a , b , or c . (Enumeration.) |
[a-c] | Any one of a , b , or c . (Range.) |
[^abc] | Any character except a , b , or c . (Negation.) |
[[a-f][0-9]] | Any character in either range. (Union.) |
[[a-z]&&[jkl]] | Any character in both ranges. (Intersection.) |
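As a rough illustration of the union and intersection rows above, the following sketch tests whole strings against each class. The class and method names are hypothetical, and the results were reasoned about against the standard Java implementation.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Union: any character from a-f or from 0-9.
    private static final Pattern UNION = Pattern.compile("[[a-f][0-9]]+");
    // Intersection: characters that are in a-z and also in the set j, k, l.
    private static final Pattern INTERSECTION = Pattern.compile("[[a-z]&&[jkl]]+");

    static boolean inUnion(String s) {
        return UNION.matcher(s).matches();
    }

    static boolean inIntersection(String s) {
        return INTERSECTION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inUnion("c0ffee"));
        System.out.println(inIntersection("jam"));
    }
}
```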
Most of the time, the built-in character classes are more useful:
\d | Any digit character. |
\D | Any non-digit character. |
\s | Any whitespace character. |
\S | Any non-whitespace character. |
\w | Any word character. |
\W | Any non-word character. |
\p{NAME} | Any character in the class with the given NAME. |
\P{NAME} | Any character not in the named class. |
There are a variety of named classes:
Unicode category names, prefixed by Is. For example \p{IsLu} for all uppercase letters.
Unicode block names, as accepted by forName(String), prefixed by In. For example \p{InHebrew} for all characters in the Hebrew block.
Character method names: the methods of Character whose name starts with is, but with the is replaced by java. For example, \p{javaLowerCase}.
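The three example classes named above can be tried out with String.matches. This is only a sketch: the class and method names are invented for illustration, and the Hebrew sample string is an arbitrary word written with Unicode escapes.

```java
class NamedClassDemo {
    // \p{IsLu}: uppercase letters (a Unicode category).
    static boolean allUppercase(String s) {
        return s.matches("\\p{IsLu}+");
    }

    // \p{InHebrew}: characters in the Hebrew block.
    static boolean allHebrew(String s) {
        return s.matches("\\p{InHebrew}+");
    }

    // \p{javaLowerCase}: characters for which Character.isLowerCase is true.
    static boolean allJavaLowerCase(String s) {
        return s.matches("\\p{javaLowerCase}+");
    }

    public static void main(String[] args) {
        System.out.println(allUppercase("ABC"));
        System.out.println(allHebrew("\u05e9\u05dc\u05d5\u05dd"));
        System.out.println(allJavaLowerCase("abc"));
    }
}
```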
Quantifiers match some number of instances of the preceding regular expression.
* | Zero or more. |
? | Zero or one. |
+ | One or more. |
{n} | Exactly n. |
{n,} | At least n. |
{n,m} | At least n but not more than m. |
Quantifiers are "greedy" by default, meaning that they will match the longest possible input sequence. There are also non-greedy quantifiers that match the shortest possible input sequence. They're the same as the greedy ones but with a trailing ?:
*? | Zero or more (non-greedy). |
?? | Zero or one (non-greedy). |
+? | One or more (non-greedy). |
{n}? | Exactly n (non-greedy). |
{n,}? | At least n (non-greedy). |
{n,m}? | At least n but not more than m (non-greedy). |
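The difference between greedy and non-greedy matching is easiest to see on input with more than one possible end point. In this sketch (GreedyDemo and firstMatch are hypothetical names), the same input is searched with .+ and with .+?.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ runs to the last '>' in the input.
        System.out.println(firstMatch("<.+>", input));
        // Non-greedy: .+? stops at the first '>' that lets the match succeed.
        System.out.println(firstMatch("<.+?>", input));
    }
}
```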
Quantifiers allow backtracking by default. There are also possessive quantifiers to prevent backtracking. They're the same as the greedy ones but with a trailing +:
*+ | Zero or more (possessive). |
?+ | Zero or one (possessive). |
++ | One or more (possessive). |
{n}+ | Exactly n (possessive). |
{n,}+ | At least n (possessive). |
{n,m}+ | At least n but not more than m (possessive). |
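A small sketch of what "prevent backtracking" means in practice, using made-up names. With a*a the engine can give back one a so that the final a matches. With a*+a the possessive quantifier keeps everything it consumed, so the final a has nothing left to match.

```java
class PossessiveDemo {
    // Greedy: a* backtracks by one character so the trailing 'a' can match.
    static boolean greedyMatches(String s) {
        return s.matches("a*a");
    }

    // Possessive: a*+ never gives characters back, so the trailing 'a' cannot match.
    static boolean possessiveMatches(String s) {
        return s.matches("a*+a");
    }

    public static void main(String[] args) {
        System.out.println(greedyMatches("aaa"));
        System.out.println(possessiveMatches("aaa"));
    }
}
```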
^ | At beginning of line. |
$ | At end of line. |
\A | At beginning of input. |
\b | At word boundary. |
\B | At non-word boundary. |
\G | At end of previous match. |
\z | At end of input. |
\Z | At end of input, or before newline at end. |
Look-around assertions assert that the subpattern does (positive) or doesn't (negative) match after (look-ahead) or before (look-behind) the current position, without including the matched text in the containing match. The maximum length of possible matches for look-behind patterns must not be unbounded.
(?=a) | Zero-width positive look-ahead. |
(?!a) | Zero-width negative look-ahead. |
(?<=a) | Zero-width positive look-behind. |
(?<!a) | Zero-width negative look-behind. |
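The look-around forms can be sketched as follows. LookaroundDemo, findAll, and the sample inputs are illustrative only. Note that the text matched by the assertion itself (the colon or the dollar sign) is not part of the returned match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects the text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Positive look-ahead: a word that is followed by a colon.
        System.out.println(findAll("\\w+(?=:)", "key: value"));
        // Positive look-behind: digits that are preceded by a dollar sign.
        System.out.println(findAll("(?<=\\$)\\d+", "cost: $42, qty 7"));
        // Negative look-behind: a number that is not preceded by a dollar sign.
        System.out.println(findAll("(?<!\\$)\\b\\d+", "cost: $42, qty 7"));
    }
}
```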
(a) | A capturing group. |
(?:a) | A non-capturing group. |
(?>a) | An independent non-capturing group. (The first match of the subgroup is the only match tried.) |
\n | The text already matched by capturing group n. |
See group() for details of how capturing groups are numbered and accessed.
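Capturing groups and back-references can be combined, for instance to spot a doubled word. The following is a sketch with hypothetical names; the pattern captures a word in group 1 and then requires the same text again via \1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // (\w+) captures a word; \1 must then match exactly the same text again.
    private static final Pattern REPEATED_WORD = Pattern.compile("\\b(\\w+) \\1\\b");

    // Returns the first word that appears twice in a row, or null if there is none.
    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED_WORD.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test"));
    }
}
```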
ab | Expression a followed by expression b. |
a|b | Either expression a or expression b. |
(?dimsux-dimsux:a) | Evaluates the expression a with the given flags enabled/disabled. |
(?dimsux-dimsux) | Evaluates the rest of the pattern with the given flags enabled/disabled. |
The flags are:
i | CASE_INSENSITIVE | case insensitive matching |
d | UNIX_LINES | only accept '\n' as a line terminator |
m | MULTILINE | allow ^ and $ to match beginning/end of any line |
s | DOTALL | allow . to match '\n' ("s" for "single line") |
u | UNICODE_CASE | enable Unicode case folding |
x | COMMENTS | allow whitespace and comments |
Either set of flags may be empty. For example, (?i-m) would turn on case-insensitivity and turn off multiline mode, (?i) would just turn on case-insensitivity, and (?-m) would just turn off multiline mode.
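The two inline flag forms differ in scope, which this sketch tries to show (the class and method names are invented for the example). (?i) applies to the rest of the pattern, while (?i:h) applies only to the enclosed expression.

```java
class InlineFlagsDemo {
    // (?i) switches on case-insensitive matching for the rest of the pattern.
    static boolean wholePatternIgnoresCase(String input) {
        return input.matches("(?i)hello");
    }

    // (?i:h) is case-insensitive only for 'h'; "ello" must still be lowercase.
    static boolean onlyFirstLetterIgnoresCase(String input) {
        return input.matches("(?i:h)ello");
    }

    public static void main(String[] args) {
        System.out.println(wholePatternIgnoresCase("HELLO"));
        System.out.println(onlyFirstLetterIgnoresCase("Hello"));
        System.out.println(onlyFirstLetterIgnoresCase("HELLO"));
    }
}
```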
Note that on Android, UNICODE_CASE is always on: case-insensitive matching will always be Unicode-aware.
There are two other flags not settable via this mechanism: CANON_EQ and LITERAL. Attempts to use CANON_EQ on Android will throw an exception.
In some cases, Android will recognize that a regular expression is a simple special case that can be handled more efficiently. This is true of both the convenience methods in String and the methods in Pattern.
Constants | | |
---|---|---|
int | CANON_EQ | This constant specifies that a character in a Pattern and a character in the input string only match if they are canonically equivalent. |
int | CASE_INSENSITIVE | This constant specifies that a Pattern is matched case-insensitively. |
int | COMMENTS | This constant specifies that a Pattern may contain whitespace or comments. |
int | DOTALL | This constant specifies that the '.' meta character matches arbitrary characters, including line endings, which is normally not the case. |
int | LITERAL | This constant specifies that the whole Pattern is to be taken literally, that is, all meta characters lose their meanings. |
int | MULTILINE | This constant specifies that the meta characters '^' and '$' match at the beginning and end of each input line, respectively. |
int | UNICODE_CASE | This constant specifies that a Pattern that uses case-insensitive matching will use Unicode case folding. |
int | UNIX_LINES | This constant specifies that only the Unix line ending ('\n') is recognized as a line terminator by the '.', '^', and '$' meta characters. |
Public Methods |
---|
Returns a compiled form of the given regularExpression, as modified by the given flags. |
Equivalent to Pattern.compile(pattern, 0). |
Returns the flags supplied to compile. |
Returns a Matcher for this pattern applied to the given input. |
Tests whether the given regularExpression matches the given input. |
Returns the regular expression supplied to compile. |
Quotes the given string using "\Q" and "\E", so that all meta-characters lose their special meaning. |
Equivalent to split(input, 0). |
Splits the given input at occurrences of this pattern. |
Returns a string containing a concise, human-readable description of this object. |
Protected Methods |
---|
Called before the object's memory is reclaimed by the VM. |
Inherited Methods |
---|
From class java.lang.Object |
This constant specifies that a character in a Pattern and a character in the input string only match if they are canonically equivalent. It is (currently) not supported in Android.
This constant specifies that a Pattern is matched case-insensitively. That is, the patterns "a+" and "A+" would both match the string "aAaAaA". See UNICODE_CASE. Corresponds to (?i).
This constant specifies that a Pattern may contain whitespace or comments. Otherwise comments and whitespace are taken as literal characters. Corresponds to (?x).
This constant specifies that the '.' meta character matches arbitrary characters, including line endings, which is normally not the case. Corresponds to (?s).
This constant specifies that the whole Pattern is to be taken literally, that is, all meta characters lose their meanings.
This constant specifies that the meta characters '^' and '$' match at the beginning and end of each input line, respectively. Normally, they match only the beginning and the end of the complete input. Corresponds to (?m).
This constant specifies that a Pattern that uses case-insensitive matching will use Unicode case folding. On Android, UNICODE_CASE is always on: case-insensitive matching will always be Unicode-aware. If your code is intended to be portable and uses case-insensitive matching on non-ASCII characters, you should use this flag. Corresponds to (?u).
This constant specifies that only the Unix line ending ('\n') is recognized as a line terminator by the '.', '^', and '$' meta characters. Corresponds to (?d).
Returns a compiled form of the given regularExpression, as modified by the given flags. See the flags overview for more on flags.
PatternSyntaxException | if the regular expression is syntactically incorrect. |
---|
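A sketch of compiling with flags and of handling a syntactically incorrect expression. The class name, method names, and sample log text are assumptions made for this example; the flag constants and PatternSyntaxException are the ones documented on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileFlagsDemo {
    // Counts lines that start with "error" in any letter case.
    static int countErrorLines(String log) {
        // Flags are combined with bitwise OR.
        Pattern p = Pattern.compile("^error", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = p.matcher(log);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Returns false if compiling the expression throws PatternSyntaxException.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex, 0);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(countErrorLines("Error: a\ninfo\nERROR: b"));
        System.out.println(isValidRegex("("));
    }
}
```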
Equivalent to Pattern.compile(pattern, 0).
Returns a Matcher for this pattern applied to the given input.
The Matcher can be used to match the Pattern against the whole input, find occurrences of the Pattern in the input, or replace parts of the input.
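The three uses of a Matcher mentioned above might look like this in code. This is a minimal sketch with made-up names that reuses one compiled Pattern for each operation.

```java
import java.util.regex.Pattern;

class MatcherDemo {
    // Compiled once and reused by every method below.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Match against the whole input.
    static boolean wholeInputIsDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    // Find an occurrence anywhere in the input.
    static boolean containsDigits(String input) {
        return DIGITS.matcher(input).find();
    }

    // Replace every occurrence in the input.
    static String maskDigits(String input) {
        return DIGITS.matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        System.out.println(wholeInputIsDigits("a42"));
        System.out.println(containsDigits("a42"));
        System.out.println(maskDigits("a1b22"));
    }
}
```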
Tests whether the given regularExpression matches the given input.
Equivalent to Pattern.compile(regularExpression).matcher(input).matches().
If the same regular expression is to be used for multiple operations, it may be more efficient to reuse a compiled Pattern.
Quotes the given string using "\Q" and "\E", so that all meta-characters lose their special meaning. This method correctly escapes embedded instances of "\Q" or "\E". If the entire result is to be passed verbatim to compile(String), it's usually clearer to use the LITERAL flag instead.
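The two ways of matching text literally can be compared side by side. The sketch below uses hypothetical names and an arithmetic string chosen because + is a metacharacter.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Quotes the literal first, then uses it as an ordinary regular expression.
    static boolean matchesQuoted(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    // Compiles the literal with the LITERAL flag; no quoting is needed.
    static boolean matchesWithLiteralFlag(String literal, String input) {
        return Pattern.compile(literal, Pattern.LITERAL).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unquoted, "1+" means "one or more 1s", so "11=2" matches.
        System.out.println(Pattern.matches("1+1=2", "11=2"));
        System.out.println(matchesQuoted("1+1=2", "1+1=2"));
        System.out.println(matchesWithLiteralFlag("1+1=2", "1+1=2"));
    }
}
```

Pattern.quote remains useful when the literal is only one part of a larger expression, which the LITERAL flag cannot express.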
Splits the given input at occurrences of this pattern.
If this pattern does not occur in the input, the result is an array containing the input (converted from a CharSequence to a String).
Otherwise, the limit parameter controls the contents of the returned array as described below.
limit | Determines the maximum number of entries in the resulting array, and the treatment of trailing empty strings. |
---|
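The effect of limit can be sketched with a small helper. The names are illustrative. The behaviour shown is that of the standard Java implementation: a positive limit caps the number of entries, zero drops trailing empty strings, and a negative limit keeps them.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    private static final Pattern COMMA = Pattern.compile(",");

    // Splits the input on commas with the given limit.
    static String[] split(String input, int limit) {
        return COMMA.split(input, limit);
    }

    public static void main(String[] args) {
        String input = "a,b,,";
        System.out.println(Arrays.toString(split(input, 0)));  // trailing empty strings dropped
        System.out.println(Arrays.toString(split(input, -1))); // trailing empty strings kept
        System.out.println(Arrays.toString(split(input, 2)));  // at most two entries
    }
}
```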
Returns a string containing a concise, human-readable description of this object. Subclasses are encouraged to override this method and provide an implementation that takes into account the object's type and data. The default implementation is equivalent to the following expression:
getClass().getName() + '@' + Integer.toHexString(hashCode())
See Writing a useful toString method if you intend to implement your own toString method.
Called before the object's memory is reclaimed by the VM. This can only happen once the garbage collector has detected that the object is no longer reachable by any thread of the running application.
The method can be used to free system resources or perform other cleanup before the object is garbage collected. The default implementation of the method is empty, which is also expected by the VM, but subclasses can override finalize() as required. Uncaught exceptions which are thrown during the execution of this method cause it to terminate immediately but are otherwise ignored.
Note that the VM does guarantee that finalize() is called at most once for any object, but it doesn't guarantee when (if at all) finalize() will be called. For example, object B's finalize() can delay the execution of object A's finalize() method and therefore it can delay the reclamation of A's memory. To be safe, use a ReferenceQueue, because it provides more control over the way the VM deals with references during garbage collection.
Throwable |
---|