Debian Tutorial
Chapter 11 - Text tools
head, tail, grep, wc, tr, sed, perl and so on
11.1 Regular expressions
A regular expression is a pattern that describes a set of strings. The pattern
can be used to search through a file by looking for text that matches
it. Regular expressions are analogous to shell wildcards
(see Filename expansion ("Wildcards"), Section 6.6), but they are both more
complicated and more powerful.
A regular expression is made up of text and metacharacters. A
metacharacter is just a character with a special meaning. The metacharacters
covered in this section are: . * [] - \ ^ $ ( ) |
If a regular expression contains only text (no metacharacters), then it matches
that text. For example, the regular expression 'my regular
expression' matches the text 'my regular expression', and
nothing else. Regular expressions are usually case-sensitive.
You can use the egrep command to display all lines in a file that
contain a match for a regular expression. Its syntax is:
egrep 'regexp' filename1 ... [16]
For example, to find all lines in the GPL which contain the word GNU, you type:
egrep 'GNU' /usr/doc/copyright/GPL
egrep will print the lines to standard output.
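If you want to try this without depending on where the GPL lives on your system, here is a minimal sketch that runs the same kind of search on a throwaway file. The file contents are made up for illustration, mktemp is assumed to be available, and grep -E is used as the equivalent spelling of egrep. The -c option asks for a count of matching lines instead of the lines themselves.

```shell
# Create a small sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
printf '%s\n' \
  'GNU is not Unix' \
  'the gnu is an antelope' \
  'free software' > "$sample"

# Print every line that contains the text GNU (the search is case-sensitive,
# so the lowercase "gnu" on the second line does not count).
matches=$(grep -E 'GNU' "$sample")
echo "$matches"    # GNU is not Unix

# Count the matching lines instead of printing them.
count=$(grep -E -c 'GNU' "$sample")
echo "$count"      # 1

rm -f "$sample"
```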
If you want all lines which contain freedom, followed by some
indeterminate text, followed by GNU, you can do:
egrep 'freedom.*GNU' /usr/doc/copyright/GPL
The . means "any character"; the * means
"zero or more of the preceding thing," in this case "zero or
more of any character." So .* matches pretty much any text at
all. egrep only matches on a line-by-line basis, so
freedom and GNU have to be on the same line.
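The line-by-line behaviour is easy to see on a few lines of sample text. This is only a sketch: the text is invented, and grep -E stands in for egrep. Note that the two words must appear in that order on one line, and that .* is also happy to match nothing at all.

```shell
# Hypothetical sample text, one candidate per line.
text='freedom to share GNU software
we value freedom
and the GNU project
GNU came before freedom here
freedomGNU'

# Keep only lines where "freedom" is followed, on the same line, by "GNU".
same_line=$(printf '%s\n' "$text" | grep -E 'freedom.*GNU')
echo "$same_line"
# freedom to share GNU software
# freedomGNU
```

Lines two and three are skipped because the words are split across them, and line four is skipped because GNU comes first.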
Here's a summary of regular expression metacharacters:
- .
-
Matches any single character except newline.
- *
-
Matches zero or more occurrences of the preceding thing. So the expression
a* matches zero or more lowercase a characters, and .*
matches zero or more characters of any kind.
- [characters]
-
The brackets must contain one or more characters; the whole bracketed
expression matches exactly one character out of the set. So [abc]
matches one a, one b, or one c; it does
not match 0 characters, and it does not match a character other than these
three.
- ^
-
Anchors your search at the beginning of the line. The expression
^The matches The only at the beginning of a line;
there can't be spaces or other text before The. If you want to
allow leading spaces, you can permit 0 or more space characters like this:
'^ *The' (a caret, a space, an asterisk, then The).
- $
-
Anchors at the end of the line. end$ requires the text
end to be at the end of the line, with no intervening spaces or
text.
- [^characters]
-
^ reverses the sense of a bracketed character list when it is the first
character inside the brackets. So [^abc] matches any single character
except a, b, or c.
- [character-character]
-
You can include ranges in a bracketed character list. To match any lowercase
letter, use [a-z]. You can have more than one range; so to match
the first three or last three letters of the alphabet, try
[a-cx-z]. To get any letter, any case, try [a-zA-Z].
You can mix ranges with single characters and with the ^
metacharacter; for example, [^a-zBZ] means "anything except a
lowercase letter, capital B, or capital Z."
- ()
-
You can use parentheses to group parts of the regular expression, just as you
do in a mathematical expression. For example, (ab)* matches zero or more
repetitions of ab.
- |
-
| means "or": you can use it to provide a series of
alternative expressions. Usually you want to put the alternatives in
parentheses, like this: c(ad|ab|at) matches cad or
cab or cat. Without the parentheses, cad|ab|at would match
cad or ab or at instead.
- \
-
Escapes the special character that follows it; if you want to find a
literal *, you type \*. The backslash means to ignore *'s usual
special meaning.
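The summary above can be checked by hand with a small helper. This is just a sketch: check is a made-up function name, grep -E stands in for egrep, and -q tells grep to report success or failure without printing anything.

```shell
# Hypothetical helper: prints "yes" if the text in $2 contains a match
# for the regexp in $1, and "no" otherwise.
check() {
  if printf '%s\n' "$2" | grep -E -q -- "$1"; then
    echo yes
  else
    echo no
  fi
}

check '^The'   'The end'      # yes: The is at the start of the line
check '^The'   ' The end'     # no:  a leading space gets in the way
check '^ *The' '   The end'   # yes: zero or more spaces are allowed first
check 'end$'   'The end'      # yes: end is at the end of the line
check 'end$'   'The end '     # no:  a trailing space follows end
check '[abc]'  'cup'          # yes: c is in the set
check '[abc]'  'xyz'          # no:  none of a, b, c appears
check '[^abc]' 'abc'          # no:  every character is a, b, or c
check '[^abc]' 'abcd'         # yes: d is outside the set
check 'c(ad|ab|at)' 'at'      # no:  the leading c is required
check 'cad|ab|at'   'at'      # yes: at is an alternative on its own
check '\*' '2*3'              # yes: a literal asterisk is present
check '\*' '23'               # no:  there is no asterisk to find
```

The single quotes around each regexp matter: they stop the shell from interpreting characters such as *, | and \ before the search tool sees them.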
Here are some more examples, to help you get a feel for things:
- c.pe
-
matches cope, cape, caper
- c\.pe
-
matches c.pe, c.per
- sto*p
-
matches stp, stop, stoop
- car.*n
-
matches carton, cartoon, carmen
- xyz.*
-
matches xyz and anything after it; some tools, like
egrep, only match until the end of the line.
- ^The
-
matches The at the beginning of a line
- atime$
-
matches atime at the end of a line
- ^Only$
-
matches a line which consists solely of the word Only: no
spaces, no other characters, nothing. Only Only is allowed.
- b[aou]rn
-
matches barn, born, burn
- Ver[D-F]
-
matches VerD, VerE, VerF
- Ver[^0-9]
-
matches Ver followed by any non-digit
- the[ir][re]
-
matches their, therr, there, theie
- [A-Za-z][A-Za-z]*
-
matches any word that consists only of letters and contains at least one
letter. It will not match digits or spaces.
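A few of these examples can be tried against a word list. The sketch below makes some assumptions: the word lists are invented, show is a made-up helper, grep -E stands in for egrep, and paste is used only to join the matching lines with commas. Remember that the search tool reports a line if any part of it matches, which is why caper turns up for c.pe.

```shell
# Hypothetical word list, one word per line.
words='cope
cape
caper
c.pe
cup
stp
stop
stoop
step
barn
born
burn
bern'

# Hypothetical helper: print the lines of $2 that match the regexp in $1,
# joined with commas.
show() {
  printf '%s\n' "$2" | grep -E -- "$1" | paste -s -d ',' -
}

show 'c.pe'     "$words"   # cope,cape,caper,c.pe
show 'c\.pe'    "$words"   # c.pe
show 'sto*p'    "$words"   # stp,stop,stoop
show 'b[aou]rn' "$words"   # barn,born,burn

# Hypothetical lines for the whole-line examples.
lines='Only
Only this
hello
abc 123
42'

show '^Only$' "$lines"               # Only
show '[A-Za-z][A-Za-z]*' "$lines"    # Only,Only this,hello,abc 123

# With -x the regexp must match the entire line, not just part of it.
whole=$(printf '%s\n' "$lines" | grep -E -x '[A-Za-z][A-Za-z]*' | paste -s -d ',' -)
echo "$whole"                        # Only,hello
```

The last two commands show a practical point: the letters-only expression matches the word inside a line, so a line such as "abc 123" is still printed unless you anchor the expression or ask for a whole-line match with -x.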
Debian Tutorial
17 June 2006
Havoc Pennington [email protected]