Hour 9: Regular Expressions: Regular Expressions--Basics

In the previous section you learned enough about regular expressions to use them as simple patterns. This should be enough in most cases. However, at some point you will get into a situation where you need more power. For that reason, here comes the whole story about regular expressions and what you can do with them.

Regular expressions describe characteristics that must be fulfilled when you, for example, search for a string. If this sounds abstract to you, think about how you would search for words containing exactly eight letters. You would not write down all possible combinations of eight letters and search for them one at a time, would you? I guess not; you would most certainly do something a bit smarter. What you would do is this: Search for words that have the characteristic that they have eight letters.

Likewise, if you knew that somewhere in your text you have a word starting with the, you would not do an ordinary search for the, because this would match all kinds of words that contain the substring the, such as aesthetic, farther, and smoothed. No, you would search for words having the characteristic that they start with the letters the.

There are a number of characters that have special meanings in regular expressions. These include $, ^, ., *, +, ?, [, ], and \. If you want to use one of these characters without its special meaning, you have to prefix it with a backslash. When you prefix a special character with a backslash, you are said to escape it. The simplest regular expression is one without any of these special characters (or with the given characters escaped). Therefore the following strings are not special regular expressions (that is, they have no other meaning than text):

Yes it works!		This matches the string "Yes it works!" No special characters are in this string.
Don't Leave		Again, no special characters.
100\$\+200\$=\?		This matches the string 100$+200$=?. All the special characters ($, +, and ?) have been escaped.
Enough\.		This matches the string Enough.. The dot, which has a special regular expression meaning, has been escaped.
\\\$		This matches the text \$. First, the backslash has been escaped. Next, the dollar sign has been escaped, which results in a backslash and a dollar sign.
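If you want to try the table above for yourself, here is a minimal sketch in Python. It assumes Python's re module, which happens to treat $, +, ?, ., and \ as special in the same way; the helper name matches_literally is made up for this illustration.

```python
import re

# Each pair is (pattern from the table above, text it should match literally).
examples = [
    (r"Yes it works!", "Yes it works!"),
    (r"Don't Leave", "Don't Leave"),
    (r"100\$\+200\$=\?", "100$+200$=?"),
    (r"Enough\.", "Enough."),
    (r"\\\$", "\\$"),  # the text is a backslash followed by a dollar sign
]

def matches_literally(pattern, text):
    """Return True if the whole of text is matched by pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

all_match = all(matches_literally(p, t) for p, t in examples)

# Without the backslash the dot matches any character, so "Enough!" slips through.
unescaped_dot_matches = matches_literally(r"Enough.", "Enough!")
escaped_dot_matches = matches_literally(r"Enough\.", "Enough!")
```

The last two lines show why escaping matters: only the escaped dot insists on a real period.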

Repeating Elements

Regular expressions are built using a number of special characters that add special meaning to the string. The first of these characters is the asterisk (*). You might already know it from the shell, but be warned: it does not behave entirely like the asterisk from the shell.

The asterisk must be used with another regular expression and means repeat the other regular expression a number of times (even zero times). The string a is a regular expression that matches a single a. The regular expression a* matches, therefore, the empty string, the string a, the string aa, the string aaa, and so on.
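The following sketch, written in Python on the assumption that its re module treats a* the same way, lists which of a few sample strings a* accepts. It also contrasts the result with a shell-style pattern, using Python's fnmatch module as a stand-in for the shell.

```python
import fnmatch
import re

star = re.compile(r"a*")

# fullmatch succeeds only when the entire string is covered by the pattern.
candidates = ["", "a", "aa", "aaa", "b", "aab"]
accepted = [s for s in candidates if star.fullmatch(s)]

# In a shell-style pattern, * stands for "any text", so a* accepts "abc".
shell_style = fnmatch.fnmatchcase("abc", "a*")

# In a regular expression, a* means "zero or more a's", so "abc" is rejected.
regex_style = star.fullmatch("abc") is not None
```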

Combining Regular Expressions

Until now, the regular expressions you have seen have been very dull. The reason for this is that you still haven't learned how to combine two regular expressions.

If you have two regular expressions A and B, the concatenated regular expression AB means that a given text should match first A then B. If you, for example, have a regular expression a*, this regular expression matches a number of a's, and likewise b* matches a number of b's. Therefore a*b* would match every string which starts with any number of a's, and ends in any number of b's, with nothing in between: ab, aaa, bbb, aabbbbbb, and so on.
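As a quick check of the concatenation rule, this Python sketch (assuming Python's re module, where a*b* is written the same way) tests a few strings against a*b*. Note that the empty string is accepted too, because both parts may repeat zero times.

```python
import re

a_then_b = re.compile(r"a*b*")

candidates = ["ab", "aaa", "bbb", "aabbbbbb", "", "ba", "abab"]

# Keep only the strings that consist of a's followed by b's and nothing else.
accepted = [s for s in candidates if a_then_b.fullmatch(s)]
rejected = [s for s in candidates if not a_then_b.fullmatch(s)]
```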

You can likewise build a regular expression that matches either A or B, namely the expression A\|B (that is the regular expression A concatenated with a backslash, and a pipe symbol concatenated with the regular expression B). The regular expression a*\|b* matches therefore a number of a's or a number of b's but not mixed: a, b, aa, bb, and so on.
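Here is a small sketch of alternation in Python. Be aware of one difference: Python's re module writes the "or" operator as a plain |, whereas Emacs writes it as \|, so the Emacs expression a*\|b* becomes a*|b* below.

```python
import re

# Python spelling of the Emacs regular expression a*\|b*
a_or_b = re.compile(r"a*|b*")

candidates = ["a", "b", "aa", "bb", "ab", "ba"]

# Only strings made purely of a's or purely of b's are accepted.
accepted = [s for s in candidates if a_or_b.fullmatch(s)]
```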

You might now very well ask, How do I create a regular expression that matches a number of a's and b's mixed together, as in the example baaabbaabb? The regular expression a\|b matches an a or a b. Does a\|b* match a number of a's mixed with b's, then? If you think so, please let me know what a regular expression that matches one a or a number of b's looks like. (You will most likely suggest the same regular expression, namely a\|b*.)

If you are a computer geek you should immediately recognize the preceding problem as a matter of precedence. If on the other hand you are not a computer geek, I can tell you that the preceding problem is equivalent to the problem of telling whether the mathematical expression 3 + 4 * 5 gives the result 23 or 35 (The result is 23!). In mathematics the rule is that you must evaluate * before you evaluate +. Thus a subresult in the preceding expression is 3 + 20 (and not 7 * 5). If you are bad at remembering which one is evaluated first, you can always use parentheses to indicate your intention. Thus 3 + 4 * 5 is equal to 3 + (4 * 5). Likewise in regular expressions, you can use parentheses to indicate which group a given operator works on. Grouping in regular expressions is done by surrounding the group with \( and \).

The regular expression \(a\|b\)*, therefore, means a number of elements that can be matched by the regular expression a\|b.

It is beyond the scope of this book to tell you the whole truth about the rules for when it is necessary to use parentheses and when it is not. A general rule of thumb that you can use is that *, +, and ? need their arguments enclosed in parentheses unless the argument is a single letter. Thus it is not necessary to use parentheses in a*, but if you need to match a number of abc's--that is, abc, abcabc, abcabcabc, and so on--you need parentheses, as in the regular expression \(abc\)+. If it said abc+, it would mean an a, a b, and one or more c's.
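The precedence rules are easy to verify with a short experiment. The sketch below uses Python, where grouping and alternation are written ( ) and | without backslashes (Emacs writes \( \) and \|); apart from that spelling, the expressions are the ones discussed above.

```python
import re

def accepts(pattern, text):
    """Return True if pattern matches the whole of text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

mixed = "baaabbaabb"

# Without grouping, * binds only to the b: "one a, or any number of b's".
ungrouped_mixed = accepts(r"a|b*", mixed)
ungrouped_single_a = accepts(r"a|b*", "a")
ungrouped_bs = accepts(r"a|b*", "bbb")

# With grouping, * repeats the whole alternative: "any mix of a's and b's".
grouped_mixed = accepts(r"(a|b)*", mixed)

# The same effect with +: (abc)+ repeats abc, whereas abc+ repeats only the c.
grouped_abcabc = accepts(r"(abc)+", "abcabc")
ungrouped_abcabc = accepts(r"abc+", "abcabc")
ungrouped_abccc = accepts(r"abc+", "abccc")
```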

Hey! Now you have in fact learned a lot about regular expressions! Before I continue, Figures 9.1 and 9.2 show you a few regular expressions together with explanations, so you can get a feel for it.

Figure 9.1
A regular expression that can be used to match section headings in LaTeX.

Figure 9.2
A regular expression that can be used to match letters sent to or from a .com email address.

All in all, the regular expression in Figure 9.2 matches any text starting with From:, To:, or Sender:, and ending in .com, with any text in between.
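To get a feel for an expression of that kind, here is a sketch in Python. The pattern below is only a guess at the shape described in the preceding sentence, not a copy of the figure, and the sample header lines are invented. In Emacs syntax the same guess would be spelled ^\(From\|To\|Sender\):.*\.com$.

```python
import re

# Hypothetical stand-in for the pattern in Figure 9.2 (Python spelling).
header_pattern = re.compile(r"^(From|To|Sender):.*\.com$")

# Invented sample header lines.
headers = [
    "From: alice@example.com",
    "To: bob@example.org",
    "Sender: list@example.com",
    "Subject: see example.com",
    "From: carol@example.community",
]

# Keep the lines that start with one of the three fields and end in .com
com_headers = [h for h in headers if header_pattern.search(h)]
```

The escaped dot and the trailing $ are what keep example.community out of the result.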

Character Groups

There is one final construct you have to learn before you know every way of constructing a regular expression that matches text. That is the character group.

If you know the [...] construct from the shell, you are lucky, because that is exactly the same! If you don't, keep on reading.

Character groups are a way of defining a group of possible characters. The simplest character group is one where you list a number of characters. An example of this is [abcde]. This group matches any of the characters a, b, c, d, or e.

Within character groups, you can also define a range of characters, such as [0-9], which matches a single digit. Multiple ranges can be specified and mixed with single letters, such as [a-zA-Z_], which matches any letter or an underscore.

Character groups can be negated; that is, you can specify a group of characters that must not match. To do this, include a caret as the first character after the opening bracket. An example of this is [^a-zA-Z_], which matches any character that is neither a letter nor an underscore.

To use a caret in the group as a character to match, put it somewhere that is not the first position. To include a dash in the list, put it before the closing bracket. Finally, to include an ending square bracket, put it next to the opening one, such as []A-Z], which matches either a closing bracket or a capitalized letter.
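The rules for character groups can be tried out with the following Python sketch. It assumes Python's re module, which follows the same conventions for ranges, negation, and the placement of the caret, dash, and closing bracket; the helper first_match is made up for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

listed = first_match(r"[abcde]", "xyzd")              # a plain list of characters
digit = first_match(r"[0-9]", "room 42")              # a range
letter_or_us = first_match(r"[a-zA-Z_]", "123_abc")   # ranges mixed with a single character
negated = first_match(r"[^a-zA-Z_]", "abc_def-ghi")   # anything except letters and underscore
caret = first_match(r"[a^]", "x^y")                   # caret not in first position is literal
dash = first_match(r"[a-c-]", "x-y")                  # dash just before ] is literal
bracket = first_match(r"[]A-Z]", "x]Y")               # ] right after [ is literal
```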

The regular expression in Figure 9.3 matches an assignment (or at least part of one) in Pascal. The focus here is that variable names in Pascal can include letters, digits, and underscores, with the exception that a letter must be the first character of the word.
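As an illustration of that rule, here is a sketch in Python. The pattern is an assumption built only from the description above (a letter, then any letters, digits, or underscores, then := with optional spaces before it); it is not taken from the figure, and the sample statements are invented.

```python
import re

# Hypothetical pattern: a Pascal-style variable name followed by :=
identifier = r"[a-zA-Z][a-zA-Z0-9_]*"
assignment = re.compile(identifier + r" *:=")

# Invented sample statements.
statements = ["count := 1", "total_sum2:=0", "2x := 3", "_tmp := 1"]

# re.match anchors at the start of each statement, so the name must begin there.
looks_like_assignment = [s for s in statements if assignment.match(s)]
```

The last two statements are rejected because their first character is not a letter.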

Matching Positions

So far the regular expressions have been concerned with matching text, but you are also often interested in anchoring this text to a position one way or another. The anchors include the beginning and end of a line or string, and the beginning and end of a word.

Some functions that use regular expressions match the regular expression against a string, whereas others match the regular expression against part of a buffer. Matching a header field from an email letter can be an example of the first kind, whereas regular-expression search is an example of the second kind.

The characters ^ and $ anchor the rest of the regular expression, respectively, to the beginning or the end of the string being matched or, in case the operation works on a buffer, to line-start or line-end.

Thus the regular expression ^a*$ matches a string only if the string contains only a's. If it is used in regular-expression-search, it matches only lines that contain only a's (or, of course, the empty line).
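Both behaviors can be imitated in Python, as the sketch below shows. It assumes that Python's re.MULTILINE flag is a fair stand-in for matching in a buffer, because it makes ^ and $ anchor at the start and end of every line; the sample text is invented.

```python
import re

# Matching against a single string: ^ and $ anchor to the string's ends.
only_as = re.compile(r"^a*$")
string_results = {s: bool(only_as.search(s)) for s in ["aaa", "", "aab", "baa"]}

# Matching in a multi-line text: with re.MULTILINE the anchors work per line.
buffer_text = "aaa\nabc\n\naa\nba"
per_line = re.compile(r"^a*$", re.MULTILINE)

# Each result is the text of one matching line; "" is the empty line.
matching_lines = per_line.findall(buffer_text)
```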

There also are two other regular expressions that match a location, namely \< and \>, which match, respectively, the beginning of a word and the end of a word. A letter of a word is, in this context, defined as the regular expression [a-zA-Z0-9] or, in other words, ordinary letters from the alphabet and digits.

The regular expression \<search matches any word that starts with search, such as search, searching, and searches. The regular expression search\> matches any word that ends in search, such as search and research.
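Python's re module has no \< and \>, but its \b matches at either edge of a word, so it can serve as an approximation for this experiment. Keep in mind that this is a substitution, and that Python also counts the underscore as a word character. The [a-z]* parts are added only so that the whole word shows up in the result.

```python
import re

text = "search searching searches research"

# Approximates \<search: "search" must begin at a word edge.
starts_with_search = re.findall(r"\bsearch[a-z]*", text)

# Approximates search\>: "search" must end at a word edge.
ends_with_search = re.findall(r"[a-z]*search\b", text)
```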

Finally there are two expressions that match, respectively, a single word character or a single nonword character. These are \w for word characters and \W for nonword characters. These regular expressions are equal to [a-zA-Z0-9] and [^a-zA-Z0-9], respectively.
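If you experiment with these in another tool, check its definition of a word character first. As an example, the Python sketch below compares the character groups given above with Python's own \w, which (even when restricted to ASCII) also accepts the underscore.

```python
import re

sample = "foo_bar 42!"

# The word-character group as defined in the text above.
word_runs = re.findall(r"[a-zA-Z0-9]+", sample)

# The matching nonword group picks up everything else, one character at a time.
nonword_chars = re.findall(r"[^a-zA-Z0-9]", sample)

# Python's \w treats the underscore as a word character, so foo_bar stays whole.
python_word_runs = re.findall(r"\w+", sample, re.ASCII)
```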

Regular Expression in Your .emacs File

Regular expressions can also be used to configure options. An example is the variable auto-mode-alist, which describes which major mode to load depending on the filename. The filename is described using a regular expression. These configurations are often located in your .emacs file.

If you insert a regular expression into your .emacs file, or if you configure one using the customize library described in Hour 10, "The Emacs Help System and Configuration System," you need to escape all the backslashes. If you, for example, want to use the regular expression ^\.emacs$ to uniquely describe the file .emacs, the regular expression that you must use in the .emacs file is ^\\.emacs$.
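The doubling comes from the string syntax rather than from the regular expression itself, and ordinary (non-raw) Python strings behave the same way. The sketch below uses Python only as an analogy for what happens in the .emacs file; the filenames are invented.

```python
import re

# In an ordinary string literal, \\ stands for one backslash character.
doubled = "^\\.emacs$"

# A raw string shows the regular expression exactly as it reaches the matcher.
as_seen_by_matcher = r"^\.emacs$"

same_pattern = doubled == as_seen_by_matcher
pattern_length = len(doubled)  # counts the characters in ^\.emacs$

matches_dot_emacs = re.search(doubled, ".emacs") is not None
matches_xemacs = re.search(doubled, "xemacs") is not None

# Without the backslash, the dot matches any character, so xemacs is accepted too.
unescaped_matches_xemacs = re.search("^.emacs$", "xemacs") is not None
```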

An alternative to literally inserting the regular expression into your .emacs file is to use the sregex.el library located on the CD. This library lets you write your regular expression in a more readable, though more verbose, way.