Regular expressions are a way to search for substrings ("matches") in strings. The search is performed by applying a "pattern" to the string.
You probably know the '*' and '?' characters used in the dir
command on the DOS command line. The '*' character means "zero or
more arbitrary characters" and the '?' means "one arbitrary
character".
A pattern like "text?.*" will find files like
textf.txt
text1.asp
text9.html
But it will not find files like
text.txt
text.asp
text.html
Regular expressions work in exactly this way. While '*' and '?'
are a very limited set of patterns, regular expressions offer a much broader range of
ways to describe patterns.
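The wildcard behaviour described above can be tried out with a short sketch. It uses Python's fnmatch module as a stand-in for the DOS dir command, which is an assumption made for illustration only; the real command line may treat some edge cases differently.

```python
from fnmatch import fnmatchcase

# File names taken from the two lists above.
names = ["textf.txt", "text1.asp", "text9.html",
         "text.txt", "text.asp", "text.html"]

# '?' has to consume exactly one character before the dot,
# so names such as "text.txt" are rejected.
matched = [name for name in names if fnmatchcase(name, "text?.*")]
print(matched)
```

Only the first three names are printed, because each of them has one extra character between text and the dot.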
Example usages could be:
Any operator or set of operators represents a pattern.
You will often need to match text in which some symbols vary.
For example, you want to find words that start with tom
and are four characters long. The operator that matches any single character is the dot (.).
Thus, the following pattern would match all these words: tom.
This pattern will also find text like tom., tom>, tom!,
etc.
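As a rough illustration, the dot operator behaves the same way in Python's re module. Python is used here only as an assumed test bench, and the sample text is made up.

```python
import re

# Hypothetical sample text with words and punctuation after "tom".
text = "tomb tome tom. tom> tom!"

# The dot matches any single character, including punctuation.
matches = re.findall(r"tom.", text)
print(matches)
```

The result contains tomb and tome, but also tom., tom> and tom!, which is the over-matching described above.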
To prevent the pattern tom. from matching such meaningless text, we
should narrow the search to alphabetic symbols only. This can be done using
character sets. A set is specified with square brackets. Sets may include individual
symbols and ranges. For example, the following set will match any one of the symbols a, t, z
and 8: [atz8]. And this set will match any one lowercase letter: [a-z].
Thus, to limit the previous example to meaningful words, we could write the pattern:
tom[a-z]
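Character sets are written the same way in Python's re module, so the two sets above can be checked with a small sketch. The sample strings are invented for the purpose.

```python
import re

# A range inside a set: only lowercase letters may follow "tom".
lowercase = re.findall(r"tom[a-z]", "tomb tome tom. tom> tom!")
print(lowercase)

# A set of individual symbols: any one of a, t, z or 8.
any_of = re.findall(r"[atz8]", "a8-zq")
print(any_of)
```

The first search keeps only tomb and tome; the second finds a, 8 and z and skips the other symbols.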
Sometimes you need to match any symbol except a few. Writing a large set that lists
all the allowed symbols is impractical, so it is better to use the negation operator ^ inside a set.
For example, the following set will match any one symbol except @: [^\@].
Please note that the symbol @ is escaped because it is not alphanumeric.
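A negated set can be sketched in Python's re module as follows. In Python the escape before @ is optional, which may differ from the engine this text describes; the + qualifier (one or more times) and the address are assumptions added for the example.

```python
import re

# Hypothetical input: split an address into the runs that contain no '@'.
address = "user@example.com"

parts = re.findall(r"[^@]+", address)
escaped_parts = re.findall(r"[^\@]+", address)  # escaped form, same result in Python
print(parts)
```

Both calls return the text before and after the @ sign, because the set matches every symbol except @.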
Regular expressions would be of little use if they could not match text of varying length. For this purpose there are repetition qualifiers, which allow matching nearly any text.
In the previous example, the pattern tom[a-z] would find any
four-symbol word starting with tom, but not tom itself. To make the pattern
match tom as well, we need a qualifier. The qualifier ?
tells the engine to match the preceding pattern 0 or 1 times. The following pattern will match tom
as well: tom[a-z]?
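The effect of the ? qualifier can be checked with a minimal sketch in Python's re module, where ? has the same meaning. The word list is made up, and fullmatch is used so that the whole word has to fit the pattern.

```python
import re

pattern = re.compile(r"tom[a-z]?")

# fullmatch succeeds only if the entire word fits the pattern.
results = {word: bool(pattern.fullmatch(word))
           for word in ["tom", "tomb", "tombs"]}
print(results)
```

Here tom and tomb are accepted, while tombs is rejected because only one optional letter is allowed after tom.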
Before we proceed with the other repetition qualifiers, we should understand one important thing about repetition modes.
Imagine a text that contains several occurrences of a character. For example, one,
two, three, four. This text contains 3 commas. Now we want to instruct the
regular expression engine to "match all characters but stop before a comma".
In greedy mode, the engine will match as many characters as possible and stop before the last comma, so the match is:
one, two, three
In non-greedy mode, the engine will match as few characters as possible and stop before the first comma, so the match is:
one
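The two modes can be compared side by side in Python's re module. This is only a sketch under assumptions: Python writes the greedy qualifier as * and the non-greedy one as *?, and "stop before a comma" is expressed here with a lookahead (?=,), a Python construct that this text does not cover.

```python
import re

text = "one, two, three, four"

# Greedy: take as much as possible, then back off to the last comma.
greedy = re.match(r".*(?=,)", text).group()

# Non-greedy: take as little as possible, stopping at the first comma.
lazy = re.match(r".*?(?=,)", text).group()

print(greedy)
print(lazy)
```

The greedy search yields one, two, three and the non-greedy search yields one.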
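Let us extend the previous example by introducing a new condition: match all text
starting with tom and ending with a full stop. So we need to match tom, then match any
character, repeating until the first full stop is found, and finally match the full stop itself.
The following table shows the corresponding operators: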
| Part | Operator | Comment |
|---|---|---|
| Match tom | tom | A simple text |
| Match any character | . | A dot-operator |
| Repeat the preceding condition until the first occurrence of the next match is found | @ | Repeat qualifier: Match previous pattern 0 or more times (non-greedy). |
| Match a full-stop (a dot) | \. | A dot. The escape tells the engine to treat the dot as an ordinary symbol, not as an operator. |
Thus, the pattern would look like:
tom.@\.
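For comparison, here is a sketch of the same pattern in Python's re module. Python has no @ qualifier, so the non-greedy repetition is written as *? instead; that substitution and the sample sentence are assumptions made for this example.

```python
import re

# Hypothetical text with two full stops.
text = "tomorrow is fine. The day after is not."

# Non-greedy: stop at the first full stop after "tom".
lazy = re.search(r"tom.*?\.", text).group()

# Greedy, for contrast: run on to the last full stop.
greedy = re.search(r"tom.*\.", text).group()

print(lazy)
print(greedy)
```

The non-greedy pattern returns only the first sentence, while the greedy one returns the whole text.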
Say you need to find one of the words macrocoding and macrocode. There are several ways to do that. For example, we can split each word into macrocod+ing and macrocod+e. Now we need a pattern that would match macrocod and then match either ing or e.
When we say "or", we say "or". When a regular expression says
"or", it says "|". Armed with this knowledge, we write: macrocoding|e.
Looks rather ambiguous, doesn't it? What would this expression do: match macrocoding
or e, or match macrocodin followed by g or e? That is why the pattern
operator was introduced.
A pattern operator combines several stand-alone symbols or patterns to
form one pattern. For example, a single symbol e is a pattern. The first
symbol (i) in "ing" is a stand-alone pattern as well.
To form a single pattern from "ing", we should enclose it in curly braces:
{ing}
Now, ing is a single pattern.
This allows us to write the following pattern:
{macrocod{ing}|{e}}
This is a correct, well-formed single pattern.
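The difference between the grouped and the ungrouped alternative can be sketched in Python's re module. Python does not use curly braces for grouping; its nearest counterpart is the non-storing group (?:...), so the pattern below is an assumed translation of the intended meaning, not the syntax described here.

```python
import re

# Grouped: "macrocod" followed by either "ing" or "e".
grouped = re.compile(r"macrocod(?:ing|e)")

# Ungrouped: either the whole word "macrocoding" or a lone "e".
ungrouped = re.compile(r"macrocoding|e")

for word in ["macrocoding", "macrocode", "e"]:
    print(word, bool(grouped.fullmatch(word)), bool(ungrouped.fullmatch(word)))
```

The grouped pattern accepts both words and rejects a lone e, whereas the ungrouped pattern accepts macrocoding and e but not macrocode.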
In terms of matching, expression and pattern operators are the same. The difference is that the text that matches an expression is stored and can be referenced later, for example, when replacing.
For example, we could alter the previous example to make an expression out of the ending
{ing}|{e} by enclosing it in round brackets:
{macrocod({ing}|{e})}
Now we can reference the ending with the operator \1, where 1 is the
number of the expression. We can write a replace pattern that inserts a plus
sign between macrocod and the ending:
macrocod+\1
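The replacement can be tried in Python's re module, where round brackets also store the matched text and \1 refers to it in the replace pattern. The curly braces are left out because Python does not use them for grouping; the input sentence is made up.

```python
import re

# Hypothetical input containing both words.
text = "macrocoding and macrocode"

# (ing|e) stores the ending; \1 re-inserts it after the plus sign.
result = re.sub(r"macrocod(ing|e)", r"macrocod+\1", text)
print(result)
```

This prints macrocod+ing and macrocod+e.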
© 2002-2006