Basic Regular Expressions

September 30, 2007 • 3:13PM
Originally, I considered Regular Expressions to be a bonus skill. Something that was nice if developers had it, but not a necessity. Recently it seems that things would go a lot smoother for me if the people around me knew RegEx, but I've been surprised by how few people do. It might have something to do with there being only one decent book (that I've seen) on the subject.

So, I decided to write this small tutorial to give a basic description of RegEx. I will be using .NET for the examples, but the patterns themselves should be valid in most Regular Expression implementations.

Some of the examples below require a larger block of example text. I'm going to use a paragraph of random Lorem Ipsum text, with some random punctuation thrown in:


string lipsum = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Duis non nulla id sapien molestie pulvinar. 'Nulla vitae risus vel quam imperdiet egestas!' Vestibulum fringilla consequat pede. Quisque tortor lectus, rhoncus ut, posuere vel, rhoncus in, tellus. Fusce mi. Curabitur eget augue sit amet lorem iaculis sagittis. Nam et massa. Nunc sagittis, libero et eleifend aliquet, mi sem varius orci, sit amet sagittis turpis est nec dolor? Nulla facilisi. Proin volutpat erat a sem. Maecenas nibh libero, euismod at, consequat quis, rutrum in, turpis. Aenean erat enim, fermentum a, luctus non, bibendum in, tellus. In ac libero. Suspendisse potenti. Pellentesque tincidunt dignissim mi.";




First, a simple overview of some of the simpler RegEx constructs:

Literal Text

It's simple. Literal text matches exactly. If text is not modified by Regular Expression constructs (punctuation), then it should be taken literally. So a search for the Regex pattern adam will look for my name in a block of text.
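As a minimal sketch of literal matching (the sample sentence is made up for illustration, and the code is written as C# 9 top-level statements, so wrap it in a Main method on older compilers):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample sentence; "adam" is the literal pattern from the text.
string sample = "My name is adam, and adam is also the pattern.";

// A pattern made only of literal characters matches that exact text.
MatchCollection literalMatches = Regex.Matches(sample, "adam");

Console.WriteLine(literalMatches.Count);    // 2
Console.WriteLine(literalMatches[0].Index); // 11
```

Note that no options are passed here, so the search is case-sensitive and would not find Adam.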

[] - Character Classes and Ranges

A character class matches exactly one character, which must be one of the characters listed between the brackets. For example, if I want to match all the vowels in the above text I can use:

[aeiou]

This will match exactly one character (a vowel) in any block of text that it is matched against. I'll use this pattern to illustrate the simplest way to collect all the matches of a pattern.


using System.Text.RegularExpressions;
//this is needed at the very top


RegexOptions options = RegexOptions.IgnorePatternWhitespace;
options |= RegexOptions.IgnoreCase;

Regex regex = new Regex("[aeiou]", options);
MatchCollection mc = regex.Matches(lipsum);



(Please note that to conserve space, I won't repeat the calls to add the RegexOptions, but you can assume that IgnorePatternWhitespace and RegexOptions.IgnoreCase were used in each example.)

After the snippet is run, the MatchCollection object, mc, holds all the vowel matches in the Lorem ipsum text (o, e, i, u, o, o, i and so on). Note that without the RegexOptions.IgnoreCase option, our pattern would need to be [AEIOUaeiou] in order to match the uppercase letters as well.
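To actually look at what the MatchCollection holds, you can loop over it. Here is a small sketch that uses only the first few words of the lipsum text to keep the output short (the variable names are mine, and the code assumes C# 9 top-level statements):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Only the first few words of the lipsum text, to keep the sketch short.
string shortLipsum = "Lorem ipsum dolor sit amet";

Regex vowelRegex = new Regex("[aeiou]", RegexOptions.IgnoreCase);
MatchCollection vowelMatches = vowelRegex.Matches(shortLipsum);

// Each item in the collection is a Match; Value is the matched text.
List<string> vowels = new List<string>();
foreach (Match match in vowelMatches)
{
    vowels.Add(match.Value);
}

string vowelList = string.Join(", ", vowels);
Console.WriteLine(vowelList); // o, e, i, u, o, o, i, a, e
```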

Character classes can also be used to contain ranges. To match against any letter in the alphabet, the following character class can be used:

[A-Za-z]

When matched against the lipsum text, this will match once for each single letter. Some other example character classes are:

[0-9] Any digit

[a-ep-z] Any letter between 'a' and 'e' (inclusive) or between 'p' and 'z' (inclusive)

[A-Za-z0-9] Any letter or digit


Note that the above will literally match one single character each time. If we need to match more characters, we can use additional aspects of Regex to indicate that.
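A quick way to get a feel for these classes is to count how many single-character matches each one produces. This is only a sketch with a made-up sample string. It deliberately does not use RegexOptions.IgnoreCase, so [a-ep-z] only sees lowercase letters here (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text with letters, digits and punctuation.
string text = "Room 42, zone b7!";

// Every match is a single character, so Count is a character count.
int digits = Regex.Matches(text, "[0-9]").Count;                // 4, 2, 7
int letters = Regex.Matches(text, "[A-Za-z]").Count;            // R o o m z o n e b
int lettersOrDigits = Regex.Matches(text, "[A-Za-z0-9]").Count; // both of the above
int partialRange = Regex.Matches(text, "[a-ep-z]").Count;       // z, e, b

Console.WriteLine(digits);          // 3
Console.WriteLine(letters);         // 9
Console.WriteLine(lettersOrDigits); // 12
Console.WriteLine(partialRange);    // 3
```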

First, we will look at the * modifier. This indicates that the preceding match component should be matched zero or more times. (The zero is important because it means that empty strings will match as well.) For example:

The regular expression [a-z]* will match all of the following:

a
d
adam
lorem
abcdefghijklmnopqrstuvwxyz

We can change the * into a + to match the preceding component one or more times. That is the difference between the * and the +. The * can have 0 matches and still satisfy the regular expression, while the + requires at least one physical match to be considered a valid match. For example, in the lipsum text above, the pattern [0-9]* would match, but [0-9]+ wouldn't, since the former is matched by empty strings. It is also important to note that, by default, regular expressions are greedy and try to match as many characters as possible. Both the * and the + will include as many characters in their matches as they can.
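The difference between * and + is easy to see on a string that has no digits at all. The sketch below uses short made-up strings rather than the full lipsum text and assumes C# 9 top-level statements:

```csharp
using System;
using System.Text.RegularExpressions;

string noDigits = "Lorem ipsum"; // contains no digits at all

// * accepts zero digits, so it succeeds with an empty match at position 0.
Match star = Regex.Match(noDigits, "[0-9]*");
// + needs at least one digit, so it fails on the same text.
Match plus = Regex.Match(noDigits, "[0-9]+");

Console.WriteLine(star.Success);      // True
Console.WriteLine(star.Value.Length); // 0
Console.WriteLine(plus.Success);      // False

// Greedy behavior: + takes the whole run of letters, not just the first one.
Match greedy = Regex.Match("adam smith", "[a-z]+");
Console.WriteLine(greedy.Value);      // adam
```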

Additionally, we can also use the ? which will match zero or one of the preceding pattern, essentially making it an "optional match". So:

L?orem?

will find a match in any of the following (in Bored, only the ore portion is matched):

Lorem
Bored
ore
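Here is a small sketch that shows which part of each of those words the pattern actually picks up. No options are used, and the variable names are just for illustration (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

Regex optional = new Regex("L?orem?");

// The pattern needs "ore"; a leading L and a trailing m are optional.
string fromLorem = optional.Match("Lorem").Value; // whole word
string fromBored = optional.Match("Bored").Value; // B and d are left out
string fromOre = optional.Match("ore").Value;     // both optional parts absent

Console.WriteLine(fromLorem); // Lorem
Console.WriteLine(fromBored); // ore
Console.WriteLine(fromOre);   // ore
```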


Additional Modifiers

^ (Caret)

This modifier can be used in two very different ways. The first way is at the beginning of the inside of a character class. If present, the caret negates the meaning of the character class and instead matches ANY character except those inside of the brackets.

If we match the above lipsum text against the pattern [^sjdhflo ]+ (note that it includes the space) we are matching against one or more characters in a row that are not s, j, d, h, f, l, o or space.

If we actually match the above pattern we see many results (239 in all), such as rem, ip, um and so on.
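To keep the output readable, this sketch runs the same negated class against only the first three words of the lipsum text (the shortened string and variable names are my own, and C# 9 top-level statements are assumed):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string shortLipsum = "Lorem ipsum dolor";

// One or more characters in a row that are NOT s, j, d, h, f, l, o or space.
Regex negated = new Regex("[^sjdhflo ]+", RegexOptions.IgnoreCase);

List<string> pieces = new List<string>();
foreach (Match match in negated.Matches(shortLipsum))
{
    pieces.Add(match.Value);
}

string pieceList = string.Join(", ", pieces);
Console.WriteLine(pieceList); // rem, ip, um, r
```

Because of IgnoreCase, the capital L in Lorem is excluded just like a lowercase l would be.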

^ (Caret) Part 2

The ^ can also be used outside of character classes, most commonly at the very beginning of a RegEx pattern. There it anchors the pattern to the start of the text. By default in .NET this means the start of the string; with RegexOptions.Multiline it also matches immediately after each hard line break. So, in the above lipsum text, the pattern ^lorem would only return one match, even though the word appears in the text twice. (Note that the above does not have any hard line breaks - only soft line breaks caused by the formatting of the webpage - so the ^ would only match at the beginning of the entire block of text even with RegexOptions.Multiline.)
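In .NET, whether ^ also matches after a hard line break depends on RegexOptions.Multiline. A minimal sketch, using a made-up two-line string (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// Two lines separated by a hard line break (\n).
string twoLines = "lorem ipsum\nlorem dolor";

// Default: ^ only matches at the very start of the string.
int defaultCount = Regex.Matches(twoLines, "^lorem").Count;

// Multiline: ^ also matches right after each line break.
int multilineCount = Regex.Matches(twoLines, "^lorem", RegexOptions.Multiline).Count;

Console.WriteLine(defaultCount);   // 1
Console.WriteLine(multilineCount); // 2
```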

$

The $ is typically used at the end of a RegEx pattern, in a use that is opposite that of the ^ shown above. The $ indicates that the pattern is anchored to the end of the text (or, with RegexOptions.Multiline, to the end of each line). See the following example:


string s = "Sally sells seashells by the sea";
Regex regex = new Regex("sea$");
MatchCollection mc = regex.Matches(s);



In the above code, mc will only contain one match. This will be the one at the end, since it has the $. If we remove the $, it will instead have two matches. Like so:


string s = "Sally sells seashells by the sea";
Regex regex = new Regex("sea");
MatchCollection mc = regex.Matches(s);



Note that if we change the original string to have proper punctuation at the end (the period) the original sea$ pattern will not match at all.


string s = "Sally sells seashells by the sea.";
Regex regex = new Regex("sea$");
MatchCollection mc = regex.Matches(s);
//mc.Count is now 0!!!



This is because the $ requires the word sea to sit at the very end of the block of text. Because the period is there and we're not looking for it, our pattern won't match!

There are a number of ways to fix this problem, but a general punctuation match will do the trick.


string s = "Sally sells seashells by the sea.";
Regex regex = new Regex("sea[!?.,-]$", options);
MatchCollection mc = regex.Matches(s);



Note the placement of the hyphen (-) at the end of the character class so that it isn't accidentally misinterpreted as a range of characters.
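One possible variation, sketched here as an assumption rather than the only fix, is to make the punctuation optional with a ? so that the same pattern works whether or not the sentence ends in punctuation (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// The trailing ? makes the punctuation character optional.
Regex endsWithSea = new Regex("sea[!?.,-]?$");

int withoutPeriod = endsWithSea.Matches("Sally sells seashells by the sea").Count;
int withPeriod = endsWithSea.Matches("Sally sells seashells by the sea.").Count;
string matchedText = endsWithSea.Match("Sally sells seashells by the sea.").Value;

Console.WriteLine(withoutPeriod); // 1
Console.WriteLine(withPeriod);    // 1
Console.WriteLine(matchedText);   // sea.
```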

Note also the caveat that although the period outside of the character class WOULD match, it wouldn't do exactly what you think. Can you guess what the following example will print?


string s = "Sally sells seashells by the seaX";
Regex regex = new Regex("sea.$", options);
MatchCollection mc = regex.Matches(s);

if (mc.Count == 0)
   Response.Write("0 matches!");
else if (mc.Count == 1)
   Response.Write("1 match!");



If you guessed "1 match!", you're correct - but do you know why? The answer lies in our next modifier.

. (Period)

The . is used to represent ANY character (except, by default, a newline), but at least one character must be in the matching position. It's really just a placeholder to say that "something" has to be in the position indicated.

In the example given above, the X fills the position that the . is in, so we get our "1 match!".
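A few quick checks make the placeholder behavior concrete. This is a sketch with short made-up inputs. It also shows that, in .NET, the . does not match a newline unless RegexOptions.Singleline is set (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// The X fills the position of the dot.
bool withX = Regex.IsMatch("seaX", "sea.$");
// Nothing follows "sea", so the dot has nothing to match.
bool nothingAfter = Regex.IsMatch("sea", "sea.$");
// By default the dot refuses to match a newline character...
bool newlineDefault = Regex.IsMatch("sea\n", "sea.$");
// ...unless Singleline is turned on.
bool newlineSingleline = Regex.IsMatch("sea\n", "sea.$", RegexOptions.Singleline);

Console.WriteLine(withX);             // True
Console.WriteLine(nothingAfter);      // False
Console.WriteLine(newlineDefault);    // False
Console.WriteLine(newlineSingleline); // True
```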


Escaped Characters and Shortcuts

The \ backslash is an extremely powerful character in RegEx pattern matching with a plethora of uses. First, it is used to escape special characters to be taken as literals within Regular Expression patterns. For example, if we were to search for text within parentheses, we would need to escape the parentheses, since they are used within RegEx patterns to form groups, as I'll explain below.

To search for the word Trunks in parentheses the pattern would be:

\(Trunks\)

Please note the following example, which shows a common problem for beginners using RegEx in .NET:


string s = "This sentence is about my cat (Trunks).";
Regex regex = new Regex("\\(Trunks\\)");
MatchCollection mc = regex.Matches(s);



Note that in the above code I used two backslashes before each parenthesis. Can you figure out why? Here's an alternative way of writing it which might give you a hint:


Regex regex = new Regex(@"\(Trunks\)");



In .NET, as well as many other languages, characters within strings can be escaped by the backslash character. \n is the newline, \t is the tab, \r is the carriage return and so on. The string literal that you're using for the RegEx pattern is first interpreted by the C# compiler, so any escape sequences will already have been processed by the time the RegEx parser takes control and the pattern will not match correctly. (For a sequence that C# does not recognize, such as \(, the compiler reports an error instead.)

In order to get around that problem, we first escape the backslash itself, by using a \\ construct. As mentioned above (and below) the parenthesis is used as a special character in RegEx and needs to be escaped if you intend to use it as a literal.

In the pattern \\(Trunks\\), the backslash is first escaped as a .NET string so that the RegEx parser actually sees \(Trunks\). This escapes the parentheses in the RegEx parsing and correctly finds them. As an additional note, if you're unclear as to why the @"\(Trunks\)" works, it is a C# verbatim string. In a verbatim string, backslashes are taken literally and no escape sequences are processed (the one exception is the double-quote character, which you represent by doubling up ""), and the string can span multiple lines.

Please remember, this is for our .NET example and won't apply to all languages. If you're not using .NET, check with your language reference to see if you need to escape the backslashes.
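If you want to convince yourself that the two ways of writing the pattern are the same string, you can compare them directly. The sketch below also uses Regex.Escape, a built-in .NET helper that adds the RegEx-level backslashes for you (the variable names are mine, and C# 9 top-level statements are assumed):

```csharp
using System;
using System.Text.RegularExpressions;

string s = "This sentence is about my cat (Trunks).";

string doubled = "\\(Trunks\\)"; // regular string: each \\ becomes one \
string verbatim = @"\(Trunks\)"; // verbatim string: backslashes kept as typed

bool sameString = doubled == verbatim;
bool found = Regex.IsMatch(s, verbatim);

// Regex.Escape builds the escaped pattern from the literal text.
string escaped = Regex.Escape("(Trunks)");

Console.WriteLine(sameString); // True
Console.WriteLine(found);      // True
Console.WriteLine(escaped);    // \(Trunks\)
```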

RegEx Escaped Characters

RegEx has its own escape strings that are mostly used as shortcuts for large ranges of characters. Like many other languages, you escape RegEx characters by using a backslash.

Note that in .NET you need to take the above caveat into consideration when using these, so, for example, to search for a literal backslash, you would need to escape it in both .NET and RegEx: \\\\. This is because first the C# compiler will turn it into \\, then RegEx will read that as an escaped \ and search for the literal value.
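The double layer of escaping is easier to see when you look at the length of the resulting pattern string. A minimal sketch, using a made-up Windows path as the text to search (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical path containing two literal backslashes.
string path = @"C:\temp\file.txt";

string regularPattern = "\\\\"; // four typed backslashes...
string verbatimPattern = @"\\"; // ...are the same two-character pattern

int patternLength = regularPattern.Length;
bool samePattern = regularPattern == verbatimPattern;

// The two-character pattern \\ matches ONE literal backslash in the text.
int backslashCount = Regex.Matches(path, regularPattern).Count;

Console.WriteLine(patternLength);  // 2
Console.WriteLine(samePattern);    // True
Console.WriteLine(backslashCount); // 2
```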

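Here are a few examples (the bracketed equivalents hold for plain ASCII text; by default .NET's \d and \w also match Unicode digits and letters):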

\d The same thing as [0-9].

\w The same thing as [a-zA-Z_0-9].

\s This will match against any whitespace.

The capital versions offer negations...

\D The same thing as [^0-9].

\W The same thing as [^a-zA-Z_0-9].

\S This will match against any character EXCEPT whitespace.

There are many other, less used, escaped characters. Be sure to check with your RegEx implementation's documentation (for .NET, see Microsoft's regular expression documentation).

Note again that the backslash should be double-escaped when necessary.
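As a rough illustration of these shortcuts, the sketch below counts matches in a made-up sample string. It uses verbatim strings so that the backslashes only need to be written once (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample text.
string sample = "Order 66 shipped_on 2007-09-30";

int digitRuns = Regex.Matches(sample, @"\d+").Count;    // 66, 2007, 09, 30
int wordRuns = Regex.Matches(sample, @"\w+").Count;     // Order, 66, shipped_on, 2007, 09, 30
int spaces = Regex.Matches(sample, @"\s").Count;        // the three spaces
int nonSpaceRuns = Regex.Matches(sample, @"\S+").Count; // Order, 66, shipped_on, 2007-09-30

Console.WriteLine(digitRuns);    // 4
Console.WriteLine(wordRuns);     // 6
Console.WriteLine(spaces);       // 3
Console.WriteLine(nonSpaceRuns); // 4
```

Notice that the underscore counts as a \w character, so shipped_on is a single match, while the hyphens in the date split it into three.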


Grouping

In the teaching of Regular Expressions, Grouping is usually considered an advanced topic and not taught in a first lesson. Personally, I don't see the point of learning how to use RegEx unless you can use it!

A group is automatically formed every time a matching pair of parentheses is used in a pattern. Each match has an automatic group of the entire match and then one for each subsequent pair of parentheses, as shown in the following example:


string s = "Mississippi";
Regex regex = new Regex("([aeiou][s]+)");
MatchCollection mc = regex.Matches(s);



The MatchCollection.Count property would be 2, indicating that the pattern matched twice (on iss both times, since we're matching any vowel followed by the letter s one or more times). If we examine the MatchCollection, we see it is made of Match objects, with a Groups property, another collection.

In all matches, the first Group (mc[0].Groups[0]) always contains the entire match, in this case iss. Actually, in this case, the second Group (mc[0].Groups[1]) also contains iss. This is because we're grouping the entire match and the result of our match was iss. You would see the exact same results in the second Match object since iss appears twice in the word Mississippi. Hence both mc[1].Groups[0] and mc[1].Groups[1] would contain the string iss.

If we change the parentheses slightly and only group the vowel:


string s = "Mississippi";
Regex regex = new Regex("([aeiou])[s]+");
MatchCollection mc = regex.Matches(s);



The MatchCollection.Count property would still be 2 with the same exact matches. Also, the Groups[0] property would still contain iss, since that's our entire match. However, the Groups[1].Value would now contain only the i, since that's the entire match inside our parentheses.

We'll change it slightly one more time:


string s = "Mississippi";
Regex regex = new Regex("([aeiou])(s)s");
MatchCollection mc = regex.Matches(s);



Our pattern is trying to match a vowel followed by two letter s characters. We are grouping the vowel by itself and additionally grouping the first of two s characters.

When we run the above code we get two matches, like we expect to. Each one matches against iss. If we examine the Groups collection, we see that Groups[1] contains the i and Groups[2] contains the first s.
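To inspect the groups yourself, you can walk the MatchCollection and read each Group's Value. A small sketch (the summary format is my own choice, and C# 9 top-level statements are assumed):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

Regex groupRegex = new Regex("([aeiou])(s)s");
MatchCollection groupMatches = groupRegex.Matches("Mississippi");

List<string> summary = new List<string>();
foreach (Match match in groupMatches)
{
    // Groups[0] is the whole match; Groups[1] and Groups[2] are the
    // first and second pairs of parentheses, counted left to right.
    summary.Add(match.Index + ": " + match.Groups[0].Value
        + " / " + match.Groups[1].Value
        + " / " + match.Groups[2].Value);
}

foreach (string line in summary)
{
    Console.WriteLine(line);
}
// 1: iss / i / s
// 4: iss / i / s
```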

The Match object is also helpful if you want to modify text in the original string that you've located using a Regular Expression, since it tells you where each match was found.

In the below example, I'm going to very simply replace every instance of the pattern r[aeiou]+m with the word BLAH. The pattern will match one or more vowels between the letters r and m. (If you're unclear as to why, re-read the sections above!)


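string l = lipsum;
Regex regex = new Regex("r[aeiou]+m");

//find the first match in the working copy
Match m = regex.Match(l);
while (m.Success)
{
   //keep the text before and after the match, with BLAH in between
   l = l.Substring(0, m.Index) + "BLAH" + l.Substring(m.Index + m.Length);
   //search the modified string again
   m = regex.Match(l);
}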



We keep looping as long as a match is found and use the Index property of the Match object to determine where in the original string our match was found. We remove the match and replace it with BLAH.


The resulting string is:

LoBLAH ipsum dolor sit amet, consectetuer adipiscing elit. Duis non nulla id sapien molestie pulvinar. 'Nulla vitae risus vel quam imperdiet egestas!' Vestibulum fringilla consequat pede. Quisque tortor lectus, rhoncus ut, posuere vel, rhoncus in, tellus. Fusce mi. Curabitur eget augue sit amet loBLAH iaculis sagittis. Nam et massa. Nunc sagittis, libero et eleifend aliquet, mi sem varius orci, sit amet sagittis turpis est nec dolor? Nulla facilisi. Proin volutpat erat a sem. Maecenas nibh libero, euismod at, consequat quis, rutBLAH in, turpis. Aenean erat enim, fermentum a, luctus non, bibendum in, tellus. In ac libero. Suspendisse potenti. Pellentesque tincidunt dignissim mi.

Which you can compare with the original, here:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Duis non nulla id sapien molestie pulvinar. 'Nulla vitae risus vel quam imperdiet egestas!' Vestibulum fringilla consequat pede. Quisque tortor lectus, rhoncus ut, posuere vel, rhoncus in, tellus. Fusce mi. Curabitur eget augue sit amet lorem iaculis sagittis. Nam et massa. Nunc sagittis, libero et eleifend aliquet, mi sem varius orci, sit amet sagittis turpis est nec dolor? Nulla facilisi. Proin volutpat erat a sem. Maecenas nibh libero, euismod at, consequat quis, rutrum in, turpis. Aenean erat enim, fermentum a, luctus non, bibendum in, tellus. In ac libero. Suspendisse potenti. Pellentesque tincidunt dignissim mi.
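For comparison, the same kind of substitution can be sketched with the built-in Regex.Replace method, which does the looping for you. In the replacement string, $1 refers to the text captured by the first group. The shortened sample text and variable names here are my own (C# 9 top-level statements assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// A few words borrowed from the lipsum text.
string shortText = "Lorem ipsum, lorem rutrum";

// Replace every match with a fixed word.
string replaced = Regex.Replace(shortText, "r[aeiou]+m", "BLAH");

// Use a group to keep part of the match: $1 is the captured vowel run.
string bracketed = Regex.Replace(shortText, "r([aeiou]+)m", "r[$1]m");

Console.WriteLine(replaced);  // LoBLAH ipsum, loBLAH rutBLAH
Console.WriteLine(bracketed); // Lor[e]m ipsum, lor[e]m rutr[u]m
```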


That's quite a lot to take in for one entry on RegEx and should definitely get any novices a giant step closer to Regular Expression mastery! Look for an advanced discussion (including some .NET specific RegEx constructs) in the future.




