The Regular Expressions Primer is a tutorial for those completely new to regular expressions. To familiarize you with regular expressions, this primer starts with the simple building blocks of the syntax and, through examples, works up to expressions useful for solving everyday problems, including searching for and replacing text.
A regular expression is often called a "regex", "rx" or "re". This primer uses the terms "regular expression" and "regex".
Unless otherwise stated, the examples in this primer are generic and will apply to most programming languages and tools. However, each language and tool has its own implementation of regular expressions, so quoting conventions, metacharacters, special sequences, and modifiers may vary (e.g. Perl, Python, grep, sed, and Vi have slight variations on standard regex syntax). Consult the regular expression documentation for your language or application for details.
Regular expressions are a syntactical shorthand for describing patterns. They are used to find text that matches a pattern, and to replace matched strings with other strings. They can be used to parse files and other input, or to provide a powerful way to search and replace. Here's a short example in Python:
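import re
# Find every word that starts with "w" (upper or lower case).
n = re.compile(r'\bw[a-z]*', re.IGNORECASE)
print(n.findall('will match all words beginning with the letter w.'))
# prints ['will', 'words', 'with', 'w']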
Here's a more advanced regular expression from the Python Tutorial:
import re
# Generate statement parsing regexes.
# "cgs" is not defined in this excerpt; it is presumably a list of
# (opening, closing) comment-delimiter pairs defined elsewhere.
stmts = [r'#\s*(?P<op>if|elif|ifdef|ifndef)\s+(?P<expr>.*?)',
         r'#\s*(?P<op>else|endif)',
         r'#\s*(?P<op>error)\s+(?P<error>.*?)',
         r'#\s*(?P<op>define)\s+(?P<var>[^\s]*?)(\s+(?P<val>.+?))?',
         r'#\s*(?P<op>undef)\s+(?P<var>[^\s]*?)']
patterns = [r'^\s*%s\s*%s\s*%s\s*$' % (re.escape(cg[0]), stmt, re.escape(cg[1]))
            for cg in cgs for stmt in stmts]
stmtRes = [re.compile(p) for p in patterns]
Komodo can accept Python syntax regular expressions in its various Search features.
Komodo IDE's Rx Toolkit can help you build and test regular expressions. See Using Rx Toolkit for more information.
Regular expressions can be used to find a particular pattern, or to find a pattern and replace it with something else (substitution). Since the syntax is the same for the "find" part of the regex, we'll start with matching.
The simplest type of regex is a literal match. Letters, numbers and most symbols in the expression will match themselves in the text being searched; an "a" matches an "a", "cat" matches "cat", "123" matches "123" and so on. For example:
Example: Search for the string "at".
Regex: at
Matches: "at"
Doesn't match: "it", "a-t", "At"
Note: Regular expressions are case sensitive unless a modifier is used.
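As a minimal sketch in Python (the variable names are illustrative only), the literal match above can be checked with re.search, which looks for the pattern anywhere in the text. The second part shows the effect of a case-insensitive flag:

```python
import re

samples = ["at", "it", "a-t", "At"]

# Literal match: the pattern "at" matches only the exact characters "at".
matched = [s for s in samples if re.search(r"at", s)]
print(matched)  # ['at']

# With the IGNORECASE flag, "At" matches as well.
matched_nocase = [s for s in samples if re.search(r"at", s, re.IGNORECASE)]
print(matched_nocase)  # ['at', 'At']
```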
Regex characters that perform a special function instead of matching themselves literally are called "metacharacters". One such metacharacter is the dot ".", or wildcard. When used in a regular expression, "." can match any single character.
Using "." to match any character.
Example: Using '.' to find certain types of words.
Regex: t...s
Matches: "trees", "trams", "teens"
Doesn't match: "trucks", "trains", "beans"
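The wildcard example can be tried in Python as follows. This is a small sketch, assuming each word is tested as a whole with re.fullmatch:

```python
import re

words = ["trees", "trams", "teens", "trucks", "trains", "beans"]

# "t...s": a "t", then any three characters, then an "s".
# re.fullmatch requires the entire word to fit the pattern.
hits = [w for w in words if re.fullmatch(r"t...s", w)]
print(hits)  # ['trees', 'trams', 'teens']
```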
Many non-alphanumeric characters, like the "." mentioned above, are treated as special characters with specific functions in regular expressions. These special characters are called metacharacters. To search for a literal occurrence of a metacharacter (i.e. ignoring its special regex attribute), precede it with a backslash "\". For example:
Regex: c:\\readme\.txt
Matches: "c:\readme.txt"
Doesn't match: "c:\\readme.txt", "c:\readme_txt"
Precede the following metacharacters with a backslash "\" to search for them as literal characters:
^ $ + * ? . | ( ) { } [ ] \
These metacharacters take on a special function (covered below) unless they are escaped. Conversely, some characters take on special functions (i.e. become metacharacters) when they are preceded by a backslash (e.g. "\d" for "any digit" or "\n" for "newline"). These special sequences vary from language to language; consult your language documentation for a comprehensive list.
Quantifiers specify how many instances of the preceding element (which can be a character or a group) must appear in order to match.
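The following Python sketch illustrates escaping. It uses raw strings (the r"..." prefix) so that Python itself does not interpret the backslashes, and re.escape, a standard-library helper that adds the backslashes for you. The variable names are illustrative only:

```python
import re

# Escaped backslash and escaped dot, as in the example above.
path_pattern = r"c:\\readme\.txt"
candidates = [r"c:\readme.txt", r"c:\\readme.txt", r"c:\readme_txt"]

literal_hits = [c for c in candidates if re.fullmatch(path_pattern, c)]
print(literal_hits)

# re.escape builds an escaped pattern from literal text, so the
# metacharacters in it are matched literally.
escaped = re.escape(r"c:\readme.txt")
print(re.fullmatch(escaped, r"c:\readme.txt") is not None)  # True
```

Only the first candidate is kept: the second contains two backslashes, and the third has "_" where the pattern requires a literal ".".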
The "?" matches 0 or 1 instances of the previous element. In other words, it makes the element optional; it can be present, but it doesn't have to be. For example:
Regex: colou?r
Matches: "colour", "color"
Doesn't match: "colouur", "colur"
The "*" matches 0 or more instances of the previous element. For example:
Regex: www\.my.*\.com
Matches: "www.my.com", "www.mypage.com", "www.mysite.com then text with spaces ftp.example.com"
Doesn't match: "www.oursite.com", "mypage.com"
As the third match illustrates, using ".*" can be dangerous. It will match any number of any character (including spaces and non-alphanumeric characters). The quantifier is "greedy" and will match as much text as possible. To make a quantifier "non-greedy" (matching as few characters as possible), add a "?" after the "*". Applied to the example above, the expression "www\.my.*?\.com" would match just "www.mysite.com", not the longer string.
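The difference between greedy and non-greedy matching can be seen in a short Python sketch (the variable names are illustrative):

```python
import re

text = "www.mysite.com then text with spaces ftp.example.com"

# Greedy: ".*" grabs as much as possible, up to the last ".com".
greedy = re.search(r"www\.my.*\.com", text).group()

# Non-greedy: ".*?" stops at the first ".com" that completes the match.
lazy = re.search(r"www\.my.*?\.com", text).group()

print(greedy)  # www.mysite.com then text with spaces ftp.example.com
print(lazy)    # www.mysite.com
```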
The "+" matches 1 or more instances of the previous element. Like "*", it is greedy and will match as much as possible unless it is followed by a "?".
Regex: bob5+@foo\.com
Matches: "bob5@foo.com", "bob5555@foo.com"
Doesn't match: "bob@foo.com", "bob65555@foo.com"
To match a character a specific number of times, add that number enclosed in curly braces after the element. For example:
Regex: w{3}\.mydomain\.com
Matches: "www.mydomain.com"
Doesn't match: "web.mydomain.com", "w3.mydomain.com"
To specify the minimum number of matches to find and the maximum number of matches to allow, use a number range inside curly braces. For example:
Regex: 60{3,5} years
Matches: "6000 years", "60000 years", "600000 years"
Doesn't match: "60 years", "6000000 years"
Quantifier | Description |
? | Matches the preceding element 0 or 1 times. |
* | Matches the preceding element 0 or more times. |
+ | Matches the preceding element 1 or more times. |
{num} | Matches the preceding element exactly num times. |
{min,max} | Matches the preceding element at least min times, but not more than max times. |
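The quantifier examples above can be checked together in Python. This is a sketch only: full_matches is a hypothetical helper written for this illustration, and it tests each candidate as a whole string with re.fullmatch:

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# "?": the "u" is optional.
optional = full_matches(r"colou?r", ["colour", "color", "colouur", "colur"])
print(optional)  # ['colour', 'color']

# "+": one or more "5" characters.
one_or_more = full_matches(
    r"bob5+@foo\.com",
    ["bob5@foo.com", "bob5555@foo.com", "bob@foo.com", "bob65555@foo.com"])
print(one_or_more)  # ['bob5@foo.com', 'bob5555@foo.com']

# "{3}": exactly three "w" characters.
exact = full_matches(
    r"w{3}\.mydomain\.com",
    ["www.mydomain.com", "web.mydomain.com", "w3.mydomain.com"])
print(exact)  # ['www.mydomain.com']

# "{3,5}": between three and five "0" characters after the "6".
ranged = full_matches(
    r"60{3,5} years",
    ["6000 years", "60000 years", "600000 years", "60 years", "6000000 years"])
print(ranged)  # ['6000 years', '60000 years', '600000 years']
```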
The vertical bar "|" is used to represent an "OR" condition. Use it to separate alternate patterns or characters for matching. For example:
Regex: perl|python
Matches: "perl", "python"
Doesn't match: "ruby"
Parentheses "()" are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group. For example:
Regex: (abc){2,3}
Matches: "abcabc", "abcabcabc"
Doesn't match: "abc", "abccc"
Groups can be used in conjunction with alternation. For example:
Regex: gr(a|e)y
Matches: "gray", "grey"
Doesn't match: "graey"
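Alternation and grouping behave the same way in Python. The sketch below reuses the examples above and tests each candidate as a whole string; the variable names are illustrative:

```python
import re

# Alternation: either "perl" or "python".
langs = [w for w in ["perl", "python", "ruby"]
         if re.fullmatch(r"perl|python", w)]
print(langs)  # ['perl', 'python']

# A quantifier after a group applies to the whole group.
repeats = [w for w in ["abcabc", "abcabcabc", "abc", "abccc"]
           if re.fullmatch(r"(abc){2,3}", w)]
print(repeats)  # ['abcabc', 'abcabcabc']

# Alternation inside a group: "a" or "e" between "gr" and "y".
greys = [w for w in ["gray", "grey", "graey"]
         if re.fullmatch(r"gr(a|e)y", w)]
print(greys)  # ['gray', 'grey']
```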
Strings that match these groups are stored, or "captured", for use in substitutions or later in the same expression. The first group is referenced with "\1", the second with "\2" and so on. For example:
Regex: (.{2,5}) (.{2,8}) <\1_\2@example\.com>
Matches: "Joe Smith <Joe_Smith@example.com>", "jane doe <jane_doe@example.com>", "459 33154 <459_33154@example.com>"
Doesn't match: "john doe <doe_john@example.com>", "Jane Doe <janie88@example.com>"
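In Python the same backreferences are available inside the pattern, and the stored text can be read from the match object with group(). A minimal sketch, with illustrative variable names:

```python
import re

# "\1" and "\2" must repeat exactly what groups 1 and 2 matched.
pattern = re.compile(r"(.{2,5}) (.{2,8}) <\1_\2@example\.com>")

lines = [
    "Joe Smith <Joe_Smith@example.com>",
    "jane doe <jane_doe@example.com>",
    "459 33154 <459_33154@example.com>",
    "john doe <doe_john@example.com>",
    "Jane Doe <janie88@example.com>",
]

consistent = [line for line in lines if pattern.fullmatch(line)]
print(len(consistent))  # 3

# The captured text is available on the match object.
first = pattern.fullmatch(lines[0])
print(first.group(1), first.group(2))  # Joe Smith
```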
Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:
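Regex: [cbe]at
Matches: "cat", "bat", "eat"
Doesn't match: "sat", "beat" (as whole words; a substring search would still find "eat" inside "beat")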
Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class. For example:
Regex: [0123456789]{3}
Matches: "123", "999", "376"
Doesn't match: "W3C", "2_4"
If we were to try the same thing with letters, we would have to enter all 26 letters in upper and lower case. Fortunately, we can specify a range instead using a hyphen. For example:
Regex: [a-zA-Z]{4}
Matches: "Perl", "ruby", "SETL"
Doesn't match: "1234", "AT&T"
Most languages have special patterns for representing the most commonly used character classes. For example, Python uses "\d" to represent any digit (same as "[0-9]") and "\w" to represent any alphanumeric, or "word" character (same as "[a-zA-Z0-9_]" for ASCII text). See your language documentation for the special sequences applicable to the language you use.
To define a group of characters you do not want to match, use a negated character class. Adding a caret "^" to the beginning of the character class (i.e. [^...]) means "match any character except these". For example:
Regex: [^a-zA-Z]{4}
Matches: "1234", "$.25", "#77;"
Doesn't match: "Perl", "AT&T"
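The character class examples can be tried in Python as sketched below. The helper full_matches is hypothetical and exists only for this illustration; it tests each candidate as a whole string, which matters for "beat" (a plain search would find "eat" inside it):

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# One character from the set, followed by "at".
rhymes = full_matches(r"[cbe]at", ["cat", "bat", "eat", "sat", "beat"])
print(rhymes)  # ['cat', 'bat', 'eat']

# A range inside the class, combined with a quantifier.
digits = full_matches(r"[0-9]{3}", ["123", "999", "376", "W3C", "2_4"])
print(digits)  # ['123', '999', '376']

letters = full_matches(r"[a-zA-Z]{4}", ["Perl", "ruby", "SETL", "1234", "AT&T"])
print(letters)  # ['Perl', 'ruby', 'SETL']

# Negated class: four characters, none of them a letter.
non_letters = full_matches(r"[^a-zA-Z]{4}", ["1234", "$.25", "#77;", "Perl", "AT&T"])
print(non_letters)  # ['1234', '$.25', '#77;']

# A search (rather than a full match) finds "eat" inside "beat".
inside = re.search(r"[cbe]at", "beat").group()
print(inside)  # eat
```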
Anchors are used to specify where in a string or line to look for a match. The "^" metacharacter (when not used at the beginning of a negated character class) specifies the beginning of the string or line:
Regex: ^From: root@server.*
Matches: "From: root@server.example.com"
Doesn't match: "I got this From: root@server.example.com yesterday", ">> From: root@server.example.com"
The "$" metacharacter specifies the end of a string or line:
Regex: .*\/index\.php$
Matches: "www.example.org/index.php", "the file is /tmp/index.php"
Doesn't match: "www.example.org/index.php?id=245", "www.example.org/index.php4"
Sometimes it's useful to anchor both the beginning and end of a regular expression. This not only makes the expression more specific, it often improves the performance of the search.
Regex: ^To: .*example\.org$
Matches: "To: feedback@example.org", "To: hr@example.net, qa@example.org"
Doesn't match: "To: qa@example.org, hr@example.net", "Send a Message To: example.org"
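The anchor examples can be reproduced in Python with re.search, which would otherwise match anywhere in the text. In this sketch the literal dots are escaped, and the "/" is left unescaped because it has no special meaning in Python patterns. The variable names are illustrative:

```python
import re

headers = [
    "From: root@server.example.com",
    "I got this From: root@server.example.com yesterday",
    ">> From: root@server.example.com",
]
# "^" pins the match to the start of the string.
at_start = [h for h in headers if re.search(r"^From: root@server.*", h)]
print(at_start)  # ['From: root@server.example.com']

urls = [
    "www.example.org/index.php",
    "the file is /tmp/index.php",
    "www.example.org/index.php?id=245",
    "www.example.org/index.php4",
]
# "$" pins the match to the end of the string.
at_end = [u for u in urls if re.search(r".*/index\.php$", u)]
print(at_end)  # ['www.example.org/index.php', 'the file is /tmp/index.php']

recipients = [
    "To: feedback@example.org",
    "To: hr@example.net, qa@example.org",
    "To: qa@example.org, hr@example.net",
    "Send a Message To: example.org",
]
# Both anchors: the whole string must fit the pattern.
both = [r for r in recipients if re.search(r"^To: .*example\.org$", r)]
print(both)  # ['To: feedback@example.org', 'To: hr@example.net, qa@example.org']
```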
Regular expressions can be used as a "search and replace" tool. This aspect of regex use is known as substitution.
There are many variations in substitution syntax depending on the language used. This primer uses the "s/search/replacement/modifier" convention used in Perl. In simple substitutions, the "search" text will be a regex like the ones we've examined above, and the "replacement" value will be a string.
For example, to search for an old domain name and replace it with the new domain name:
Substitution: s/http:\/\/www\.old-domain\.com/http://www.new-domain.com/
Original text: "http://www.old-domain.com"
Result: "http://www.new-domain.com"
Notice that the "/" and "." characters are not escaped in the replacement string. In replacement strings, they do not need to be. In fact, if you were to precede them with backslashes, they would appear in the substitution literally (i.e. "http:\/\/www\.new-domain\.com").
The one use for the backslash "\" in a replacement string is to insert saved matches (captured groups) using "\num". For example:
Substitution: s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/
Original text: "http://old-domain.com"
Result: "http://new-domain.com"
This regex will actually match a number of URLs other than "http://old-domain.com". If we had a list of URLs with various permutations, we could replace all of them with related versions of the new domain name (e.g. "ftp://old-domain.net" would become "ftp://new-domain.net"). To do this we need to use a modifier.
Modifiers alter the behavior of the regular expression. The previous substitution example replaces only the first occurrence of the search string; once it finds a match, it performs the substitution and stops. To modify this regex in order to replace all matches in the string, we need to add the "g" modifier.
Substitution: s/(ftp|http):\/\/old-domain\.(com|net|org)/\1://new-domain.\2/g
Original text: "http://old-domain.com and ftp://old-domain.net"
Result: "http://new-domain.com and ftp://new-domain.net"
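Python has no "s/.../.../" syntax; the closest equivalent is re.sub. The sketch below assumes that mapping: re.sub replaces every match by default (like the "g" modifier), and its count argument limits the number of replacements. Saved matches are inserted with "\1" and "\2" as above:

```python
import re

pattern = r"(ftp|http)://old-domain\.(com|net|org)"
replacement = r"\1://new-domain.\2"
text = "http://old-domain.com and ftp://old-domain.net"

# count=1 replaces only the first match, like a substitution without "g".
first_only = re.sub(pattern, replacement, text, count=1)
print(first_only)  # http://new-domain.com and ftp://old-domain.net

# With no count, every match is replaced, like the "g" modifier.
all_matches = re.sub(pattern, replacement, text)
print(all_matches)  # http://new-domain.com and ftp://new-domain.net
```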
The "i" modifier causes the match to ignore the case of alphabetic characters. For example:
Regex: /ActiveState\.com/i
Matches: "activestate.com", "ActiveState.com", "ACTIVESTATE.COM"
Modifier | Meaning |
i | Ignore case when matching exact strings. |
m | Treat string as multiple lines. Allow "^" and "$" to match next to newline characters. |
s | Treat string as single line. Allow "." to match a newline character. |
x | Ignore whitespace and newline characters in the regular expression. Allow comments. |
o | Compile regular expression once only. |
g | Match all instances of the pattern in the target string. |
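The modifiers in the table are written Perl-style. In Python, the "i", "m", "s" and "x" behaviors are selected with flags passed to the re functions. The sketch below shows that assumed mapping (re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE); the sample text and variable names are illustrative:

```python
import re

# "i": ignore case.
ignore_case = bool(re.search(r"ActiveState\.com", "ACTIVESTATE.COM", re.IGNORECASE))
print(ignore_case)  # True

text = "first line\nsecond line"

# "m": "^" and "$" also match at the start and end of each line.
single = re.findall(r"^\w+", text)
multi = re.findall(r"^\w+", text, re.MULTILINE)
print(single)  # ['first']
print(multi)   # ['first', 'second']

# "s": "." also matches a newline character.
no_dotall = bool(re.search(r"first.*second", text))
dotall = bool(re.search(r"first.*second", text, re.DOTALL))
print(no_dotall, dotall)  # False True

# "x": whitespace and comments inside the pattern are ignored.
domain = re.compile(r"""
    www \.      # literal "www."
    \w+         # site name
    \. com      # literal ".com"
""", re.VERBOSE)
verbose_hit = domain.search("see www.example.com").group()
print(verbose_hit)  # www.example.com
```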
Komodo's Search features (including "Find...", "Replace..." and "Find in Files...") can accept plain text, glob style matching (called "wildcards" in the drop list, but using "." and "?" differently than regex wildcards), and Python regular expressions. A complete guide to regexes in Python can be found in the Python documentation. The Regular Expression HOWTO by A.M. Kuchling is a good introduction to regular expressions in Python.