Regular Expressions


A regular expression is a pattern that describes what to match during a search. It is often called a regex, regexp or even just RE. The pattern is compiled by a regex engine before it is used to match a sequence of characters.

The regex engine in JPad Pro and SitePad Pro uses a syntax that is compatible with the regular expression syntax in Perl 5 (plus extensions). If you need more help using regular expressions, we recommend that you search online for regular expression help. If you are new to regular expressions, we recommend Web Reference's Regular Expression Tutorial. Unless you know Perl, we do not recommend that you start with Perl regular expression tutorials. If you need more information, you may want to take a look at Mastering Regular Expressions, a 400-page book published by O'Reilly.

Some Examples

  • The regular expression "[^"]*" will match a double-quoted string. This expression says to first find a ". Then match all characters except a " and then include the final ". The expression [^"]* defines the set of characters to match. In this case it is the complement set (match all except "). The * says to match 0 or more characters.

  • The previous regular expression is fine provided that you do not have escaped quotes in your string. The solution to handle escaped quotes within a string is "(\\.|[^\\"])*". The difference is that, between the quotes, this regex accepts either a backslash followed by another character (which can be a quote or backslash) or any character that is not a backslash or quote.

  • The regular expression <[^>]*> will match all HTML/XML tags. This expression says to first find a <. Then match all characters except a > and then include the final >. The expression [^>]* defines the set of characters to match. In this case it is the complement set (match all except >). The * says to match 0 or more characters.

  • The regular expression <(?!/)[^>]*> will only match open tags. The only difference between this expression and the last one is the (?!/). This is a negative lookahead expression which in this case says not to match a / after the <.

  • The regular expression </[^>]*> will only match close tags. The only difference between this expression and the last one is that the (?!/) has been replaced by a /.
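The five examples above can be tried outside the editor. The following is a minimal sketch that uses Python's re module as a stand-in for the JPad Pro and SitePad Pro engine (an assumption made for illustration; the two engines are not identical). The sample strings and variable names are hypothetical.

```python
import re

# Patterns copied from the examples above. Python's re module is used
# here only as a stand-in for the editor's engine (an assumption).
QUOTED = re.compile(r'"[^"]*"')
QUOTED_ESCAPED = re.compile(r'"(\\.|[^\\"])*"')
ANY_TAG = re.compile(r'<[^>]*>')
OPEN_TAG = re.compile(r'<(?!/)[^>]*>')
CLOSE_TAG = re.compile(r'</[^>]*>')

# Hypothetical sample input: the second string contains escaped quotes.
text = r'say "hi" and "a \"b\" c"'
html = '<p class="x">text</p><br/>'

# finditer + group(0) returns the whole match even when the pattern
# contains a reporting group.
quoted = [m.group(0) for m in QUOTED.finditer(text)]
quoted_escaped = [m.group(0) for m in QUOTED_ESCAPED.finditer(text)]

all_tags = ANY_TAG.findall(html)
open_tags = OPEN_TAG.findall(html)
close_tags = CLOSE_TAG.findall(html)

print(quoted)          # the simple pattern splits the escaped string
print(quoted_escaped)  # the second pattern keeps it in one piece
print(all_tags, open_tags, close_tags)
```

Note that in this sketch the open-tag pattern also reports the self-closing <br/>, because it only checks that no / follows the <.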

Regular Expression Syntax

All characters are literals (a literal is a character to match when searching) except the following: .|*?+(){}[]^\$. To match any of these non-literal characters you will need to escape the character with a backslash.
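As a small illustration of escaping, the sketch below searches for a price that contains two non-literal characters, $ and the dot. It uses Python's re module as an assumed stand-in for the editor's engine, and the sample string is hypothetical.

```python
import re

price = "cost: $3.50 (approx.)"

# Unescaped, '$' is an end-of-line anchor and '.' is the wildcard,
# so this pattern cannot match the literal text.
unescaped = re.search(r"$3.50", price)

# With a backslash in front of each, both characters are literals.
escaped = re.search(r"\$3\.50", price)

print(unescaped)          # no match object
print(escaped.group(0))   # the literal price text
```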

The '.' is the wildcard character that will match any character including a newline, '\n'.
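Engines differ on whether the wildcard crosses line boundaries. The sketch below uses Python's re module (an assumption for illustration, not the editor's engine), where the dot skips '\n' unless the DOTALL flag is set. Setting the flag reproduces the behaviour described above.

```python
import re

text = "a\nb"

# With re.DOTALL the '.' also matches '\n', as described above.
with_dotall = re.fullmatch(r"a.b", text, re.DOTALL)

# Without the flag, Python's '.' does not match '\n'.
without_flag = re.fullmatch(r"a.b", text)

print(with_dotall is not None, without_flag is not None)
```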

The repeat characters are *+?. The * means zero or more matches. The + means 1 or more matches and the ? means zero or one match. The bounds operator is {}. The bounds operator lets you specify the minimum and maximum number of repeats explicitly. For example, x{2} would match xx and x{2,4} would match xx, xxx or xxxx.
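The repeat and bounds operators can be compared side by side. This is a minimal sketch in Python's re module (assumed here as a stand-in for the editor's engine). The helper full_matches and the sample list are hypothetical names introduced for the example.

```python
import re

samples = ["", "x", "xx", "xxx", "xxxx", "xxxxx"]

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r"x*")             # zero or more
plus = full_matches(r"x+")             # one or more
optional = full_matches(r"x?")         # zero or one
exactly_two = full_matches(r"x{2}")    # exactly two
two_to_four = full_matches(r"x{2,4}")  # between two and four

for name, result in [("x*", star), ("x+", plus), ("x?", optional),
                     ("x{2}", exactly_two), ("x{2,4}", two_to_four)]:
    print(name, result)
```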

Parentheses are used for grouping and reporting submatches. If you do not want the group reported, you can use the non-reporting group syntax (?:expression).
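The difference between a reporting and a non-reporting group shows up in the list of submatches. The sketch below assumes Python's re module as a stand-in for the editor's engine; the input string is hypothetical.

```python
import re

# Both groups report a submatch.
m_capturing = re.match(r"(\d+)-(\d+)", "10-20")
capturing_groups = m_capturing.groups()

# The first group still groups, but is not reported.
m_non_reporting = re.match(r"(?:\d+)-(\d+)", "10-20")
non_reporting_groups = m_non_reporting.groups()

print(capturing_groups)      # two submatches
print(non_reporting_groups)  # one submatch
```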

Lookahead assertions are provided by (?=expression) and (?!expression). The (?=expression) matches zero characters only if followed by the expression. The (?!expression) matches zero characters only if not followed by the expression.
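A lookahead checks what follows without adding it to the match. This is a minimal sketch using Python's re module (an assumed stand-in for the editor's engine) on a hypothetical sample string.

```python
import re

text = "foobar foobaz"

# Positions where "foo" is followed by "bar".
followed = [m.start() for m in re.finditer(r"foo(?=bar)", text)]

# Positions where "foo" is NOT followed by "bar".
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", text)]

# The lookahead consumes zero characters: only "foo" is matched.
matched_text = re.search(r"foo(?=bar)", text).group(0)

print(followed, not_followed, matched_text)
```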

Alternatives are separated by a |. Each alternative is the largest possible sub-expression on either side of the |, so ab|cd matches ab or cd. Use parentheses to limit the scope of the alternatives, as in a(b|c)d.
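The scope of | is easy to misjudge, so here is a short sketch in Python's re module (assumed as a stand-in for the editor's engine). It contrasts an unparenthesized alternation with one limited by a non-reporting group; the variable names are hypothetical.

```python
import re

# Without parentheses the alternatives are the whole of "ab" and "cd".
wide_ab = re.fullmatch(r"ab|cd", "ab")
wide_abd = re.fullmatch(r"ab|cd", "abd")

# Parentheses limit the alternatives to "b" and "c".
narrow_abd = re.fullmatch(r"a(?:b|c)d", "abd")
narrow_ab = re.fullmatch(r"a(?:b|c)d", "ab")

print(bool(wide_ab), bool(wide_abd), bool(narrow_abd), bool(narrow_ab))
```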

Sets let you specify a set of characters that can match any single character that is a member of the set. Sets are delimited by []. Sets that start with the ^ contain the complement of the list of entities.

Entities include the following character classes:

    alnum - any alpha numeric character
    alpha - any alphabetical character a-z and A-Z
    blank - any blank character, either a space or a tab
    cntrl - any control character
    digit - any digit 0-9
    graph - any graphical character
    lower - any lower case character a-z
    print - any printable character
    punct - any punctuation character
    space - any whitespace character
    upper - any upper case character A-Z
    xdigit - any hexadecimal digit character, 0-9, a-f and A-F
    word - any alphanumeric character plus the underscore
    unicode - any character whose code is greater than 255

To specify a character class use the syntax "[:classname:]" within a set declaration. For example, "[[:space:]]" is the set of all whitespace characters. To include a literal - in a set declaration make it the first character after the opening [ or [^.
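The sketch below illustrates a complement set and a set with a leading literal -. It uses Python's re module as an assumed stand-in for the editor's engine. Python's re does not accept the [:classname:] syntax, so the example uses the \s escape in place of [[:space:]]. The sample strings are hypothetical.

```python
import re

# Complement set: every character that is not a vowel or a space.
non_vowels = re.findall(r"[^aeiou ]", "regular set")

# A literal '-' placed first in the set, followed by '+'.
signed = re.findall(r"[-+]\d+", "3 -4 +5 6-7")

# Python has no [[:space:]]; \s is used here as the nearest equivalent.
fields = re.split(r"\s+", "a \t b\nc")

print(non_vowels)
print(signed)
print(fields)
```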

Line and buffer anchors

    ^ matches the null string at the start of a line
    $ matches the null string at the end of a line
    \` matches the start of the file
    \A matches the start of the file
    \' matches the end of the file
    \z matches the end of the file
    \Z matches the end of the file
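Line anchors and buffer anchors behave differently on multi-line text. The following sketch uses Python's re module as an assumed stand-in for the editor's engine. Python needs the MULTILINE flag for ^ and $ to work per line, and it does not support the \` and \' forms, so only \A and \Z are shown. The sample text is hypothetical.

```python
import re

text = "first\nsecond\nthird"

# Line anchors: with re.MULTILINE, ^ and $ apply to every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# Buffer anchors: \A and \Z apply only to the whole text,
# even when re.MULTILINE is set.
buffer_start = re.findall(r"\A\w+", text, re.MULTILINE)
buffer_end = re.findall(r"\w+\Z", text, re.MULTILINE)

print(line_starts, line_ends)
print(buffer_start, buffer_end)
```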

Back references let you refer to a previous sub-expression that has already been matched. A back reference consists of the escape character \ followed by a digit 1 to 9.
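A common use of back references is finding accidentally repeated words. This is a minimal sketch in Python's re module (assumed as a stand-in for the editor's engine); the sentence is a hypothetical sample.

```python
import re

sentence = "it is is the the best"

# (\w+) reports a word; \1 must then match the same text again.
doubled = re.findall(r"\b(\w+) \1\b", sentence)

print(doubled)  # the words that appear twice in a row
```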

Escapes include the following:

    \s matches any [:space:]
    \S matches any [^[:space:]] 
    \d matches any [:digit:] 
    \D matches any [^[:digit:]]
    \l matches any [:lower:] 
    \L matches any [^[:lower:]]
    \u matches any [:upper:] 
    \U matches any [^[:upper:]]
    \w matches any [:word:] 
    \W matches any [^[:word:]]
    \< matches the null string at the start of a word
    \> matches the null string at the end of the word
    \b matches the null string at either the start or the end of a word. 
    \B matches a null string within a word
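Some of the escapes above can be tried with the sketch below, which uses Python's re module as an assumed stand-in for the editor's engine. Python's re does not provide \l, \L, \u, \U, \< or \>, so only \d, \w, \W, \b and \B are shown. The sample strings are hypothetical.

```python
import re

text = "Order 66, item_7!"

digits = re.findall(r"\d+", text)    # runs of digits
words = re.findall(r"\w+", text)     # letters, digits and underscore
non_words = re.findall(r"\W+", text) # everything else

# \b matches at a word edge, \B only inside a word.
whole_cat = re.findall(r"\bcat\b", "cat concat cats cat.")
inner_cat = re.findall(r"\Bcat", "cat concat cats")

print(digits, words, non_words)
print(len(whole_cat), len(inner_cat))
```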

You can use the quote operator \Q to treat all following characters as literals until the closing \E.
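Quoting is useful when the text to find contains many non-literal characters. Python's re module, used here as an assumed stand-in for the editor's engine, has no \Q and \E. The sketch below uses re.escape instead, which is a different mechanism with a similar effect. The names literal and pattern are hypothetical.

```python
import re

# Text that should be matched exactly as written.
literal = "a.b*(c)"

# re.escape backslash-escapes the special characters, roughly what
# wrapping the text in \Q and \E would do in the editor's syntax.
pattern = re.compile(re.escape(literal) + r"\d+")

hit = pattern.search("x a.b*(c)42 y")
miss = pattern.search("x aXb(c)42 y")

print(hit.group(0))
print(miss)
```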

Regular Expression Implementation Notes

To match the tilde you will need to use the expression \\~.

To match newlines, use '\n' for all OS file types (Windows, Mac and UNIX). Be aware that the '.' matches all characters, including '\n'. The [:space:] class also includes '\n' in its set.

 
Copyright 1996-2008