An Introduction to Regular Expressions

Regular expressions are a very useful tool for a variety of string-related tasks. In Kettle they are frequently used for extraction and manipulation, as well as for specifying groups of file names. This post gives an introduction to regular expressions in general, along with some applications within Kettle a.k.a. PDI. Since the built-in Kettle steps use the Java regular expression engine, the samples in this post are written for the Java implementation of regular expressions. The article contains many samples explaining each concept of regular expression syntax, and the download package contains many samples of how to use regular expressions within PDI. Let’s start with a basic question that may present itself if you’ve never worked with regular expressions before.

What is a Regular Expression?

On an abstract level a regular expression, regex for short, is a representation of a set of strings. Say there is a regex representing valid zip codes. Instead of having a list and thus the complete set of strings that are valid zip codes, it’s often more practical to have a short and precise regex that completely describes the set. Regular expressions are notated as patterns. As an example consider the set of strings that end in “.csv”. The following is a regular expression pattern that represents this set.

^.*\.csv$

The mechanics of this particular pattern aside, a regex engine can test a pattern against an input string to see if it matches. If it does, the input string is an element of the set of strings represented by that regex pattern. The above pattern matches foo.csv, but does not match bar.txt or my_csv_file. This is to say foo.csv belongs to the set of strings ending in “.csv”, while bar.txt and my_csv_file do not.
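If you’d like to see this match test in plain Java (the engine the Kettle steps use under the hood), here is a minimal, self-contained sketch; the class name is just illustrative:

```java
import java.util.regex.Pattern;

public class CsvMatch {
    public static void main(String[] args) {
        // Whole-string test: does the input belong to the set of
        // strings ending in ".csv"?
        Pattern p = Pattern.compile("^.*\\.csv$");
        System.out.println(p.matcher("foo.csv").matches());     // true
        System.out.println(p.matcher("bar.txt").matches());     // false
        System.out.println(p.matcher("my_csv_file").matches()); // false
    }
}
```

Note that backslashes in the pattern must be doubled inside a Java string literal, so the regex `\.` is written as `\\.`.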

People solve real-world string processing problems using regex patterns, leaving aside the abstract idea of a regex representing a “set of strings” and concentrating on the problem at hand, for example: “Can I test if a string is a valid zip code?” If a particular problem seems very challenging, however, it may help to revert to the abstract view that is used for reasoning about regular expressions. Regular expressions have inherent limitations every practitioner should be aware of. Failing to recognize them may lead to a prolonged and frustrating experience, akin to trying to find an exact rational representation of Pi.

What are Regular Expressions used for?

Regular expressions are useful in any scenario that benefits from full or partial pattern matches on strings. These are some common use cases:

  • verify the structure of strings
  • extract information from strings
  • search / replace / rearrange parts of a string
  • split a string into tokens

All of these come up regularly when doing ETL work. Consequently PDI supports regular expressions in many places. There are several steps dealing with string validation and manipulation. Regex patterns are also often used to specify/filter file names. This post explains the basics of regular expressions and provides samples for many PDI features that support them in the download package.

Following the Examples

Since I suspect most readers will be looking at the examples from an ETL perspective, here’s how to quickly test regular expressions in PDI 4.x: Create a new transformation and place a “Regex Evaluation” step from the “Scripting” section on the canvas. Open the step’s edit dialog and click the “Test regEx” button to the right.

The regular expression evaluation window allows matching regular expressions against up to three different input strings. The input field for a test string turns green if the regex matches the field’s current value, and red if it does not. The “Capture” section of the evaluation window is useful for testing data extraction from the input string.

There is also a good online Java regex tester with very similar functionality, which is useful if you’d like to follow the examples in another browser window.

The Building Blocks of a Regular Expression

A regular expression pattern is constructed from various building blocks. It may contain literals, character classes, boundary matchers, quantifiers, groups and the selection operator a.k.a. the OR operator. Each concept is introduced in the following sections and each is supported by some example patterns.

Literals

The most basic building block in a regular expression is a character a.k.a. literal. Most characters in a regex pattern do not have a special meaning, they simply match themselves. Consider the following pattern:

I am a harmless regex pattern

None of the characters in this pattern carries special meaning. Thus each character of the pattern matches itself. Therefore there is only one string that matches this pattern, and it is identical to the pattern string itself: “I am a harmless regex pattern”.

Escaping Literal Characters

What are then the characters that do have special meaning? The following list shows the characters that have special meaning in a regular expression, and must therefore be escaped by a backslash character if they are supposed to be processed as literal characters.

Character Meaning Escaped
^ boundary matcher \^
$ boundary matcher \$
\ escape character \\
{ quantifier notation \{
} quantifier notation \}
[ character class notation \[
] character class notation \]
( group notation \(
) group notation \)
. predefined character class \.
* quantifier \*
+ quantifier \+
? quantifier \?
| OR operator \|

Escaping non-printable Characters

Sometimes it’s necessary to refer to some non-printable (or not-easily-typed) special character. The following table gives the notation details for common special characters, as listed by the Java documentation.

Escape Sequence Meaning
\t The tab character ('\u0009')
\n The newline (line feed) character ('\u000A')
\r The carriage-return character ('\u000D')
\f The form-feed character ('\u000C')
\a The alert (bell) character ('\u0007')
\e The escape character ('\u001B')
\cx The control character corresponding to x
\0n The character with octal value 0n (0 <= n <= 7)
\0nn The character with octal value 0nn (0 <= n <= 7)
\0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh The character with hexadecimal value 0xhh
\uhhhh The character with hexadecimal value 0xhhhh

As an example consider the following pattern:

\+21\.5

The pattern consists of literals only, and thus matches only one string: +21.5

Escaping Sequences of Characters

Sometimes a pattern requires a lot of characters to be escaped as literals, say it’s supposed to match three question marks enclosed by backslashes. The pattern would look like this:

\\\?\?\?\\

The need to escape every character as literal makes it harder to read and to understand. If there are sequences of characters in a pattern that should be interpreted as literals, they can be enclosed by \Q and \E for convenience. The following pattern is equivalent to the above:

\Q\???\\E

Escaping parts of a pattern can also be useful if it is constructed from different dynamic parts, some of which are to be interpreted literally, for example a search word coming from user input.
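A short Java sketch can illustrate both escaping styles, and also `Pattern.quote()`, which builds the \Q…\E form for you at runtime (handy for the dynamic user-input case just mentioned):

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    public static void main(String[] args) {
        String escaped = "\\\\\\?\\?\\?\\\\"; // the regex \\\?\?\?\\
        String quoted  = "\\Q\\???\\\\E";     // the regex \Q\???\\E
        String input   = "\\???\\";           // the literal text \???\

        System.out.println(input.matches(escaped)); // true
        System.out.println(input.matches(quoted));  // true
        // Pattern.quote() produces an equivalent \Q...\E pattern:
        System.out.println(input.matches(Pattern.quote("\\???\\"))); // true
    }
}
```

The double backslashes are an artifact of Java string literals; the underlying regex patterns are exactly the ones shown in the comments.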


The OR Operator

The OR operator, notated as a pipe character, allows alternatives in the input string. Suppose a regular expression should match the strings -1 and +1. The following pattern does the trick:

\+1|-1

The pattern gives the allowed alternatives verbatim. Please note that the | operator’s arguments are the entire expressions to the left and right of it. The following pattern matches William Turner and Bill Turner.

William Turner|Bill Turner

What’s interesting here is that the latter part of the alternatives is consistently Turner. Would it not be convenient to put the alternatives William and Bill up front, and notating Turner only once? The following pattern does that:

(William|Bill) Turner

It certainly looks a bit more readable, but introduces a new concept: Groups.

Groups

Groups, notated by round brackets, have several uses in regular expressions. They do what you’d expect them to: They group the contained expressions into a single unit. Grouping up parts of a pattern has many uses. Groups are used for simplifying regex notation, applying quantifiers to sub-expressions, and they are also useful in search and replace operations. In the above example a group was used to make the pattern more concise. But grouping the alternatives William and Bill into a separate expression has another advantage. Groups are sometimes referred to as “capture groups” because in case of a match, each group’s matched sub-string is captured, and is available for extraction. In the case of the sample, the first name ends up in the capture group. Sure enough, the regex step detects and captures the first name correctly.
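Capturing the first name can be sketched in Java like this, using `Matcher.group()`; group 0 is the entire match and group 1 is the first capture group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstNameCapture {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(William|Bill) Turner");
        Matcher m = p.matcher("Bill Turner");
        if (m.matches()) {
            System.out.println(m.group(0)); // Bill Turner (the whole match)
            System.out.println(m.group(1)); // Bill (the captured first name)
        }
    }
}
```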

Chess Notation Capturing Group Example

Consider a string identifying a chess board field. Fields on a chess board can be identified as A1-A8 for the first column, B1-B8 for the second column and so on until H1-H8 for the last column. Suppose a string containing this notation should be validated and the components (the letter and the digit) extracted using capture groups. The following regular expression would do that.

(A|B|C|D|E|F|G|H)(1|2|3|4|5|6|7|8)

While the above regular expression is valid and does the job, it is somewhat clunky. This one works just as well, and it is a bit more concise:

([A-H])([1-8])

This sure looks more intuitive. But again, it introduces a new concept: Character Classes.

Character Classes

Character classes are used to define a set of allowed characters. The set of allowed characters is notated in square brackets, with each allowed character listed. The character class [abcdef] is equivalent to (a|b|c|d|e|f). Since the class contains alternatives, it matches exactly one character in the input string. The pattern [ab][cd] matches exactly four strings: “ac”, “ad”, “bc”, and “bd”. It does not match “ab”, as the second character of the string must match [cd].

For another example, suppose a pattern should match all strings that consist of the letter “q” followed by four lower case vowels.

q[aeiou][aeiou][aeiou][aeiou]

The pattern matches “qioue”, “qiaea”, and “queue” for example. It matches all 625 (5^4) different strings starting with q followed by four lower case vowels.
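A quick Java check of the pattern (class name illustrative):

```java
import java.util.regex.Pattern;

public class VowelTail {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("q[aeiou][aeiou][aeiou][aeiou]");
        System.out.println(p.matcher("queue").matches()); // true
        System.out.println(p.matcher("quack").matches()); // false: 'c' and 'k' are not vowels
    }
}
```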

Ranges

Suppose a pattern should match a two digit code. It could look like this:

[0123456789][0123456789]

This pattern matches all 100 two digit strings (namely the range from “00” to “99”). It is somewhat tedious to list all possible characters in a character class verbatim. Consecutive characters (as per the used charset) can be included in a character class as “ranges” using the dash operator. The digits 0-9 are encoded sequentially in ASCII and extended charsets compatible with it, so using range notation the above pattern is condensed into this:

[0-9][0-9]

This pattern still matches the same exact two digit strings. The lower case and upper case sequences of the latin alphabet are encoded consecutively in ASCII and compatible charsets, so character classes notated for alphabetic chars are often seen too. A pattern that matches any 3-letter string that starts with an upper case letter, followed by a lower case vowel, followed by a latin letter of any case, could be written down like this:

[A-Z][aeiou][A-Za-z]

This matches strings like “RoR”, “Foo”, “Bar”, “DoS”, and “Gun”. Please note however, that special characters like äîøß do not fall into the a-z range, so if you want to match them, you’d have to include them verbatim into the class as in [a-zäîøß] or use a predefined character class that includes them, like a language specific Unicode block for Latin-1 Supplement, Greek, etc.

Since the dash character is the range operator inside a character class definition, it’s necessary to escape it in a character class, if it’s to be interpreted as a literal. So a character class that matches the plus and minus sign is written as [+\-].

Negations

Sometimes it’s useful to define a character class that matches most characters, except for a few defined exceptions. If a character class definition begins with a ^, its definition is negated. As an example, the following class allows any char that is not a digit or an underscore.

[^0-9_]

Please note that the ^ character is interpreted as a literal if it is not the first character of the class, as in [f^o], and that it is a boundary matcher if used outside character classes.

Unions

It is possible to nest character classes to achieve a union effect. This is rarely seen in practice, since a union is usually achieved by typing all included characters next to each other. [a-d[q-z]] is therefore equivalent to [a-dq-z]. Suppose however it’s necessary to allow any character as long as it is not a digit except for 4. You may end up with [4[^0-9]] representing the union of all non-digit characters with the digit 4. Therefore [4[^0-9]] matches f,r,Q,4, the tab character etc. but it does not match 7.

Intersections

Suppose a character class should match a consonant. It is possible to define a character class for a consonant by listing each non-vowel individually, but that’s a bit much to write. Negating a short vowel character class also falls short, because that class would contain all kinds of non-vowel characters including whitespace and special characters. But if that negated vowel class was intersected with all common alphabet characters, that would do the trick. The intersection operator within character classes is &&. Thus the following character class matches all lower case consonants:

[a-z&&[^aeiou]]
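The intersection class can be verified with a small Java sketch; note that `&&` needs no extra escaping inside a Java string literal:

```java
import java.util.regex.Pattern;

public class ConsonantClass {
    public static void main(String[] args) {
        // [a-z&&[^aeiou]] : lower case letters minus the vowels
        Pattern consonant = Pattern.compile("[a-z&&[^aeiou]]");
        System.out.println(consonant.matcher("b").matches()); // true
        System.out.println(consonant.matcher("e").matches()); // false: a vowel
        System.out.println(consonant.matcher("3").matches()); // false: not in [a-z]
    }
}
```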

Predefined Character Classes

Some character classes are used so frequently that there are shorthands defined for them. Consider the character class [0-9], for example. It means “any digit character” and is used so often that it got the mnemonic notation \d. The following list shows character class shorthands supported by the Java regex implementation.

Shorthand Matches
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
\p{Lower} A lower-case alphabetic character: [a-z]
\p{Upper} An upper-case alphabetic character: [A-Z]
\p{ASCII} All ASCII: [\x00-\x7F]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}]
\p{Blank} A space or a tab: [ \t]
\p{Cntrl} A control character: [\x00-\x1F\x7F]
\p{XDigit} A hexadecimal digit: [0-9a-fA-F]
\p{Space} A whitespace character: [ \t\n\x0B\f\r]

The Dot Character Class

The most ubiquitous predefined character class is the dot, and it deserves a small section on its own. It matches any character except for newline. The following pattern matches any three character string ending with a lower case x, like "pax", "rex", "lex", "8 x", ".[x" etc.

..x

In practice the dot is often used to create "anything might go in here" sections in a string pattern. It is frequently combined with a quantifier and the short .* is often used to represent "anything" or "don't care" sections.

Please note that the . character loses its special meaning, when used inside a character class itself. The character class [.,] simply matches two characters, the dot and the comma.

The dot can be extended to also match line terminators if the DOTALL execution flag is enabled.
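The difference the DOTALL flag makes is easy to demonstrate in Java:

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    public static void main(String[] args) {
        String twoLines = "foo\nbar";
        // By default the dot does not match the newline, so .* cannot
        // consume the whole two-line string:
        System.out.println(Pattern.compile(".*").matcher(twoLines).matches()); // false
        // With DOTALL the dot matches line terminators as well:
        System.out.println(Pattern.compile(".*", Pattern.DOTALL)
                .matcher(twoLines).matches()); // true
    }
}
```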

Unicode in Character Classes

All Unicode characters are allowed as literals in a regex pattern. Feel free to include Unicode characters like ω or 는 directly in the pattern. It is possible to match Unicode blocks and categories as well. The following notation for Unicode character classes is supported:

Shorthand Matches
\p{InGreek} A character in the Greek block (simple block)
\p{Lu} An uppercase letter (simple category)
\p{Sc} A currency symbol
\P{InGreek} Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] Any letter except an uppercase letter (subtraction)

When referencing Unicode blocks the block name is prefixed with “In” and all spaces are omitted from the block name. So the block “Braille Patterns” is referenced by \p{InBraillePatterns}.

Unicode categories (both normative and informative) are referenced directly by name, which can optionally be prefixed by “Is”. So math symbols can be matched by \p{Sm} or \p{IsSm}.

Unicode Block Matching Sample

As an example consider the following character class. It matches pretty much every printable ASCII character, as well as Greek characters.

[\p{InBasicLatin}\p{InGreek}]

If the above class is extended by the * quantifier to match any number of its characters, it matches Hello World! and Σωκράτης (Sokrates), but does not match 나는 유리를 먹을 수 있어요. 그래도 아프지 않아요.


Boundary Matchers

Boundary matchers, or “anchors” as they are sometimes called, do not match a character as such, they match a boundary. They match the positions between characters, if you will. The most common matchers are ^ and $. They match the beginning and end of a line respectively. The following table shows all available boundary matchers.

Matcher Matches
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input

Anchoring to Beginnings and Endings

Consider a search operation for digits on a multiline text. Using [0-9] as the regular expression finds every digit in the text, no matter where it is located. Using ^[0-9] finds every digit that is the first character on a line. The same idea applies to $, but means the end of a line. When looking for lines ending in an upper case character, the following regular expression [A-Z]$ does the job.

The \A and \Z or \z anchors can be useful when matching multi-line strings. The \Z variant is tolerant of a trailing newline: it matches just before it, effectively discarding it. Suppose the requirement is to match a two line record specifying a chess position. This is what the input string looks like:

Column: F
Row: 7

The following pattern matches the above structure:

\AColumn: [A-H]\nRow: [1-8]\Z
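In Java this can be sketched as follows; `find()` is used here so that \Z’s tolerance of a trailing newline is visible (with `matches()` the entire input, including a trailing newline, would have to be consumed):

```java
import java.util.regex.Pattern;

public class ChessRecord {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\AColumn: [A-H]\\nRow: [1-8]\\Z");
        System.out.println(p.matcher("Column: F\nRow: 7").find());   // true
        System.out.println(p.matcher("Column: F\nRow: 7\n").find()); // true: \Z skips the final newline
        System.out.println(p.matcher("Row: 7\nColumn: F").find());   // false: \A anchors to the start
    }
}
```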

Whole Word Matches

The \b anchor matches the edge of any alphanumeric sequence. This can be useful if you want to do “whole word” matches. The following pattern will look for a standalone upper case I.

\bI\b

It can be extended to also allow any sequence of chars left and right of the standalone I using quantifiers. In this case it would look like this:

.*\bI\b.*

It matches Illinois is where I live and I like Illinois, but does not match Illinois is where you live as this sentence does not include a standalone I.
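The whole-word behavior can be confirmed with a short Java sketch:

```java
import java.util.regex.Pattern;

public class WholeWord {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(".*\\bI\\b.*");
        System.out.println(p.matcher("Illinois is where I live").matches());   // true
        // The 'I' in "Illinois" is followed by a word character,
        // so \b fails there and no standalone I is found:
        System.out.println(p.matcher("Illinois is where you live").matches()); // false
    }
}
```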

Misc Anchors

The somewhat esoteric non-word boundary \B is the negation of \b. It matches any position that is not matched by \b. So it matches every position within whitespace and alphanumeric sequences.

The \G boundary matcher is useful when a pattern is applied repeatedly to a string. It matches the position of the last match found. This is rarely seen in practice, as it is only useful when fetching contiguous partial string matches using a loop.

Quantifiers

Any literal or character class matches exactly one character. The pattern [0-9][0-9] matches exactly two digits, for example. Another way of writing the above pattern is by using a quantifier. Quantifiers are notated in curly brackets and are used to specify how many occurrences to match. The following sample is equivalent to [0-9][0-9]

[0-9]{2}

The basic notation can be extended to provide upper and lower bounds. Say it’s necessary to match between two and six digits. The exact number varies, but it must be between two and six. The following notation does that:

[0-9]{2,6}

The upper bound is optional; if omitted, any number of occurrences equal to or greater than the lower bound is acceptable. The following sample matches two or more consecutive digits.

[0-9]{2,}

There are some handy shortcuts/mnemonics for common quantifiers that are very frequently used in practice. Here they are:

The ? quantifier

[0-9]? is equivalent to [0-9]{0,1} which means in this case that an empty string or a single digit string is a match for this pattern.

The + quantifier

[0-9]+ is equivalent to [0-9]{1,} which means that a string of consecutive digits is matched. The empty string is not a match in this case, since there must be at least one digit character followed by any number of optional digit characters.

The * quantifier

[0-9]* is equivalent to [0-9]{0,} which means any number of consecutive digits is matched. The empty string matches too.
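The three shorthand quantifiers behave as described, which a few quick Java checks confirm:

```java
public class QuantifierShorthands {
    public static void main(String[] args) {
        System.out.println("".matches("[0-9]?"));     // true:  zero occurrences allowed
        System.out.println("7".matches("[0-9]?"));    // true:  one occurrence allowed
        System.out.println("77".matches("[0-9]?"));   // false: at most one occurrence
        System.out.println("".matches("[0-9]+"));     // false: at least one required
        System.out.println("2024".matches("[0-9]+")); // true
        System.out.println("".matches("[0-9]*"));     // true:  zero or more
    }
}
```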

Greedy by Default

Suppose the requirement is to match the domain part from a web URL in a capture group. The following may seem like a good idea: match the protocol, then capture the domain, then an optional path. The idea translates roughly to this:

http://(.*)/?.*

It matches the protocol, captures what comes after the protocol as domain and it allows for an optional slash and some arbitrary text after that, which would be the optional path. Strangely enough, the following is captured by the group given some input strings:

Input Captured
http://foo.com/some_resource.html foo.com/some_resource.html
http://google.com/ google.com/
http://myapi.net/some/resource/path.html?foo=bar myapi.net/some/resource/path.html?foo=bar

The results are somewhat surprising, as the pattern was designed to capture the domain part only, but it seems to be capturing everything till the end of the URL. This happens because each quantifier encountered in the pattern tries to match as much of the string as possible. The quantifiers are called greedy for this reason. The * in the capturing group is the first encountered quantifier. The dot character class it applies to matches any character, so the quantifier extends to the end of the string. Thus the capture group captures everything. But wait, you say, there’s the /?.* part at the end. Well, yes, and it matches what’s left of the string (nothing, i.e. the empty string) perfectly. So the entire pattern matches just fine.

Well, greedy is not the only flavor of quantifiers. Each quantifier has a reluctant version that matches the least possible number of characters. The greedy versions of the quantifiers are converted to reluctant versions by appending a ? to them.

The following table gives the notations for all quantifiers.

Greedy Reluctant Matches
{n} {n}? exactly n appearances
{n,m} {n,m}? between n and m appearances (inclusive)
{n,} {n,}? n or more appearances
? ?? 0 or 1 appearances
+ +? 1 or more appearances
* *? 0 or more appearances

The first form {n} is equivalent in both greedy and reluctant versions. For the others the number of matched characters may vary. Let’s revisit the example from above and change the capture group to match as little as possible, in the hope of getting the domain name captured properly.

http://(.*?)/?.*

In this case nothing (the empty string) is captured by the group. Why is that? The capture group now captures as little as possible: nothing. The (.*?) captures nothing, the /? matches nothing, and the .* matches the rest of the string. So again, this pattern does not work as intended.

So far the capture group matches too little or too much. Now here’s another idea. Let’s revert back to the greedy quantifier, but disallow the slash character in the domain name, and also require that the domain name be at least one character long.

http://([^/]+)/?.*

This pattern greedily captures one or more non slash characters after the protocol as the domain, and if finally an optional slash occurs it may be followed by any number of characters in the path. The following table gives the capture result:

Input Captured
http://foo.com/some_resource.html foo.com
http://google.com/ google.com
http://myapi.net/some/resource/path.html?foo=bar myapi.net
http://www.slashdot.org www.slashdot.org

That looks more like it. Depending on how certain you can be that the processed strings are valid URLs, the pattern may be enhanced to ensure reliability. It may make sense to check for conformance to URL syntax, whitespace etc.
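The working domain extraction translates directly to Java:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainCapture {
    public static void main(String[] args) {
        // Greedily capture non-slash characters after the protocol,
        // then allow an optional slash and an arbitrary path.
        Pattern p = Pattern.compile("http://([^/]+)/?.*");
        Matcher m = p.matcher("http://myapi.net/some/resource/path.html?foo=bar");
        if (m.matches()) {
            System.out.println(m.group(1)); // myapi.net
        }
    }
}
```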

Quantifier Performance

Both greedy and reluctant quantifiers imply some runtime overhead, because determining the longest possible or shortest possible matches is a nontrivial operation that implies running back and forth on the string adjusting the length of each quantifier’s match to determine whether the expression as a whole does match. Pathological cases of catastrophic backtracking may sometimes occur. If performance is a major concern, it’s usually best to make good use of reluctant quantifiers and maybe also have a look at a third kind of quantifiers: the possessive kind. They sound devilish, and to some extent they are, because they do not support backtracking. Nevertheless, they are a fast performing version of “greedy-like” quantifiers, which makes them a good choice for performance sensitive operations. They are explained very well here.

Back References

Sometimes it’s useful to refer to something “I saw earlier in the string”. Suppose a string value is only valid if it starts and ends with the same letter, like the words “alpha”, “radar”, “kick”, “level” and “stars” do. It is possible to capture the first letter of a string in a group and refer to that group at the end of the pattern: a back reference. Back references in a regex pattern are notated using \n syntax, where n is the number of the capture group. The numbering is left to right starting with 1. If groups are nested, they are numbered in the order their opening parenthesis is encountered. Group 0 always means the entire expression. For the above case, the following regex ensures that the word has at least 3 characters and that it starts and ends with the same letter:

([a-zA-Z]).+\1

In a more elaborate example, a string is only matched if it contains some alphanumeric sequence (think: word) more than once. The duplicate word is captured in a group for leisurely examination. Word boundaries are used to ensure that whole words are matched.

.*\b(\w+)\b.*\b\1\b.*

The pattern matches strings like: “who is who” (duplicate: who), “come and see the rising of the star” (duplicate: the) and “12 24 24 9” (duplicate: 24) but does not match “this sentence does not have a duplicate word”.
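In Java the duplicate word can be both detected and extracted from the capture group; note that the back reference \1 becomes `\\1` in the string literal:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateWord {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(".*\\b(\\w+)\\b.*\\b\\1\\b.*");
        Matcher m = p.matcher("come and see the rising of the star");
        if (m.matches()) {
            System.out.println(m.group(1)); // the
        }
        System.out.println(p.matcher("all words differ here").matches()); // false
    }
}
```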


Search and Replace with Back References

Regular expressions are often used in search and replace operations. The usual use case is to look for a sub-string that matches a pattern and replace it with something else. Say all digits in a string need to be replaced with the 0 character. Looking for [0-9] and replacing with 0 does the trick. Regular expressions are even more flexible than that, as they allow referencing capture groups from the search pattern in the replacement string. These back references effectively allow rearranging parts of the input string. The following example shows the use of back references in replacement strings.

Consider the following scenario: the input string contains an A-Z character prefix followed by an optional space followed by a 3-6 digit number. Things like A321, B86562, F 8753, L 287. The task is to convert it to another string consisting of the number, followed by a dash, followed by the character prefix.

Input Output
A321 321-A
B86562 86562-B
F 8753 8753-F
L 287 287-L

The first step to transform one string to the other is to capture each part of the string in a capture group. The search pattern looks like this:

([A-Z])\s?([0-9]{3,6})

It captures the prefix into a group, allows for an optional space character, then captures the digits into a second group. Back references in a replacement string are notated using $n syntax, where n is the number of the capture group. The replacement string for this operation should first reference the group containing the numbers, then a literal dash, then the first group containing the letter prefix. This gives the following replacement string:

$2-$1

Thus A321 is matched by the search pattern, putting A into $1 and 321 into $2. The replacement string is arranged to yield the desired result: the number comes first, then a dash, then the letter prefix. Please note that, since the $ sign carries special meaning in a replacement string, it must be escaped with a backslash (\$) in the Java implementation if it is to be interpreted as a literal character.
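In Java this search and replace is a one-liner with `String.replaceAll`:

```java
public class RearrangePrefix {
    public static void main(String[] args) {
        // Capture the letter prefix and the digits, then swap them
        // around a dash in the replacement string.
        String search = "([A-Z])\\s?([0-9]{3,6})";
        System.out.println("A321".replaceAll(search, "$2-$1"));   // 321-A
        System.out.println("F 8753".replaceAll(search, "$2-$1")); // 8753-F
    }
}
```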


Execution Flags

Many regex engines allow fine tuning of some aspects of the pattern matching process, such as case sensitivity. The Java implementation supports the following options, all of which are available from the Regex Evaluation step dialog in PDI.

Inlining Execution Flags

Some of the flags can be turned on and off in the regex pattern itself. Sometimes it’s preferable to encode the execution flags in the pattern directly, especially if you’d like to change them mid-pattern. Setting execution flags inline is also useful when there’s no way to set them up front. Consider for example the “Replace in String” PDI step, which optionally accepts a regex pattern as the search string, but does not offer a user interface for setting execution options.

Each inline flag has a letter assigned to it that can be used to switch it on or off. Here they are:

Execution Flag Flag Letter
UNIX_LINES d
CASE_INSENSITIVE i
COMMENTS x
MULTILINE m
DOTALL s
UNICODE_CASE u

The syntax to set the flags is (?flags to turn on-flags to turn off). So the expression to turn on CASE_INSENSITIVE is (?i), and to turn it off again, it’s (?-i). The following expression turns on DOTALL and UNICODE_CASE while turning off UNIX_LINES: (?su-d). This one turns UNIX_LINES on and UNICODE_CASE off: (?d-u).

The following pattern matches the greek word άλφα (alpha) in a case insensitive way, by turning on CASE_INSENSITIVE and UNICODE_CASE:

(?ui)άλφα
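A quick Java check shows the inline flags taking effect:

```java
import java.util.regex.Pattern;

public class InlineFlags {
    public static void main(String[] args) {
        // (?ui) switches on UNICODE_CASE and CASE_INSENSITIVE in-pattern.
        Pattern p = Pattern.compile("(?ui)άλφα");
        System.out.println(p.matcher("άλφα").matches()); // true
        System.out.println(p.matcher("ΆΛΦΑ").matches()); // true: Unicode-aware case folding
        // Without UNICODE_CASE, case-insensitivity only covers ASCII:
        System.out.println(Pattern.compile("(?i)άλφα").matcher("ΆΛΦΑ").matches()); // false
    }
}
```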


Non-Capturing Groups

Regex patterns may contain non-capturing groups, which behave exactly the way normal groups behave, except that they do not make their matched content available. If there’s no need to capture the content, they can be used to improve matching performance by a notch. Non-capturing groups are notated as (?:expression), so they look very similar to regular groups. Suppose the requirement is to verify that a string consists of an even number of digits. The following expression does the job using a group:

([0-9][0-9])+

Since the point of the group in the pattern is to make sure that the digits come in pairs, and the digits actually matched are not of any relevance, the group may just as well be replaced with the faster performing non-capturing group:

(?:[0-9][0-9])+

Embedded flags can conveniently be applied to a non-capturing group: Just put the flags between the question mark and the colon: (?flags to turn on-flags to turn off:expression). So if the requirement is to match any string that contains the sequence “H2O” in any combination of lower and upper case characters, the following pattern does the trick:

.*(?i:H2O).*

It is equivalent to .*(?i)H2O(?-i).*, which turns case insensitivity on and off using embedded flag syntax.
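The flagged non-capturing group behaves as described; a minimal Java sketch:

```java
import java.util.regex.Pattern;

public class H2ODemo {
    public static void main(String[] args) {
        // Case-insensitivity applies only inside the (?i:...) group.
        Pattern p = Pattern.compile(".*(?i:H2O).*");
        System.out.println(p.matcher("some h2o sample").matches()); // true
        System.out.println(p.matcher("no water here").matches());   // false
    }
}
```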

There is also a fast-performing version of a non-capturing group, that does not support backtracking. It is called the “independent non-capturing group” or “atomic group”. Its notation is (?>expression) and it is essentially a fail-fast, fast-forward version of a non-capturing group, that can be used to optimize pattern matching for speed. Its behavior is similar to possessive quantifiers, and it’s explained here very well.

Look, but don’t touch: Lookahead and Lookbehind

When dealing with partial string matches, it is sometimes useful to make sure that a string has a certain structure, without going all the way and actually matching it. These constructs are called “lookahead” and “lookbehind”. They are particularly useful when using regular expressions to split strings using a pattern as a delimiter. In these cases the string is scanned left to right for a match of the pattern. Once a part of the string matches, the content before it is split off as an individual item and the actual matching part (the delimiter) is discarded. The process is repeated until the end of the string, and the resulting items are collected in an array.

To give a simple example of splitting with a regex, consider the string “a:b:c:d:e”. If it is split using “:” as a delimiter, the result is an array of one-character strings: ["a","b","c","d","e"].

Consider another string “time_a:9:23:time_b:10:11”. Splitting this on “:” would give ["time_a","9","23","time_b","10","11"]. The desired result however, is to extract each label and the encoded time separately. So the desired result is ["time_a","9:23","time_b","10:11"].

A first stab might be to split on :[a-z]|[a-z]: instead. So splitting should only occur when the colon is preceded or followed by a character. This gives the following result: ["time_", "9:23","ime_","10:11"]. It’s almost right, but unfortunately the matched character is considered a delimiter, and thrown away. To avoid this, it’s possible to tell the pattern to do a “lookahead” or “lookbehind” and only match if it finds certain characters ahead or behind the current position. But when the pattern matches, it does not include the characters it merely “looked” at. Lookaheads are notated as (?=expression), lookbehinds as (?<=expression). The following pattern uses them to split the example string only if the colon is preceded or followed by a character, but it does not include the character in the match:

(?<=[a-z]):|:(?=[a-z])

The character is not part of the match now, thus it’s also not considered part of the delimiter and therefore the result is: ["time_a","9:23","time_b","10:11"].
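In Java this split can be sketched with String.split directly (the class name is my own illustration):

```java
public class SplitDemo {
    public static void main(String[] args) {
        // split on ':' only when a letter precedes or follows it;
        // the lookarounds keep the letter out of the delimiter match
        String[] parts = "time_a:9:23:time_b:10:11"
                .split("(?<=[a-z]):|:(?=[a-z])");
        System.out.println(String.join(", ", parts)); // time_a, 9:23, time_b, 10:11
    }
}
```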

Sometimes it may be useful to negate a lookahead or lookbehind expression to ensure that the current position is not preceded or followed by that expression. The notation for a negative lookahead is (?!expression), and a negative lookbehind is notated as (?<!expression). The following table gives them all:

Notation Meaning
(?=expression) positive lookahead
(?!expression) negative lookahead
(?<=expression) positive lookbehind
(?<!expression) negative lookbehind
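A small sketch of a negative lookahead (my own example, not from the post): the pattern q(?!u) matches a “q” only if it is not followed by a “u”:

```java
import java.util.regex.Pattern;

public class NegativeLookaround {
    public static void main(String[] args) {
        // q(?!u): a 'q' not followed by 'u'; the lookahead consumes nothing
        Pattern p = Pattern.compile("q(?!u)");
        System.out.println(p.matcher("Iraq").find()); // true
        System.out.println(p.matcher("quit").find()); // false
    }
}
```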


Limitations of Regular Expressions

Arriving at the end of this post, you may feel that all possible string parsing problems can be dealt with, once you get regular expressions under your belt. Well, no. Let me explain. This post introduces regular expressions as a shorthand notation for sets of strings. If you happen to have the exact regular expression for zip codes, you have a shorthand notation for the set of all strings representing valid zip codes. You can easily test an input string to check if it is an element of that set. The problem is: there are many meaningful sets of strings for which there is no regular expression. The set of valid Java programs has no regex representation, for example, so there will never be a regex that can check if a Java source is syntactically correct. This is mostly due to the inherent inability of regular expressions to deal with nested structures of arbitrary depth. XML files are a nested structure, and so is the source code of many programming languages. Palindromes, words that read the same forwards and backwards like racecar, are a very simple form of nested structure: each character opens or closes a nesting level.
Sets of strings representing nested structures are an example of non-regular sets of strings. There is no regular expression that could check if an XML file was well-formed, or confirm that a word was indeed a palindrome. Nested structures often turn out to be context-free but not regular. If you’re interested in computation theory and classes of languages (i.e. sets of strings), have a glimpse at the Chomsky Hierarchy, Formal Grammars and Formal Languages. If you’d like to immerse yourself fully, I’d recommend picking up Introduction to Automata Theory, Languages, and Computation. It covers all language classes of the Chomsky Hierarchy in detail.

Know when to reach for a different Hammer

Let me conclude with a word of caution. I often see “solutions” trying to use regular expressions not only for lexical analysis, meaning the identification and extraction of meaningful tokens from a string, but also for semantic analysis, meaning the interpretation and judgment of each token’s meaning as well. While the former is a perfectly valid use case for regular expressions, attempting the latter more often than not leads to creating another problem. Let me illustrate with an example. Suppose a string is supposed to be an IPv4 address in decimal notation with dots separating the numbers. A regular expression should validate that an input string is indeed an IPv4 address. The first attempt may look something like this:

([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})

It matches four groups of one to three digits separated by a dot. Some readers may feel that this pattern falls short. It does match 111.222.333.444 for example, which clearly is not a valid IP address. If you now feel the urge to change the regex so it tests for each group of digits that the encoded number be between 0 and 255 (with possible leading zeros) then you’re on your way to creating the second problem. Trying to do that leads away from lexical analysis (identifying four groups of digits) to semantic analysis (verifying that the groups of digits translate to admissible numbers) which yields a dramatically more complex regular expression, an example of which is found here. Nine times out of ten I’d recommend solving this problem by capturing each group of digits using a regex pattern, but then converting them to integers and validating their range in a separate logical step.
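The recommended two-step approach can be sketched like this in Java (the class and method names are my own, not from the post): the regex handles the lexical part, plain integer comparison handles the semantic part:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Ipv4Check {
    private static final Pattern IPV4 = Pattern.compile(
        "([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})");

    // lexical step: the regex captures four groups of one to three digits;
    // semantic step: plain Java verifies each group encodes a number 0..255
    static boolean isValidIpv4(String s) {
        Matcher m = IPV4.matcher(s);
        if (!m.matches()) return false;
        for (int i = 1; i <= 4; i++) {
            if (Integer.parseInt(m.group(i)) > 255) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidIpv4("192.168.0.1"));     // true
        System.out.println(isValidIpv4("111.222.333.444")); // false
    }
}
```

This keeps the pattern readable while still rejecting out-of-range octets.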

When working with regular expressions, the trade-off between complexity, maintainability, performance, precision and correctness should always be a conscious decision. After all, a regex pattern is as “write-only” as computing syntax can get. It is difficult to read regular expression patterns correctly, let alone debug and extend them. So my advice is to embrace them as a powerful string processing tool, but to neither overestimate their possibilities, nor the ability of human beings to handle them. When in doubt, consider reaching for another hammer in the box.

Downloads

The download package contains several regular expression samples, demonstrating the use of the techniques discussed in the article. The sample transformations are created using PDI 4.x.

Thanks for reading. As always, comments and corrections are very welcome. :)

Cheers

Slawo
