Regular expression
A regular expression is used in computer software to define a sequence of characters.[1]
In general, any character will match itself, but there are a dozen special characters, including the escape character.
To match any of 2 or more characters, enclose them in square brackets. For example,
gr[ae]y
will match gray or grey.
A regular expression is matched from left to right and is processed one token at a time. Certain characters have special meanings within the description of a regular language:
- '*' - the previous construction is matched 0 or more times
- '+' - the previous construction is matched 1 or more times
- '?' - the previous construction is matched 0 or 1 times
- '(...)' - the contents between the parentheses is considered a single construction
- '...|...' - the construction on the left of the '|' may be matched or the construction on the right of the '|' may be matched.
- '[...]' - match one of the characters contained within the square braces.
- '.' - match any single character
- '^' - beginning of a string
- '$' - end of a string
There also exists a wide range of special character classes distinguished with a backslash (a small list):
- '\\' - a literal backslash
- '\.' - a literal period
Regular expressions have also been extended by many languages, some of which extend them to the point where they are able to match a wider range of languages than is specified by a regular language.
Examples
- /Mrs?\. Smith/ - matches 'Mr. Smith', 'Mrs. Smith'
- /^a*b*$/ - matches 'a', 'ab', 'b', 'aaaab', 'abbbbb', 'aaabbb',
- /[1-9][0-9]*(\.0|([1-9][0-9]*))+/ - matches version numbers: 1, 1.0, 2.3, 103.4, 42.5.6.8.9, 7.0.8.9
Formal definition and Limitations
A regular expression is particular instance of a non-deterministic finite state automaton. Regular expressions are a type-3 grammar in the Chomsky hierarchy of language.
A regular expression is not able to count. This is because there is a finite number of states. Consider the language that is specified by anban. Examples of this language include b, aba, aabaa, aaabaaa, etc... A regular expression - being a finite state automaton itself - has a finite number of states that it can be in. If there a point at which the state loops back on itself it is no longer able to match that language.
See also
Further reading
- ↑ "A regular expression, or regex for short, is a pattern describing a certain amount of text." Regular Expression Quick Start