Regex Lookarounds
Posted on: October 25, 2023
I have long been confused by lookahead and lookbehind in regular expressions, collectively called "lookaround". In this post I write down my own understanding of the concept after reading the (?...) syntax post and the password validation post. There is also a regex cheatsheet that covers the regex assertions.
Lookaround Assertions
A lookaround does not "consume" any characters of the string: the regex engine checks whether the characters match by looking at them, but it does not move from the position where it started looking.
Lookahead
Example: \d+(?= dollars)
- Sample match: "100" in "100 dollars"
- Explanation: \d+ matches "100", then the lookahead (?= dollars) asserts that at that position in the string, what immediately follows is " dollars". The " dollars" part is checked but is not included in the match.
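To see that the lookahead really consumes nothing, here is a small sketch using Python's re module (the variable names are only for illustration). It checks what the match contains and where it ends.

```python
import re

# Lookahead: match digits only when " dollars" comes right after them.
pattern = re.compile(r"\d+(?= dollars)")

m = pattern.search("100 dollars")
matched_text = m.group()  # only the digits are part of the match
end_position = m.end()    # the match ends right before " dollars"

# When the lookahead cannot succeed, there is no match at all.
no_match = pattern.search("100 euros")

print(repr(matched_text), end_position, no_match)
```

The match ends at index 3, so " dollars" was only looked at and is still available to whatever comes next in a longer pattern.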
Confusion on password validation
Match a lowercase letter
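- [a-z]*
  *: matches zero or more occurrences.
  So, [a-z]* matches a run of zero or more lowercase letters. Applied to a whole string, it only accepts strings made up entirely of lowercase letters (including the empty string). Used as a search, it can succeed with an empty match, so on its own it does not prove that a lowercase letter is present.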
- .*[a-z]
  .: matches any character except line breaks.
  .*: matches any character (except line breaks) zero or more times.
  So, .*[a-z] matches any string that contains at least one lowercase letter; any characters (including none) can come before that letter. Because .* is greedy, the match runs up to the last lowercase letter on the line. E.g. it matches "AbCdEfGh" in full, since the last character is a lowercase letter.
- .*?[a-z]
  .*?: matches any character (except line breaks) zero or more times; the ? makes it non-greedy. Non-greedy matching means it matches as few characters as possible while still letting the rest of the pattern succeed.
  So, .*?[a-z] matches the shortest possible sequence of characters that ends with a lowercase letter. For example, in the string "abcABC" it matches just "a", stopping at the first lowercase letter, whereas the greedy .*[a-z] would match "abc".
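- [^a-z]*[a-z]
  ^: at the start of a character class, it negates the class.
  [^a-z]: matches one character that is not a lowercase letter.
  So, [^a-z]*[a-z] matches zero or more characters that are not lowercase letters, followed by the first lowercase letter. Uppercase letters, digits, special characters, or spaces can come before the lowercase letter. The match succeeds only if the string contains at least one lowercase letter.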
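The four patterns above are easier to compare side by side. The following is a minimal sketch in Python; the helper first_match and the sample string are made up for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match found by re.search, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

sample = "ABcdE1"
patterns = [r"[a-z]*", r".*[a-z]", r".*?[a-z]", r"[^a-z]*[a-z]"]

# Map each pattern to what it matches in the sample string.
results = {p: first_match(p, sample) for p in patterns}

# The same patterns on a string with no lowercase letter at all.
results_no_lower = {p: first_match(p, "ABC123") for p in patterns}

for p in patterns:
    print(p, repr(results[p]), repr(results_no_lower[p]))
```

On "ABcdE1", [a-z]* succeeds with an empty match at the very start, the greedy .*[a-z] runs to the last lowercase letter ("ABcd"), and both .*?[a-z] and [^a-z]*[a-z] stop at the first one ("ABc"). On "ABC123", only [a-z]* still "matches" (with an empty string), which is why it is not useful for this check.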
Therefore, the lookahead (?=[^a-z]*[a-z]) asserts: at this position in the string (i.e., the beginning of the string), we can match zero or more characters that are not lowercase letters, and then one lowercase letter, [a-z].
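As a quick check of that reading, this sketch wraps the pattern in a lookahead and tries it at the beginning of two sample strings (both strings are invented for the example). Python's re.match always starts at the beginning of the string.

```python
import re

# The lookahead alone: "somewhere ahead there is a lowercase letter".
has_lower = re.compile(r"(?=[^a-z]*[a-z])")

m = has_lower.match("ABC1d")
span = m.span()  # start and end of the match

no_lower = has_lower.match("ABC123")  # no lowercase letter anywhere

print(span, no_lower)
```

The span is (0, 0): the assertion succeeded, but the match is zero characters wide because nothing was consumed.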
.*
Assertion
Lookarounds allow us to look ahead of or behind the current position in the string (in this case, the beginning of the string) without moving. Because the position does not change once a lookaround has been checked, every lookahead in a chain starts checking from the same spot.
In the regex \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*, the ending .* gobbles up the string after validation. Without it, the match would be empty, because the lookaheads themselves consume nothing.
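Here is a sketch of that password regex in Python. One adaptation is an assumption on my part: I write \Z instead of \z, because in Python's re module \Z is the anchor that means "only at the very end of the string" in every version. The names PASSWORD_RE and is_valid are just for this example.

```python
import re

# Four conditions, each checked by a lookahead from the start of the string:
#   (?=\w{6,10}\Z)          6 to 10 word characters in total
#   (?=[^a-z]*[a-z])        at least one lowercase letter
#   (?=(?:[^A-Z]*[A-Z]){3}) at least three uppercase letters
#   (?=\D*\d)               at least one digit
# The trailing .* then consumes the validated string.
PASSWORD_RE = re.compile(
    r"\A(?=\w{6,10}\Z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*"
)

def is_valid(password):
    """Return True if the password satisfies all four conditions."""
    return PASSWORD_RE.match(password) is not None

for candidate in ["ABCde1", "ABcde1", "ABCdef", "ABCd1", "ABCde1!"]:
    print(candidate, is_valid(candidate))
```

Only "ABCde1" passes. The others fail one condition each: too few uppercase letters, no digit, too short, and a character that \w does not accept.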
Rules
- The order of the lookaheads doesn't matter on a logical level, as lookaheads don't change our position.
- For n conditions, use n-1 lookaheads: one condition can be checked by the part of the pattern that actually consumes the string. Often you can also combine several conditions into a single lookahead.
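Both rules can be tried out with a small sketch. It compares three variants of the password regex in Python (again written with \Z, which is my adaptation for Python's re module): the original order, a reordered version, and a version where the length condition moves out of a lookahead and into the consuming part of the pattern. The variable names and sample strings are made up for the example.

```python
import re

# Four conditions as four lookaheads, in the original order.
four_lookaheads = re.compile(
    r"\A(?=\w{6,10}\Z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d).*"
)
# Same lookaheads in reverse order.
reordered = re.compile(
    r"\A(?=\D*\d)(?=(?:[^A-Z]*[A-Z]){3})(?=[^a-z]*[a-z])(?=\w{6,10}\Z).*"
)
# n-1 lookaheads: the length check is done by the consuming \w{6,10}\Z.
three_lookaheads = re.compile(
    r"\A(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3})(?=\D*\d)\w{6,10}\Z"
)

variants = (four_lookaheads, reordered, three_lookaheads)
samples = ["ABCde1", "ABcde1", "ABCdef", "ABCd1", "ABCde1!", "abcDEF123"]

# One row of True/False verdicts per sample, one column per variant.
verdicts = {s: [bool(p.match(s)) for p in variants] for s in samples}
all_agree = all(len(set(row)) == 1 for row in verdicts.values())

print(verdicts)
print(all_agree)
```

On these samples the three variants give the same verdicts. The order can still affect how quickly a bad password is rejected, but not the result.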