Regex Concepts
Regex: Regular Expressions
- Used to search, validate, extract, and replace text based on specific patterns.
- One of the most underestimated tools for data cleaning and feature engineering.
- Available even in ordinary text editors, Excel, SQL, and programming languages.
- We will do the lab in the online tool Regex Tester.
- The syntax can be cryptic, but it is powerful once mastered.
- Use cases: validating emails, extracting info from logs, cleaning messy text data.
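
The four uses listed above (search, validate, extract, replace) can be sketched with Python's standard `re` module. The log line, the `looks_like_email` helper, and the simplified email pattern are illustrative assumptions, not part of the lab material; real email validation needs a stricter approach.

```python
import re

# Hypothetical log line with messy spacing, used only for illustration.
log_line = "2026-06-15 ERROR user=alice ip=10.0.0.7  msg=disk   full"

# Search: is there an ERROR marker anywhere in the line?
has_error = re.search(r"ERROR", log_line) is not None

# Validate: does the WHOLE string look like a simple email address?
# Deliberately simplified pattern, not a complete email check.
def looks_like_email(text: str) -> bool:
    return re.fullmatch(r"[\w.+-]+@[\w-]+\.[\w.]+", text) is not None

# Extract: pull the user name out of the log line via a capturing group.
user = re.search(r"user=(\w+)", log_line).group(1)

# Replace: collapse every run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", log_line)

print(has_error, user)
print(cleaned)
```

`re.fullmatch` is used for validation because the entire input must fit the pattern, while `re.search` is enough when the pattern may appear anywhere in the text.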
Literals
- Like a normal text search, regex can match exact characters.
- Example: `cat` matches "cat" in "The cat is on the roof."
- Some characters have special meanings in regex.
- To match a special character literally, escape it with a backslash `\`.
- Some special characters: `.`, `^`, `$`, `*`, `+`, `?`, `{`, `}`, `[`, `]`, `|`, `(`, `)`, `\`.
Example of Regex Match with Literals:
| Text | Desired Match | Regex Pattern |
|---|---|---|
| The cat is on the roof. | cat | cat |
| The price is $100. | $100 | \$100 |
| I have 10+ years of experience. | 10+ | 10\+ |
| Today is 2026.06.15 or 2026/06/15 | 2026.06.15 | 2026\.06\.15 |
Examples of regex patterns for matching specific text
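
A short sketch of the table rows in Python, to show what escaping changes in practice. The variable names are illustrative; `re.escape` is a standard-library helper that escapes special characters for you.

```python
import re

text = "The price is $100. I have 10+ years of experience."

# Escaped by hand: \$ and \+ are matched as literal characters.
price = re.search(r"\$100", text).group()   # '$100'
years = re.search(r"10\+", text).group()    # '10+'

dates = "2026.06.15 or 2026/06/15"

# An unescaped dot matches any character, so it also accepts the slashes.
loose = re.findall(r"2026.06.15", dates)      # ['2026.06.15', '2026/06/15']

# An escaped dot matches only a literal ".".
strict = re.findall(r"2026\.06\.15", dates)   # ['2026.06.15']

# re.escape builds the escaped pattern from a plain string.
escaped = re.escape("$100")                   # the pattern \$100
auto = re.search(escaped, text).group()       # '$100'

print(price, years, loose, strict, auto)
```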
Special Characters
Common Regex Special Characters and Their Meanings:
| Character | Meaning |
|---|---|
| \n | Matches a newline character. |
| \t | Matches a tab character. |
| \d | Matches any digit character (equivalent to [0-9]). |
| \D | Matches any non-digit character (equivalent to [^0-9]). |
| \w | Matches any word character (equivalent to [a-zA-Z0-9_]). |
| \W | Matches any non-word character (equivalent to [^a-zA-Z0-9_]). |
| \s | Matches any whitespace character (spaces, tabs, etc.). |
| \S | Matches any non-whitespace character. |
| \b | Matches a word boundary. |
| \B | Matches a non-word boundary. |
| . | Matches any single character except newline. |
| \ | Escapes special characters to match them literally. |
| ^ | Matches the start of a string. |
| $ | Matches the end of a string. |
Common regex special characters and their meanings
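
The shorthand classes in the table can be tried out with a small Python sketch. The sample strings are made up for illustration.

```python
import re

# Normal (non-raw) string, so \t and \n are a real tab and a real newline.
text = "Order 42:\tshipped to Kathmandu\non 2026-06-15"

digits = re.findall(r"\d+", text)   # runs of digit characters
words = re.findall(r"\w+", text)    # runs of word characters
gaps = re.findall(r"\s", text)      # each whitespace character on its own

# \b only matches where a word starts or ends, so "concat" and "catalog"
# are skipped and only the two standalone "cat" words are found.
whole_cats = re.findall(r"\bcat\b", "cat concat catalog cat.")

# "." does not cross a newline, so each line is matched separately.
per_line = re.findall(r".+", "line one\nline two")

print(digits)      # ['42', '2026', '06', '15']
print(whole_cats)  # ['cat', 'cat']
print(per_line)    # ['line one', 'line two']
```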
Character Class
- Allows us to define a set of characters to match.
- Syntax: `[abc]` matches any one of the characters a, b, or c.
- Ranges can be defined with a hyphen: `[a-z]` matches any lowercase letter.
- Negation is done with a leading `^`: `[^0-9]` matches any non-digit character.
- To match a literal `-` or `^`, escape it or place it where it has no special meaning (`-` first or last, `^` anywhere but first).
Examples of Character Classes in Regex:
| Pattern | Description | Example Match |
|---|---|---|
| [aeiou] | Matches any lowercase vowel. | Matches "a" in "cat", "e" in "bed". |
| [A-Z] | Matches any uppercase letter. | Matches "H" in "Hello", "W" in "World". |
| [0-9] or \d | Matches any digit character. | Matches "1" in "1Rs", "5" in "5do". |
Examples of character classes in regex and their descriptions
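
A minimal Python sketch of sets, ranges, negation, and literal `-` / `^` inside a class. The sample strings are assumptions chosen to keep the results small.

```python
import re

# Set: any one lowercase vowel.
vowels = re.findall(r"[aeiou]", "bed and cat")      # ['e', 'a', 'a']

# Range: any one uppercase letter.
capitals = re.findall(r"[A-Z]", "Hello World")      # ['H', 'W']

# Negation: runs of characters that are NOT digits.
non_digits = re.findall(r"[^0-9]+", "1Rs 5do")      # ['Rs ', 'do']

# Literal ^ and - by position: ^ is not first, - is last.
by_position = re.findall(r"[+^-]", "2^3 + 5 - 1")   # ['^', '+', '-']

# The same set with ^ and - escaped instead.
by_escape = re.findall(r"[\^\-+]", "2^3 + 5 - 1")   # ['^', '+', '-']

print(vowels, capitals, non_digits, by_position, by_escape)
```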
Group
- Used to combine multiple characters or patterns together.
- Helpful for a divide-and-conquer approach to complex patterns.
Capturing Groups
- Written as `(...)`.
- Text matched inside the parentheses can be referenced later.
- `\1`, `\2`, etc. refer to the first, second, etc. capturing group.
- Example:
  - `\d{4}([-.\/]?)\d{2}\1\d{2}` # Matches a date that uses the same separator in both positions.
  - `https?://(www\.)?(\w+)(\.\w+)` # Second group (`\2`) captures the domain name.
Non-Capturing Groups
- Written as `(?:...)`.
- Groups the pattern but does not capture it for back-referencing.
- Example:
  - `(?:Mr|Mrs|Ms) [a-zA-Z]+` # Matches a title and name without capturing the title.
  - `https?://(?:www\.)?(\w+)(?:\.\w+)` # Captures the domain as `\1`; the rest is non-capturing.
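
The difference between capturing and non-capturing groups is easiest to see by inspecting what a match object reports. This is a sketch in Python; the URL is a placeholder, and the character class is written `[-./]` because `/` needs no escape in Python's `re`.

```python
import re

# Backreference: \1 must repeat whatever separator group 1 captured.
date_pattern = re.compile(r"\d{4}([-./]?)\d{2}\1\d{2}")

same_sep = date_pattern.fullmatch("2026-06-15") is not None   # True
mixed_sep = date_pattern.fullmatch("2026-06/15") is not None  # False
no_sep = date_pattern.fullmatch("20260615") is not None       # True, empty separator

url = "https://www.example.com"

# Three capturing groups: the domain name is group 2.
capturing = re.search(r"https?://(www\.)?(\w+)(\.\w+)", url)
print(capturing.groups())       # ('www.', 'example', '.com')

# Only the domain is captured, so it becomes group 1.
non_capturing = re.search(r"https?://(?:www\.)?(\w+)(?:\.\w+)", url)
print(non_capturing.groups())   # ('example',)
```

Non-capturing groups keep the group numbering stable when a pattern grows, since only the parts you actually want receive a number.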
Alternation
- Use `|` to match either the pattern on its left or the pattern on its right.
- `cat|dog` matches:
  - "cat" in "The cat is on the roof."
  - "dog" in "The dog is barking."
- Can be used to match multiple variations of a pattern without repetition.
- `(r1|r2|r3)` matches any of the patterns r1, r2, or r3.
- Example:
  - `(Mr |Mrs |Ms )[a-zA-Z]` # Matches "Mr J", "Mrs S", "Ms D"
  - `\b(cat|dog)\b` # Matches "cat" or "dog" as whole words.
  - `(Mr |Mrs |Ms )[a-zA-Z]+` # Matches "Mr John", "Mrs Smith", "Ms Davis"
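
The alternation examples can be checked with a short Python sketch. The sample sentence is invented for illustration. Note one Python-specific detail: `re.findall` returns only the group contents when the pattern has a capturing group, so the first call uses a non-capturing group to get the full match.

```python
import re

text = "Mr John met Mrs Smith and Ms Davis near the dogcatcher's cat."

# Full "title + name" matches (non-capturing group, so findall returns whole matches).
titled = re.findall(r"(?:Mr |Mrs |Ms )[a-zA-Z]+", text)

# With a capturing group, findall returns just the captured title.
titles_only = re.findall(r"(Mr |Mrs |Ms )[a-zA-Z]+", text)

# Without word boundaries, "dog" and "cat" inside "dogcatcher" also match.
anywhere = re.findall(r"cat|dog", text)

# With \b on both sides, only the standalone word "cat" matches.
whole_words = re.findall(r"\b(cat|dog)\b", text)

print(titled)       # ['Mr John', 'Mrs Smith', 'Ms Davis']
print(titles_only)  # ['Mr ', 'Mrs ', 'Ms ']
print(anywhere)     # ['dog', 'cat', 'cat']
print(whole_words)  # ['cat']
```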
Quantifiers
- Used to specify how many times the previous element should be matched.
- Sample text: "I bought 100 apples and 50 oranges."
Common Regex Quantifiers:
| Quantifier | Description | Regex | Match |
|---|---|---|---|
| * | Matches 0 or more occurrences of the preceding element. | \d* | "", 100, 50 |
| + | Matches 1 or more occurrences of the preceding element. | \d+ | 100, 50 |
| ? | Matches 0 or 1 occurrence of the preceding element. | \d? | "", 1, 0, 5 |
| {n} | Matches exactly n occurrences of the preceding element. | \d{3} | 100 |
| {n,} | Matches n or more occurrences of the preceding element. | \d{2,} | 100, 50 |
| {m,n} | Matches between m and n occurrences of the preceding element. | \d{1,3} | 50, 100 |
Common regex quantifiers and their descriptions
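
The rows of the quantifier table can be reproduced in Python on the same sample text. Because `*` and `?` also succeed with zero characters, `re.findall` returns many empty strings for them; this sketch filters those out to keep the output readable.

```python
import re

text = "I bought 100 apples and 50 oranges."

one_or_more = re.findall(r"\d+", text)        # ['100', '50']
exactly_three = re.findall(r"\d{3}", text)    # ['100']
two_or_more = re.findall(r"\d{2,}", text)     # ['100', '50']
one_to_three = re.findall(r"\d{1,3}", text)   # ['100', '50']

# \d* and \d? match the empty string at every non-digit position,
# so keep only the non-empty results.
zero_or_more = [m for m in re.findall(r"\d*", text) if m]   # ['100', '50']
zero_or_one = [m for m in re.findall(r"\d?", text) if m]    # ['1', '0', '0', '5', '0']

print(one_or_more, exactly_three, two_or_more, one_to_three)
print(zero_or_more, zero_or_one)
```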
Greedy vs Lazy Quantifiers
- Greedy quantifiers (the default) match as much of the searched string as possible.
- Lazy quantifiers match as little of the searched string as possible.
- A `?` after a quantifier makes it lazy.
- Sample text: `<div>Content</div><div>More</div>`
Examples of Greedy vs Lazy Quantifiers in Regex:
| Quantifier | Description | Regex | Match Data |
|---|---|---|---|
| .* | Greedy quantifier | <div>.*</div> | <div>Content</div><div>More</div> |
| .*? | Lazy quantifier | <div>.*?</div> | <div>Content</div> and <div>More</div> |
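
The two table rows, run in Python against the same sample text. The third pattern, which adds a capturing group to pull out only the inner text, is an illustrative extension and not from the table.

```python
import re

html = "<div>Content</div><div>More</div>"

# Greedy: .* runs to the LAST </div>, giving one long match.
greedy = re.findall(r"<div>.*</div>", html)

# Lazy: .*? stops at the FIRST </div> it can, giving two matches.
lazy = re.findall(r"<div>.*?</div>", html)

# Lazy plus a capturing group returns only the text between the tags.
inner = re.findall(r"<div>(.*?)</div>", html)

print(greedy)  # ['<div>Content</div><div>More</div>']
print(lazy)    # ['<div>Content</div>', '<div>More</div>']
print(inner)   # ['Content', 'More']
```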
Lookaround
- Asserts that a pattern is (or is not) preceded or followed by another pattern, without including that other pattern in the match.
Lookahead
- Positive lookahead: `(?=...)` looks forward to assert that the provided regex matches.
- Negative lookahead: `(?!...)` looks forward to assert that the provided regex doesn't match.
- Example:
  - `\d(?=kg)` matches a digit only if it is followed by "kg".
  - `\d(?!kg)` matches a digit only if it is not followed by "kg".
Lookbehind
- Positive lookbehind: `(?<=...)` looks backward to assert that the provided regex matches.
- Negative lookbehind: `(?<!...)` looks backward to assert that the provided regex doesn't match.
- Example:
  - `(?<=\$)\d+` matches a sequence of digits only if it is preceded by a "$" sign.
  - `(?<!\$)\d+` matches a sequence of digits only if it is not preceded by a "$" sign.
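
A small Python sketch of the lookahead and lookbehind patterns above. The sample text is invented. The last pattern, `(?<![\$\d])\d+`, is a stricter variant added here as an assumption about what is usually intended: the plain negative lookbehind still matches the tail of a price, because those digits are preceded by another digit rather than by "$".

```python
import re

text = "Bag A: 50kg for $100, bag B: 75lb for $20"

# Positive lookahead: digits followed by "kg" ("kg" is not part of the match).
weights_kg = re.findall(r"\d+(?=kg)", text)        # ['50']

# Positive lookbehind: digits preceded by "$" ("$" is not part of the match).
prices = re.findall(r"(?<=\$)\d+", text)           # ['100', '20']

# Negative lookbehind checks only the single position before the match,
# so it skips the "1" in "$100" but still matches the "00" after it.
not_after_dollar = re.findall(r"(?<!\$)\d+", "$100 and 50")       # ['00', '50']

# Stricter variant: also refuse to start in the middle of a number.
whole_non_prices = re.findall(r"(?<![\$\d])\d+", "$100 and 50")   # ['50']

print(weights_kg, prices, not_after_dollar, whole_non_prices)
```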
Lookaround Examples
Examples of Lookaround in Regex:
| Pattern | Description | Example Match |
|---|---|---|
| (?<=Mr\. )\w+ | Matches a name preceded by the title "Mr. " | "Ram" in "Mr. Ram" |
| \w+(?=:\s) | Matches a word followed by a colon and whitespace | "John" in "John: Doe" |
| (?<!hot)dog | Matches "dog" not preceded by "hot" | "dog" in "The dog ran away." |
| \d+(?=kg) | Matches digits followed by "kg" | "50" in "Weight: 50kg" |
| (?<=@)\w+(?=\.(com\|edu)) | Matches text preceded by @ and followed by .com or .edu | "university" in "mail: john.doe@university.edu" |
Examples of lookaround assertions in regex and their descriptions
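
The table patterns can be run as written in Python. In this sketch the "hotdog" sentence is an added example to show the negative lookbehind rejecting a match; the other inputs come from the table.

```python
import re

# Name after a title: the lookbehind text "Mr. " is not part of the match.
name = re.search(r"(?<=Mr\. )\w+", "Mr. Ram").group()

# Word before ": " (the colon and space are not part of the match).
speaker = re.search(r"\w+(?=:\s)", "John: Doe").group()

# "dog" is matched on its own but rejected inside "hotdog".
plain_dog = re.findall(r"(?<!hot)dog", "The dog ate a hotdog.")

# Lookbehind and lookahead combined: the text between "@" and ".com"/".edu".
domain = re.search(
    r"(?<=@)\w+(?=\.(com|edu))", "mail: john.doe@university.edu"
).group()

print(name)       # Ram
print(speaker)    # John
print(plain_dog)  # ['dog']
print(domain)     # university
```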
Anchors
- Used to bind a pattern to the start or end of a string.
- `^`: Matches the start of a string.
- `$`: Matches the end of a string.
- Sample text: "The cat is on the roof."
Examples of Anchors in Regex:
| Pattern | Description | Example Match |
|---|---|---|
| ^the | Matches "the" at the start of the string. | the cat is on the roof. |
| roof\.$ | Matches "roof." at the end of the string. | The cat is on the roof. |
| ^The.*\.$ | Matches the entire string if it starts with "The" and ends with a period. | The cat is on the roof. |
Examples of anchors in regex and their descriptions
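
A sketch of the anchor patterns in Python against the sample text. Regex matching is case-sensitive by default, so `^the` does not match a string starting with "The"; the `re.IGNORECASE` flag used below is an addition for illustration and is not part of the table.

```python
import re

text = "The cat is on the roof."

# Case-sensitive: the string starts with "The", not "the", so no match.
starts_lower = re.search(r"^the", text)

# Case-insensitive variant (assumed flag, added for illustration).
starts_any_case = re.search(r"^the", text, re.IGNORECASE)

# "roof." must sit at the very end of the string.
ends_with_roof = re.search(r"roof\.$", text)

# Both anchors together: the whole string must start with "The" and end with ".".
whole = re.search(r"^The.*\.$", text)

print(starts_lower)             # None
print(starts_any_case.group())  # The
print(ends_with_roof.group())   # roof.
print(whole.group())            # The cat is on the roof.
```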
