Regex Concepts

Regex: Regular Expressions

  • Used to search, validate, extract, and replace text based on patterns.
  • An often underestimated tool for data cleaning and feature engineering.
  • Available in ordinary text editors, Excel, SQL, and most programming languages.
  • The lab uses the online tool Regex Tester.
  • The syntax can look cryptic, but it is powerful once mastered.
  • Use cases: validating emails, extracting information from logs, cleaning messy text data.
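The four operations listed above (search, validate, extract, replace) can be sketched in a few lines. This is a minimal illustration that assumes Python's built-in re module rather than the Regex Tester used in the lab; the log line and the simplified email pattern are made up for the example and are not a production-grade validator.

```python
import re

# Hypothetical log line, made up for this illustration.
log_line = "2026-06-15 ERROR user=alice@example.com   msg=disk   full"

# Search: is the word ERROR present anywhere in the line?
has_error = re.search(r"\bERROR\b", log_line) is not None

# Validate: a deliberately simplified email check, not a full validator.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
is_valid = email_pattern.fullmatch("alice@example.com") is not None
is_invalid = email_pattern.fullmatch("alice@example") is None

# Extract: pull the ISO-style date out of the line.
date = re.search(r"\d{4}-\d{2}-\d{2}", log_line).group()

# Replace: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", log_line)

print(has_error, is_valid, is_invalid, date)
print(cleaned)
```

Each building block used here (\b, \d, \w, \s, character classes, groups, quantifiers) is covered in the sections that follow.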

Literals

  • As in a normal text search, regex can match exact characters.
  • Example: cat matches "cat" in "The cat is on the roof."
  • Some characters have a special meaning in regex instead of matching themselves.
  • To match such a character literally, escape it with a backslash \.
  • Special characters: ., ^, $, *, +, ?, {, }, [, ], |, (, ), \.
Example of Regex Match with Literals:
Text                                 Desired Match   Regex Pattern
The cat is on the roof.              cat             cat
The price is $100.                   $100            \$100
I have 10+ years of experience.      10+             10\+
Today is 2026.06.15 or 2026/06/15    2026.06.15      2026\.06\.15
Examples of regex patterns for matching specific text
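To see why escaping matters, the literal examples can be tried with and without the backslash. This sketch assumes Python's re module; re.escape is a standard-library helper that adds the backslashes automatically when the literal text comes from data.

```python
import re

text = "Today is 2026.06.15 or 2026/06/15"

# Unescaped "." matches any character, so both dates are found.
loose = re.findall(r"2026.06.15", text)      # ['2026.06.15', '2026/06/15']

# Escaped "\." matches only a literal dot.
strict = re.findall(r"2026\.06\.15", text)   # ['2026.06.15']

# An unescaped "$" is an end-of-string anchor, so "$100" can never match.
unescaped_dollar = re.search(r"$100", "The price is $100.")   # None
escaped_dollar = re.search(r"\$100", "The price is $100.")    # matches "$100"

# re.escape builds the escaped pattern for us.
plus_pattern = re.escape("10+")              # the string 10\+
plus_match = re.search(plus_pattern, "I have 10+ years of experience.")

print(loose, strict, unescaped_dollar, escaped_dollar.group(), plus_match.group())
```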

Special Characters

Common Regex Special Characters and Their Meanings:
Character   Meaning
\n          Matches a newline character.
\t          Matches a tab character.
\d          Matches any digit character (equivalent to [0-9] for ASCII text).
\D          Matches any non-digit character (equivalent to [^0-9] for ASCII text).
\w          Matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text).
\W          Matches any non-word character (equivalent to [^a-zA-Z0-9_] for ASCII text).
\s          Matches any whitespace character (spaces, tabs, newlines, etc.).
\S          Matches any non-whitespace character.
\b          Matches a word boundary.
\B          Matches a position that is not a word boundary.
.           Matches any single character except newline.
\           Escapes special characters to match them literally.
^           Matches the start of a string (or of each line in multiline mode).
$           Matches the end of a string (or of each line in multiline mode).
Common regex special characters and their meanings
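A short sketch of the shorthand classes in action, assuming Python's re module and a made-up sample string. One detail specific to Python 3: \w on text strings is Unicode-aware by default, and the re.ASCII flag restricts it to [a-zA-Z0-9_]; other regex engines may behave differently.

```python
import re

# Hypothetical sample text containing a tab, an underscore, and a date.
text = "Order 42 shipped\tto Ram_K on 2026-06-15"

digits = re.findall(r"\d+", text)   # ['42', '2026', '06', '15']
words = re.findall(r"\w+", text)    # "_" counts as a word character, "-" does not
fields = re.split(r"\s+", text)     # splits on the spaces and the tab alike

# \b keeps "to" from matching inside "stop", "tomorrow", or "town".
whole_word = re.findall(r"\bto\b", "to stop tomorrow, go to town")   # ['to', 'to']

# In Python 3, \w on str patterns is Unicode-aware unless re.ASCII is given.
unicode_words = re.findall(r"\w+", "naïve")           # ['naïve']
ascii_words = re.findall(r"\w+", "naïve", re.ASCII)   # ['na', 've']

print(digits, words, fields, whole_word, unicode_words, ascii_words)
```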

Character Class

  • Lets us define a set of characters, any one of which can match.
  • Syntax: [abc] matches any one of the characters a, b, or c.
  • Ranges can be defined with a hyphen: [a-z] matches any lowercase letter.
  • Negation is done with a leading ^: [^0-9] matches any non-digit character.
  • To match a literal - or ^ inside a class, escape it or place it where it has no special meaning (- first or last, ^ anywhere but first).
Examples of Character Classes in Regex:
Pattern       Description                      Example Match
[aeiou]       Matches any lowercase vowel.     Matches "a" in "cat", "e" in "bed".
[A-Z]         Matches any uppercase letter.    Matches "H" in "Hello", "W" in "World".
[0-9] or \d   Matches any digit character.     Matches "1" in "1Rs", "5" in "5do".
Examples of character classes in regex and their descriptions
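The character classes above, including the two ways of matching a literal - or ^, can be checked with a small sketch. It assumes Python's re module and reuses the sample strings from the table.

```python
import re

vowels = re.findall(r"[aeiou]", "cat bed")        # ['a', 'e']
capitals = re.findall(r"[A-Z]", "Hello World")    # ['H', 'W']
digits = re.findall(r"[0-9]", "1Rs 5do")          # ['1', '5']
non_digits = re.findall(r"[^0-9]+", "1Rs 5do")    # ['Rs ', 'do']

# Literal "-" and "^" inside a class: position them or escape them.
by_position = re.findall(r"[-^]", "a-b^c")        # ['-', '^']
by_escape = re.findall(r"[\^\-]", "a-b^c")        # ['-', '^']

print(vowels, capitals, digits, non_digits, by_position, by_escape)
```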

Group

  • Used to combine multiple characters or patterns into a single unit.
  • Helpful for a divide-and-conquer approach to complex patterns.
  • Capturing Groups
    1. Written as (...).
    2. Text matched inside the parentheses can be referenced later.
    3. \1, \2, etc. refer to the first, second, etc. capturing group.
    4. Example:
       - \d{4}([-.\/]?)\d{2}\1\d{2} # Matches a date that uses the same separator in both positions.
       - https?://(www\.)?(\w+)(\.\w+) # Second group (\2) captures the domain name.
  • Non-Capturing Groups
    1. Written as (?:...).
    2. Groups the pattern but does not capture it for back-referencing.
    3. Example:
       - (?:Mr|Mrs|Ms) [a-zA-Z]+ # Matches a title and name without capturing the title.
       - https?://(?:www\.)?(\w+)(?:\.\w+) # Captures the domain as \1; the rest is non-capturing.
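The difference between capturing and non-capturing groups shows up in the group numbers. The sketch below assumes Python's re module and a hypothetical URL; the patterns are written with the \d and \w shorthands and escaped dots.

```python
import re

# Backreference \1 forces the second separator to equal the first.
date_pattern = re.compile(r"\d{4}([-./]?)\d{2}\1\d{2}")
same_sep = date_pattern.fullmatch("2026-06-15")     # match, group 1 is "-"
mixed_sep = date_pattern.fullmatch("2026-06/15")    # None
no_sep = date_pattern.fullmatch("20260615")         # match, group 1 is ""

url = "https://www.example.com/page"   # hypothetical URL for illustration

# Capturing version: the domain is the second group.
capturing = re.search(r"https?://(www\.)?(\w+)(\.\w+)", url)
domain_via_2 = capturing.group(2)       # 'example'

# Non-capturing version: only the domain is captured, so it becomes group 1.
non_capturing = re.search(r"https?://(?:www\.)?(\w+)(?:\.\w+)", url)
domain_via_1 = non_capturing.group(1)   # 'example'

print(capturing.groups())       # ('www.', 'example', '.com')
print(non_capturing.groups())   # ('example',)
```

Non-capturing groups keep the numbering short, which makes later references such as \1 easier to follow.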

Alternation

  • Use | to match either the pattern on its left or the pattern on its right.
  • cat|dog matches:
    - "cat" in "The cat is on the roof."
    - "dog" in "The dog is barking."
  • Can be used to match several variations of a pattern without repeating the shared parts.
  • (r1|r2|r3) matches any of the patterns r1, r2, or r3.
  • Example:
    - (Mr |Mrs |Ms )[a-zA-Z] # Matches "Mr J", "Mrs S", "Ms D"
    - \b(cat|dog)\b # Matches "cat" or "dog" as whole words.
    - (Mr |Mrs |Ms )[a-zA-Z]+ # Matches "Mr John", "Mrs Smith", "Ms Davis"
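A quick way to explore alternation, assuming Python's re module and made-up sentences. Two points are worth noticing: in Python, re.findall returns the captured group rather than the whole match when the pattern contains one capturing group, and | applies to everything on either side of it unless parentheses limit its scope.

```python
import re

# Hypothetical sentence: "catalog" and "hotdog" must not count as whole words.
sentence = "The cat chased a dog past the catalog and a hotdog."
pets = re.findall(r"\b(?:cat|dog)\b", sentence)   # ['cat', 'dog']

people = "Mr John, Mrs Smith and Ms Davis"
title_pattern = r"(Mr |Mrs |Ms )[a-zA-Z]+"

# With a capturing group, findall returns the group, not the whole match.
titles_only = re.findall(title_pattern, people)
# finditer gives access to the whole match through group().
full_names = [m.group() for m in re.finditer(title_pattern, people)]

# Without parentheses, | splits the entire pattern, not just one letter.
scoped = re.findall(r"gr(?:a|e)y", "gray grey")   # ['gray', 'grey']
unscoped = re.findall(r"gra|ey", "gray grey")     # ['gra', 'ey']

print(pets, titles_only, full_names, scoped, unscoped)
```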

Quantifiers

  • Used to specify how many times the preceding element should be matched.
  • Sample text for the table: "I bought 100 apples and 50 oranges."
Common Regex Quantifiers:
Quantifier   Description                                                       Regex     Match
*            Matches 0 or more occurrences of the preceding element.           \d*       "" (empty match), 100, 50
+            Matches 1 or more occurrences of the preceding element.           \d+       100, 50
?            Matches 0 or 1 occurrence of the preceding element.               \d?       "" (empty match), 1, 0, 5
{n}          Matches exactly n occurrences of the preceding element.           \d{3}     100
{n,}         Matches n or more occurrences of the preceding element.           \d{2,}    100, 50
{m,n}        Matches between m and n occurrences of the preceding element.     \d{1,3}   100, 50
Common regex quantifiers and their descriptions
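The quantifier table can be reproduced on the sample sentence. This sketch assumes Python's re module; because * and ? are allowed to match zero characters, re.findall also returns empty strings for them, which are filtered out here to show the non-empty matches.

```python
import re

text = "I bought 100 apples and 50 oranges."

one_or_more = re.findall(r"\d+", text)       # ['100', '50']
exactly_three = re.findall(r"\d{3}", text)   # ['100']
two_or_more = re.findall(r"\d{2,}", text)    # ['100', '50']
one_to_three = re.findall(r"\d{1,3}", text)  # ['100', '50']

# * and ? may match zero characters, so findall also returns empty strings
# at positions where no digit is present. Filter them out to see the rest.
star_all = re.findall(r"\d*", text)
star_non_empty = [m for m in star_all if m]                        # ['100', '50']
optional_non_empty = [m for m in re.findall(r"\d?", text) if m]    # ['1', '0', '0', '5', '0']

print(one_or_more, exactly_three, two_or_more, one_to_three)
print(star_non_empty, optional_non_empty)
```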

Greedy vs Lazy Quantifiers

  • Greedy quantifiers (the default) match as much of the searched string as possible.
  • Lazy quantifiers match as little of the searched string as possible.
  • Adding ? after a quantifier makes it lazy.
  • Sample text for the table: <div>Content</div><div>More</div>
Examples of Greedy vs Lazy Quantifiers in Regex:
Quantifier   Description         Regex             Match Data
.*           Greedy quantifier   <div>.*</div>     <div>Content</div><div>More</div> (one match)
.*?          Lazy quantifier     <div>.*?</div>    <div>Content</div> and <div>More</div> (two matches)
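The greedy and lazy behaviour is easy to confirm on the sample text. A minimal sketch, assuming Python's re module; the third pattern adds a capturing group to pull out only the inner text.

```python
import re

html = "<div>Content</div><div>More</div>"

# Greedy: .* runs to the last </div>, giving one long match.
greedy = re.findall(r"<div>.*</div>", html)

# Lazy: .*? stops at the first </div> it can, giving two matches.
lazy = re.findall(r"<div>.*?</div>", html)

# Lazy with a capturing group returns just the text between the tags.
inner = re.findall(r"<div>(.*?)</div>", html)

print(greedy)   # ['<div>Content</div><div>More</div>']
print(lazy)     # ['<div>Content</div>', '<div>More</div>']
print(inner)    # ['Content', 'More']
```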

Lookaround

  • Asserts that a pattern is (or is not) preceded or followed by another pattern, without including that other pattern in the match.
  • Lookahead
    1. Positive Lookahead: (?=...) looks forward and asserts that the given regex matches.
    2. Negative Lookahead: (?!...) looks forward and asserts that the given regex does not match.
    3. Example:
       - \d(?=kg) matches a digit only if it is followed by "kg".
       - \d(?!kg) matches a digit only if it is not followed by "kg".
  • Lookbehind
    1. Positive Lookbehind: (?<=...) looks backward and asserts that the given regex matches.
    2. Negative Lookbehind: (?<!...) looks backward and asserts that the given regex does not match.
    3. Example:
       - (?<=\$)\d+ matches a sequence of digits only if it is preceded by a "$" sign.
       - (?<!\$)\d+ matches a sequence of digits only if it is not preceded by a "$" sign.
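The four lookaround forms can be tried on short made-up strings, assuming Python's re module. The sketch also shows a subtlety of the negative forms: the assertion is checked at each position separately, so (?<!\$)\d+ still finds the "00" inside "$100". The stricter variant at the end is one possible way to avoid that, not the only one.

```python
import re

weights = "5kg 7lb 50kg"   # hypothetical sample
followed = re.findall(r"\d(?=kg)", weights)       # ['5', '0']
not_followed = re.findall(r"\d(?!kg)", weights)   # ['7', '5']

prices = "Pay $100 or 250 rupees"   # hypothetical sample
after_dollar = re.findall(r"(?<=\$)\d+", prices)          # ['100']

# The "1" is rejected, but the "00" that follows it is preceded by "1", not "$".
naive_negative = re.findall(r"(?<!\$)\d+", prices)        # ['00', '250']

# Also forbidding a preceding digit keeps the match from starting mid-number.
strict_negative = re.findall(r"(?<![$\d])\d+", prices)    # ['250']

print(followed, not_followed, after_dollar, naive_negative, strict_negative)
```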

Lookaround Examples

Examples of Lookaround in Regex:
Pattern                     Description                                                  Example Match
(?<=Mr\. )\w+               Matches a name preceded by the title "Mr. "                  "Ram" in "Mr. Ram"
\w+(?=:\s)                  Matches a word followed by a colon and a whitespace          "John" in "John: Doe"
(?<!hot)dog                 Matches "dog" not preceded by "hot"                          "dog" in "The dog ran away."
\d+(?=kg)                   Matches digits followed by "kg"                              "50" in "Weight: 50kg"
(?<=@)\w+(?=\.(com|edu))    Matches text preceded by @ and followed by .com or .edu      "university" in "mail: john.doe@university.edu"
Examples of lookaround assertions in regex and their descriptions
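The table rows, written with the \w shorthand and an escaped dot, can be run as follows. This is a sketch assuming Python's re module, whose lookbehind must have a fixed width; other engines may be more permissive. The "hotdog" sentence is a made-up variation of the table's example.

```python
import re

name = re.search(r"(?<=Mr\. )\w+", "Mr. Ram").group()         # 'Ram'
key = re.search(r"\w+(?=:\s)", "John: Doe").group()           # 'John'
dogs = re.findall(r"(?<!hot)dog", "The dog ate a hotdog.")    # ['dog']
weight = re.search(r"\d+(?=kg)", "Weight: 50kg").group()      # '50'
host = re.search(r"(?<=@)\w+(?=\.(?:com|edu))",
                 "mail: john.doe@university.edu").group()     # 'university'

# Python's re rejects variable-width lookbehind such as an optional "s".
try:
    re.compile(r"(?<=Mrs?\. )\w+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(name, key, dogs, weight, host, variable_width_ok)
```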

Anchors

  • Used to tie a pattern to the start or end of a string.
  • ^: Matches the start of a string.
  • $: Matches the end of a string.
  • Sample text: "The cat is on the roof."
Examples of Anchors in Regex:
Pattern      Description                                                                     Example Match
^the         Matches "the" at the start of the string (case-sensitive, so not "The").        the cat is on the roof.
roof\.$      Matches "roof." at the end of the string.                                       The cat is on the roof.
^The.*\.$    Matches the entire string if it starts with "The" and ends with a period.       The cat is on the roof.
Examples of anchors in regex and their descriptions
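The anchor examples can be checked against the sample text. The sketch assumes Python's re module; re.IGNORECASE and re.MULTILINE are standard flags, and the two-line string is made up to show how MULTILINE changes what ^ means.

```python
import re

text = "The cat is on the roof."

case_sensitive = re.search(r"^the", text)                # None: "T" is uppercase
ignore_case = re.search(r"^the", text, re.IGNORECASE)    # matches "The"
ends_with = re.search(r"roof\.$", text)                  # matches "roof."
whole = re.search(r"^The.*\.$", text)                    # matches the entire string

# By default ^ and $ refer to the whole string; re.MULTILINE makes them
# match at the start and end of every line as well.
lines = "The cat\nthe dog"
first_only = re.findall(r"^the", lines, re.IGNORECASE)                  # ['The']
every_line = re.findall(r"^the", lines, re.IGNORECASE | re.MULTILINE)   # ['The', 'the']

print(case_sensitive, ignore_case.group(), ends_with.group(), whole.group())
print(first_only, every_line)
```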