Regular Expressions Explained
A regular expression is a compact pattern that describes a set of strings, used to search, match, validate, and replace text. Rather than checking characters by hand, you write a pattern once and let the regex engine do the matching — `\d{3}-\d{4}` finds a phone-number shape anywhere in a string. The power comes from a handful of special characters that mean “any digit,” “one or more,” “start of line,” and so on, combined into expressive patterns. This guide walks through literals and metacharacters, character classes, quantifiers, anchors, capturing groups, and the classic greedy-matching pitfall that trips up nearly everyone.
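As a first taste, the phone-number pattern mentioned above can be tried in Python's standard `re` module. This is only a minimal sketch; the sample sentence is invented for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Call 555-1234 or 555-9876 after noon."

# \d{3}-\d{4}: three digits, a hyphen, then four digits.
phone_shape = re.compile(r"\d{3}-\d{4}")

# findall returns every non-overlapping match, left to right.
matches = phone_shape.findall(text)
print(matches)
```

The raw-string prefix `r"..."` keeps Python from interpreting the backslashes before the regex engine sees them.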
1. Literals and metacharacters
Most characters in a pattern are literals that match themselves, so the regex `cat` matches the text `cat` exactly. Metacharacters instead carry special meaning; the common ones are `. ^ $ * + ? ( ) [ ] { } | \`. For instance, `.` matches any single character except a newline by default, so `c.t` matches `cat`, `cot`, or `c9t`. To match a metacharacter literally, escape it with a backslash: `\.` matches a real full stop, and `3\.14` matches the text `3.14` but not `3x14`.
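The difference between a bare dot and an escaped one is easy to check. The following sketch uses Python's `re` module; the candidate strings are made up for the example.

```python
import re

# '.' matches any single character except a newline (default mode).
candidates = ["cat", "cot", "c9t", "ct", "c\nt"]
dot_hits = [w for w in candidates if re.fullmatch(r"c.t", w)]
print(dot_hits)  # ['cat', 'cot', 'c9t']

# An unescaped dot is a wildcard, so it accepts more than intended.
loose = re.fullmatch(r"3.14", "3x14") is not None

# An escaped dot matches only a literal full stop.
strict = re.fullmatch(r"3\.14", "3x14") is not None
exact = re.fullmatch(r"3\.14", "3.14") is not None
print(loose, strict, exact)  # True False True

# re.escape builds a pattern that treats every character literally.
literal = re.fullmatch(re.escape("3.14"), "3.14") is not None
literal_rejects = re.fullmatch(re.escape("3.14"), "3x14") is None
```

When the text to match comes from user input, `re.escape` is usually safer than escaping metacharacters by hand.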
2. Character classes
Square brackets define a character class — a set of characters, any one of which may match at that position. So `[aeiou]` matches a single vowel and `[0-9]` matches one digit via a range. A leading `^` inside the brackets negates the class, so `[^0-9]` matches any character that is not a digit. Common classes have shorthands: `\d` is a digit, `\w` is a word character (letters, digits, underscore), `\s` is whitespace, and their uppercase forms `\D`, `\W`, `\S` mean the negation of each.
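The classes and shorthands above can be compared side by side. This is an illustrative sketch in Python's `re` module, and the sample string is an assumption chosen to contain letters, digits, punctuation, and an underscore.

```python
import re

# Hypothetical sample string for illustration.
sample = "Unit 7, bay_3"

# Explicit set: lowercase vowels only (the capital U is not in the set).
vowels = re.findall(r"[aeiou]", sample)      # ['i', 'a']

# Range: one digit per match.
digits = re.findall(r"[0-9]", sample)        # ['7', '3']

# Negated class: delete everything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", sample)  # '73'

# \w covers letters, digits, and underscore, so 'bay_3' stays whole.
words = re.findall(r"\w+", sample)           # ['Unit', '7', 'bay_3']

# \W is the negation of \w: spaces and the comma here.
non_word = re.findall(r"\W", sample)         # [' ', ',', ' ']

# \S+ splits on whitespace only, so the comma stays attached.
chunks = re.findall(r"\S+", sample)          # ['Unit', '7,', 'bay_3']
```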
3. Quantifiers
Quantifiers say how many times the preceding element may repeat. `*` means zero or more, `+` means one or more, and `?` means zero or one (optional). For precise counts, `{n}` matches exactly n times, `{n,}` n or more, and `{n,m}` between n and m. So `\d{3}` matches exactly three digits, `colou?r` matches both `color` and `colour`, and `\w+` matches a run of one or more word characters.
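Each quantifier can be exercised with a short check. The sketch below uses Python's `re.fullmatch` so that the whole string must fit the pattern; the test strings are invented for illustration.

```python
import re

# '?' makes the preceding 'u' optional (zero or one).
spellings = [w for w in ["color", "colour", "colouur"]
             if re.fullmatch(r"colou?r", w)]
print(spellings)  # ['color', 'colour']

# {n}: exactly three digits, no more and no fewer.
three_ok = re.fullmatch(r"\d{3}", "123") is not None
four_rejected = re.fullmatch(r"\d{3}", "1234") is None

# {n,m}: between two and four digits.
ranged = [s for s in ["1", "12", "1234", "12345"]
          if re.fullmatch(r"\d{2,4}", s)]
print(ranged)  # ['12', '1234']

# '*' allows zero repetitions, '+' requires at least one.
star_ok = re.fullmatch(r"ab*", "a") is not None
plus_rejected = re.fullmatch(r"ab+", "a") is None

# \w+ picks out runs of word characters.
runs = re.findall(r"\w+", "one two_3 four")
print(runs)  # ['one', 'two_3', 'four']
```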
4. Anchors and boundaries
Anchors do not match characters; they match positions. `^` asserts the start of the string (or of each line in multiline mode) and `$` asserts the end, so `^abc$` matches a string that is exactly `abc` and nothing more. Some engines also let `$` match just before a trailing newline, so check your engine's rules when that matters. The word boundary `\b` matches the position between a word character and a non-word character, or between a word character and the start or end of the string, so `\bcat\b` matches `cat` as a whole word but not the `cat` inside `category`. Anchors are essential for validation, where you usually want the whole input to match rather than just a fragment of it.
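The effect of anchors shows up most clearly when the same pattern is run with and without them. The following is a sketch using Python's `re` module, with sample strings chosen for illustration; other engines may treat `$` and multiline mode slightly differently.

```python
import re

# Without anchors, search is satisfied by a fragment.
fragment = re.search(r"abc", "xxabcxx") is not None      # True
# With anchors, the whole string must be exactly 'abc'.
anchored = re.search(r"^abc$", "xxabcxx") is not None    # False
whole = re.search(r"^abc$", "abc") is not None           # True

# In Python, '$' also matches just before a trailing newline,
# while re.fullmatch insists on the entire string.
dollar_newline = re.search(r"^abc$", "abc\n") is not None  # True
full_newline = re.fullmatch(r"abc", "abc\n") is not None   # False

# \b keeps 'cat' from matching inside 'category' or 'concat'.
whole_words = re.findall(r"\bcat\b", "cat category concat cat.")
print(whole_words)  # ['cat', 'cat']

# '^' matches only at the start of the string unless MULTILINE is set.
lines = "alpha\nbeta\ngamma"
first_only = re.findall(r"^\w+", lines)
each_line = re.findall(r"^\w+", lines, flags=re.MULTILINE)
print(first_only, each_line)  # ['alpha'] ['alpha', 'beta', 'gamma']
```

For validation in Python, `re.fullmatch` is often the simplest choice because it sidesteps the trailing-newline behaviour of `$`.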
5. Groups and capture
Parentheses group part of a pattern so a quantifier or alternation can apply to it as a unit, and they also capture the matched text for later use. So `(ab)+` matches one or more repetitions of `ab`, and `(jpe?g|png|gif)` uses the `|` alternation to match `jpg`, `jpeg`, `png`, or `gif`. Captured groups are numbered from 1 in the order of their opening parentheses and can be referenced as `$1` or `\1` in a replacement (depending on the engine), or as a backreference `\1` within the pattern itself. When you want grouping without capturing, use a non-capturing group `(?:...)`. Many engines also support named groups like `(?<year>\d{4})`; the exact syntax varies, and Python's `re` module writes it as `(?P<year>\d{4})`.
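Grouping, alternation, backreferences, and named groups can all be seen in one short sketch. It uses Python's `re` module, where replacements use `\1` and named groups use the `(?P<name>...)` spelling; the file names and sample strings are hypothetical.

```python
import re

# A quantifier applied to a group repeats the whole unit.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None  # True

# Alternation inside a group: any of the listed extensions.
ext = re.compile(r"\.(jpe?g|png|gif)$")
names = ["a.jpg", "b.jpeg", "c.png", "d.gif", "e.bmp"]
images = [n for n in names if ext.search(n)]
print(images)  # ['a.jpg', 'b.jpeg', 'c.png', 'd.gif']

# Numbered groups in a replacement: swap year and month.
swapped = re.sub(r"(\d{4})-(\d{2})", r"\2/\1", "2024-05")
print(swapped)  # 05/2024

# Backreference inside the pattern: find doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is the the end")
print(doubled)  # ['is', 'the']

# Non-capturing group: findall returns whole matches, not the group.
capturing = re.findall(r"(ab)+c", "ababc abc")        # ['ab', 'ab']
non_capturing = re.findall(r"(?:ab)+c", "ababc abc")  # ['ababc', 'abc']

# Named groups (Python syntax) read more clearly than numbers.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
year, month = m.group("year"), m.group("month")
```

Note how `re.findall` changes what it returns once a capturing group is present, which is one practical reason to reach for `(?:...)`.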
6. Greedy matching and other pitfalls
By default quantifiers are greedy: they match as much as possible and give characters back only if the rest of the pattern needs them. So against `<a><b>` the pattern `<.*>` matches the entire string, not just `<a>`, because `.*` first consumes everything to the end and then backtracks only as far as the last `>`. Adding `?` after a quantifier makes it lazy, so `<.*?>` stops at the first `>` it can and matches just `<a>`. Other common traps include forgetting to escape literal dots, letting `.*` run across line breaks when dot-all mode is enabled, and writing nested quantifiers such as `(a+)+` that cause catastrophic backtracking on certain inputs. Keep patterns specific and test them against realistic data.
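The greedy and lazy behaviour described above can be reproduced directly. This sketch uses Python's `re` module; the negated-class variant `<[^>]*>` is an additional alternative shown for comparison, not something the text above prescribes.

```python
import re

text = "<a><b>"

# Greedy: .* runs to the end, then backtracks to the last '>'.
greedy = re.findall(r"<.*>", text)
print(greedy)  # ['<a><b>']

# Lazy: .*? expands one character at a time and stops at the first '>'.
lazy = re.findall(r"<.*?>", text)
print(lazy)  # ['<a>', '<b>']

# A negated class says what is meant and needs no backtracking here.
specific = re.findall(r"<[^>]*>", text)
print(specific)  # ['<a>', '<b>']

# By default '.' stops at a newline; DOTALL lets .* cross lines.
two_lines = "<a>\n<b>"
per_line = re.findall(r"<.*>", two_lines)
across = re.findall(r"<.*>", two_lines, flags=re.DOTALL)
print(per_line, across)  # ['<a>', '<b>'] ['<a>\n<b>']
```

Being specific about which characters may repeat, as in `[^>]*`, tends to be both faster and easier to reason about than relying on lazy quantifiers.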