regex.pcre #
regex.pcre Module Documentation
The regex.pcre module provides a Virtual Machine (VM) based regular expression engine with UTF-8 support. Unlike recursive engines, this implementation uses an explicit heap stack, making it safe for complex patterns and long strings without risking stack overflows.
It supports compilation of patterns, searching (first match or all matches), full matching, replacement of the first match with group backreferences, named groups, and iterative searching.
Supported Syntax
| Feature | Syntax | Description |
|---|---|---|
| Literals | `abc` | Matches exact characters. |
| Wildcard | `.` | Matches any character (excluding `\n` unless the `(?s)` flag is used). |
| Alternation | `\|` | Matches the left OR right expression (e.g., `cat\|dog`). |
| Quantifiers | `*` | Matches 0 or more times. |
| | `+` | Matches 1 or more times. |
| | `?` | Matches 0 or 1 time. |
| | `{m}` | Matches exactly m times. |
| | `{m,n}` | Matches between m and n times. |
| Groups | `(...)` | Capturing group. |
| | `(?:...)` | Non-capturing group. |
| | `(?P<name>...)` | Named capturing group. |
| Anchors | `^` | Matches start of string (or line start with `(?m)`). |
| | `$` | Matches end of string (or line end with `(?m)`). |
| | `\b` | Matches a word boundary (start/end of word). |
| | `\B` | Matches a non-word boundary. |
| Classes | `[abc]` | Matches any character in the set. |
| | `[^abc]` | Matches any character NOT in the set. |
| | `[a-z]` | Matches a range of characters. |
| | `\w`, `\W` | Word / Non-word character (`[a-zA-Z0-9_]`). |
| | `\d`, `\D` | Digit / Non-digit. |
| | `\s`, `\S` | Whitespace / Non-whitespace. |
| | `\a` | Lowercase character (`[a-z]`). |
| | `\A` | Uppercase character (`[A-Z]`). |
| Escapes | `\xHH` | Matches 1-byte hex value. |
| | `\XHHHH` | Matches 2-byte hex value. |
| Flags | `(?i)` | Case-insensitive matching. |
| | `(?m)` | Multiline mode (`^` and `$` match start/end of lines). |
| | `(?s)` | Dot-all mode (`.` matches `\n`). |
Structs
Regex
The compiled regular expression object containing the VM bytecode.
pub struct Regex {
pub:
pattern string
total_groups int
// Internal VM bytecode...
}
Match
Represents the result of a successful search.
pub struct Match {
pub:
text string // The full substring that matched
start int // Start index in the source text
end int // End index in the source text
groups []string // List of captured groups
}
Core Functions
compile
Compiles a regular expression pattern string into a Regex object. Returns an error if the syntax is invalid (e.g., unclosed groups).
fn compile(pattern string) !Regex
Example:
import regex.pcre
fn main() {
// Compile a pattern to match a word followed by digits
// The `or` block handles the error that pcre.compile returns for an invalid pattern
r := pcre.compile(r'\w+\d+') or { panic(err) }
}
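To show the error path as well as the success path, here is a minimal sketch of a validation wrapper around compile. The helper name try_compile is hypothetical (it is not part of regex.pcre), and the exact error message text is not specified by this documentation, so the sketch only prints whatever err contains.

```v
import regex.pcre

// try_compile is a hypothetical helper (not part of regex.pcre).
// It reports whether a pattern compiles, printing the error if it does not.
fn try_compile(pattern string) bool {
	_ := pcre.compile(pattern) or {
		eprintln('invalid pattern "${pattern}": ${err}')
		return false
	}
	return true
}

fn main() {
	println(try_compile(r'\w+\d+')) // well-formed pattern
	println(try_compile(r'(abc')) // unclosed group, documented as a syntax error
}
```

Checking patterns up front like this can be useful when they come from user input or configuration rather than from source code.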
find
Scans the text for the first occurrence of the pattern. Returns a Match object if found, or none if not.
fn (r Regex) find(text string) ?Match
Example:
r := pcre.compile(r'(\d+)')!
text := 'item 123, item 456'
if m := r.find(text) {
	println('Found: ${m.text}') // Output: Found: 123
	println('Index: ${m.start}') // Output: Index: 5
	println('Group 1: ${m.groups[0]}') // Output: Group 1: 123
}
Note: This function stops immediately after finding the leftmost match.
find_all
Returns a list of all non-overlapping matches in the string. This is useful for extracting multiple tokens.
fn (r Regex) find_all(text string) []Match
Example:
r := pcre.compile(r'\d+')!
text := '10, 20, 30'
matches := r.find_all(text)
for m in matches {
println(m.text)
}
// Output:
// 10
// 20
// 30
Note: If a pattern matches an empty string (e.g., a* on "b"), the engine automatically advances the cursor by 1 to prevent infinite loops.
find_from
Behaves like find, but starts scanning from a specific byte index. Useful for building lexers or parsing text iteratively.
fn (r Regex) find_from(text string, start_index int) ?Match
Example:
import regex.pcre
r := pcre.compile(r'test')!
text := 'test test test'
// Skip the first 5 characters
if m := r.find_from(text, 5) {
println('Found at: ${m.start}') // Output: Found at: 5
}
Note: If start_index is out of bounds (< 0 or > len), it returns none.
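The iterative use mentioned above can be sketched as a loop that restarts find_from where the previous match stopped. The helper collect_words is hypothetical, and the sketch assumes that Match.end is the byte offset just past the match (an exclusive end), which this documentation does not state explicitly.

```v
import regex.pcre

// collect_words is a hypothetical helper built only on find_from.
fn collect_words(r pcre.Regex, text string) []string {
	mut words := []string{}
	mut pos := 0
	for pos <= text.len {
		// Stop as soon as no further match exists.
		m := r.find_from(text, pos) or { break }
		words << m.text
		// Assumption: m.end is the byte offset just past the match.
		// Always step forward by at least one byte so that an empty
		// match cannot stall the loop.
		pos = if m.end > pos { m.end } else { pos + 1 }
	}
	return words
}

fn main() {
	r := pcre.compile(r'\w+') or { panic(err) }
	println(collect_words(r, 'let x = 42'))
}
```

Note that the one-byte fallback step is byte-based; for patterns that can match empty strings in multi-byte UTF-8 text, stepping by the width of the next character would be the safer choice.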
fullmatch
Checks if the entire string matches the pattern from start to end.
fn (r Regex) fullmatch(text string) ?Match
Example:
r := pcre.compile(r'\d{3}')!
println(r.fullmatch('123')) // Match
println(r.fullmatch('1234')) // none (too long)
println(r.fullmatch('a123')) // none (starts with char)
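Since fullmatch returns an option, a common way to use it for input validation is to convert it into a boolean. The following is a small sketch; the helper name is_valid is hypothetical and not part of the module.

```v
import regex.pcre

// is_valid is a hypothetical helper that turns fullmatch into a yes/no check.
fn is_valid(r pcre.Regex, input string) bool {
	// The match itself is not needed, only whether one exists.
	_ := r.fullmatch(input) or { return false }
	return true
}

fn main() {
	r := pcre.compile(r'\d{3}') or { panic(err) }
	println(is_valid(r, '123')) // the whole input is three digits
	println(is_valid(r, '1234')) // one digit too many
}
```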
replace
Finds the first occurrence of the pattern and replaces it with the replacement string.
Supported backreferences:
* $1, $2, etc. refer to captured groups.
* $0 is currently not supported.
fn (r Regex) replace(text string, repl string) string
Example:
import regex.pcre
r := pcre.compile(r'(\w+), (\w+)')!
text := 'Doe, John'
// Swap groups
result := r.replace(text, '$2 $1')
println(result) // Output: "John Doe"
Note: This function currently replaces only the first match found. To replace all occurrences, you would need to loop using replace or reconstruct the string using find_all ranges.
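One possible way to rebuild the string from find_all ranges is sketched below. The helper replace_all_literal is hypothetical, inserts the replacement text literally (it does not expand $1, $2), and assumes that Match.start and Match.end are byte offsets with end pointing just past the match.

```v
import regex.pcre
import strings

// replace_all_literal is a hypothetical helper, not part of regex.pcre.
// It replaces every non-overlapping match with repl, taken literally.
fn replace_all_literal(r pcre.Regex, text string, repl string) string {
	mut sb := strings.new_builder(text.len)
	mut last := 0
	for m in r.find_all(text) {
		// Copy the unmatched text between the previous match and this one.
		// Assumption: m.end is exclusive, so text[m.start..m.end] == m.text.
		sb.write_string(text[last..m.start])
		sb.write_string(repl)
		last = m.end
	}
	// Copy whatever follows the final match.
	sb.write_string(text[last..])
	return sb.str()
}

fn main() {
	r := pcre.compile(r'\d+') or { panic(err) }
	println(replace_all_literal(r, 'a1b22c333', '#'))
}
```

Building the result in a single pass avoids rescanning the already-replaced prefix, which a loop of repeated replace calls would do.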
group_by_name
Retrieves the captured text for a specific named group defined with (?P<name>...).
fn (r Regex) group_by_name(m Match, name string) string
Example:
import regex.pcre
r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
m := r.find('Date: 2025-01') or {pcre.Match{}}
year := r.group_by_name(m, 'year')
println(year) // Output: 2025
Advanced Usage
VM Stability (No Stack Overflow)
Because this engine uses a VM with a heap-allocated stack, it can handle patterns that typically crash recursive engines due to stack overflow.
Example:
import regex.pcre
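// A nested-quantifier pattern that can trigger catastrophic backtracking
// or very deep recursion in recursive engines.
r := pcre.compile(r'(a+)+b')!
text := 'a'.repeat(5000) // Very long string of 'a's with no trailing 'b'
// The option returned by find must be handled; here it is expected to be none,
// and the program does not crash.
if m := r.find(text) {
	println('Unexpected match: ${m.text}')
} else {
	println('No match')
}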
Using Flags
Flags can be embedded to change matching behavior locally.
Example:
import regex.pcre
// (?i) Case insensitive
r := pcre.compile(r'(?i)apple')!
println(r.find('APPLE')) // Matches
// (?m) Multiline: ^ matches start of line, $ matches end of line
r_multi := pcre.compile(r'(?m)^Log:')!
text := 'Error: 1\nLog: Something happened'
println(r_multi.find(text)) // Matches 'Log:' on the second line
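The (?s) flag from the syntax table can be demonstrated the same way. This is a minimal sketch; the helper name matches is hypothetical and simply reports whether find returned a match.

```v
import regex.pcre

// matches is a hypothetical helper: true when the pattern is found in text.
fn matches(pattern string, text string) bool {
	r := pcre.compile(pattern) or { return false }
	_ := r.find(text) or { return false }
	return true
}

fn main() {
	text := 'a\nb'
	// Without (?s), the dot is documented not to match the newline.
	println(matches(r'a.b', text))
	// With (?s), the dot is documented to match \n as well.
	println(matches(r'(?s)a.b', text))
}
```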
fn compile #
fn compile(pattern string) !Regex
compile parses a pattern string and returns a compiled Regex struct.
fn new_regex #
fn new_regex(pattern string, _ int) !Regex
new_regex is an alias for compile, for compatibility with older PCRE wrappers.
Note: The second argument (flags) is currently ignored as flags should be embedded in the pattern (e.g., '(?i)pattern').
fn read_rune #
fn read_rune(s string, index int) (rune, int)
read_rune decodes the next UTF-8 character from the string at the given index.
struct Match #
struct Match {
pub:
text string
start int
end int
groups []string
}
Match represents a successful match result.
fn (Match) get #
fn (m Match) get(idx int) ?string
get retrieves the captured text by index. Index 0 returns the whole match, 1+ returns capture groups.
fn (Match) get_all #
fn (m Match) get_all() []string
get_all returns the whole match followed by all capture groups.
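As a short illustration of get and get_all, the sketch below extracts the parts of a simple address-like token. The helper split_address and the sample input are hypothetical; the indexing follows the descriptions above (0 is the whole match, 1 and up are capture groups).

```v
import regex.pcre

// split_address is a hypothetical helper: it returns the whole match
// followed by its capture groups, or an empty list when nothing matches.
fn split_address(r pcre.Regex, text string) []string {
	m := r.find(text) or { return []string{} }
	return m.get_all()
}

fn main() {
	r := pcre.compile(r'(\w+)@(\w+)') or { panic(err) }
	m := r.find('contact: bob@example') or { return }
	whole := m.get(0) or { '' } // index 0: the whole match
	user := m.get(1) or { '' } // index 1: first capture group
	host := m.get(2) or { '' } // index 2: second capture group
	println('${whole} -> user=${user}, host=${host}')
	println(split_address(r, 'contact: bob@example'))
}
```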
struct Regex #
struct Regex {
pub:
pattern string
prog []Inst
total_groups int
group_map map[string]int
}
Regex holds the compiled bytecode program.
fn (Regex) fullmatch #
fn (r Regex) fullmatch(text string) ?Match
fullmatch checks if the entire input text matches the regex pattern.
fn (Regex) replace #
fn (r Regex) replace(text string, repl string) string
replace finds the first match in text and replaces it with repl. Supports $1, $2, etc. in repl for group substitution.
fn (Regex) group_by_name #
fn (r Regex) group_by_name(m Match, name string) string
group_by_name retrieves the captured text for a named group from a Match.
fn (Regex) find #
fn (r Regex) find(text string) ?Match
find scans the string text for the first occurrence of a pattern match.
fn (Regex) find_all #
fn (r Regex) find_all(text string) []Match
find_all returns a list of all non-overlapping matches in the string.
fn (Regex) find_from #
fn (r Regex) find_from(text string, start_index int) ?Match
find_from finds the first match starting search from a specific index.
fn (Regex) match_str #
fn (r Regex) match_str(text string, start_index int, _ int) ?Match
match_str is an alias for find_from, for compatibility with older PCRE wrappers.
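For code ported from older PCRE wrappers, the two compatibility aliases can be combined as in the sketch below. The helper legacy_find is hypothetical. The 0 passed to new_regex is ignored according to the note above, and the 0 passed as the third argument of match_str is a placeholder, since that parameter is unnamed in the signature.

```v
import regex.pcre

// legacy_find is a hypothetical helper that uses only the compatibility
// aliases new_regex and match_str. It returns the matched text, or '' if none.
fn legacy_find(pattern string, text string) string {
	// Flags are embedded in the pattern; the integer argument is ignored.
	r := pcre.new_regex(pattern, 0) or { return '' }
	// Start at index 0; the last argument is a placeholder value.
	m := r.match_str(text, 0, 0) or { return '' }
	return m.text
}

fn main() {
	println(legacy_find(r'(?i)hello', 'say HELLO'))
}
```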