regex.pcre #

regex.pcre Module Documentation

The regex.pcre module is a high-performance regular expression engine for V, built on a virtual machine (VM).

Key Features

Non-recursive VM: Safe execution that avoids stack overflows on complex patterns.
Zero-Allocation Search: Uses a pre-allocated Machine workspace for search operations.
Fast ASCII Path: Optimized path for characters < 128 to bypass heavy UTF-8 decoding.
Bitmap Lookups: ASCII character classes use a 128-bit bitset for O(1) matching.
Instruction Merging: Consecutive character matches are merged into string blocks for faster execution.
NFA Virtual Machine: Executes bytecode instructions to simulate pattern matching.
Dynamic Stack Growth: Automatically expands the backtracking stack to prevent false negatives.
Anchored Optimization: Patterns starting with '^' skip the scanning loop.
Prefix Skipping: Uses Boyer-Moore-like skipping for literal prefixes.
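
The bitmap lookup listed above can be pictured with a small standalone sketch. The `AsciiSet` type and its methods are hypothetical names chosen for illustration; they are not taken from the module source, which may lay out its bitset differently.

```v
// AsciiSet is a hypothetical 128-bit bitset: one bit per ASCII code point.
struct AsciiSet {
mut:
	bits [2]u64
}

// add marks the ASCII byte c as a member of the set; bytes >= 128 are ignored.
fn (mut s AsciiSet) add(c u8) {
	if c < 128 {
		s.bits[int(c >> 6)] |= u64(1) << (c & 63)
	}
}

// has tests membership with one shift and one mask, whatever the set size.
fn (s AsciiSet) has(c u8) bool {
	if c >= 128 {
		return false
	}
	return ((s.bits[int(c >> 6)] >> (c & 63)) & 1) == 1
}

fn main() {
	mut digits := AsciiSet{}
	for c in '0123456789' {
		digits.add(c)
	}
	println(digits.has(u8(`7`))) // true
	println(digits.has(u8(`a`))) // false
}
```

Because membership is a fixed number of bit operations, a class such as `[a-z0-9_]` costs the same to test as a single-character class.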

Supported Syntax

Feature	Syntax	Description
Literals	`abc`	Matches exact characters (UTF-8 supported).
Wildcard	`.`	Matches any character (excluding `\n` unless the `(?s)` flag is used).
Alternation	`|`	Matches the left OR right expression (e.g., `cat|dog`).
Quantifiers	`*`, `+`, `?`	Matches 0+, 1+, or 0-1 times.
Lazy	`*?`, `+?`, `??`	Non-greedy versions of the above.
Repetition	`{m,n}`	Matches between `m` and `n` times. `{m,}` for m or more.
Groups	`(...)`	Capturing group.
	`(?:...)`	Non-capturing group.
	`(?P<name>...)`	Named capturing group.
Anchors	`^`, `$`	Start/End of string (or line with `(?m)`).
	`\b`, `\B`	Word boundary and Non-word boundary.
Classes	`[abc]`, `[^abc]`	Character set and Negated character set.
	`[a-z]`	Range of characters.
	`\w`, `\W`	Word / Non-word (`[a-zA-Z0-9_]`).
	`\d`, `\D`	Digit / Non-digit.
	`\s`, `\S`	Whitespace / Non-whitespace (space, `\t`, `\n`, `\r`, `\v`, `\f`).
	`\a`, `\A`	Lowercase / Uppercase ASCII character class.
Flags	`(?i)`	Case-insensitive matching.
	`(?m)`	Multiline mode (`^` and `$` match start/end of lines).
	`(?s)`	Dot-all mode (`.` matches newlines).
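
As an illustration of the table above, the following sketch combines an inline flag, a lazy quantifier, and a bounded repetition. It relies only on `compile` and `find` as documented in this README; the patterns and sample strings are made up for the example.

```v
import regex.pcre

fn main() {
	// (?i) lets the literal match regardless of letter case.
	ci := pcre.compile(r'(?i)hello')!
	if m := ci.find('Say HELLO') {
		println(m.text)
	}

	// The lazy `+?` is meant to stop at the first closing bracket, not the last.
	lazy_tag := pcre.compile(r'<.+?>')!
	if m := lazy_tag.find('<a><b>') {
		println(m.text)
	}

	// {2,4} accepts between two and four digits.
	digits := pcre.compile(r'\d{2,4}')!
	if m := digits.find('id 12345') {
		println(m.text)
	}
}
```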

Structs

Regex

The compiled regular expression object.

pub struct Regex {
pub:
    pattern      string         // The original pattern
    prog         []Inst         // Compiled VM bytecode
    total_groups int            // Number of capture groups
    group_map    map[string]int // Map for named groups
}

Match

Represents the result of a successful search.

pub struct Match {
pub:
    text   string   // The full substring that matched
    start  int      // Byte index where match starts
    end    int      // Byte index where match ends
    groups []string // Text captured by each group
}

Core Functions

`compile`

Compiles a pattern into a Regex object.

fn compile(pattern string) !Regex
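
Since `compile` returns a result type, a caller can handle a bad pattern with an `or` block instead of propagating the error. The sketch below assumes that an unbalanced group is rejected at compile time; the exact error text is not specified in this documentation.

```v
import regex.pcre

fn main() {
	// Assumption: a group that is never closed is reported as a compile error.
	re := pcre.compile(r'(unclosed') or {
		eprintln('invalid pattern: ${err}')
		return
	}
	// Only reached when the pattern compiled successfully.
	println(re.pattern)
}
```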

`find`

Finds the first match in the text. Returns none if no match is found.

fn (r &Regex) find(text string) ?Match

`find_all`

Returns all non-overlapping matches in a string.

fn (r &Regex) find_all(text string) []Match
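
A short usage sketch for `find_all`, iterating over the returned matches and reading the `Match` fields described under Structs. The input string is illustrative only.

```v
import regex.pcre

fn main() {
	re := pcre.compile(r'\d+')!
	for m in re.find_all('a1 b22 c333') {
		// start and end are byte offsets into the searched text.
		println('${m.text} at ${m.start}..${m.end}')
	}
}
```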

`replace`

Replaces the first match in text with repl. Supports backreferences like $1, $2.

fn (r &Regex) replace(text string, repl string) string
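
The backreference support can be tried with a sketch like the one below, which swaps the two captured words. The replacement is written as a raw string so that `$1` and `$2` reach `replace` unchanged; the sample text is made up.

```v
import regex.pcre

fn main() {
	re := pcre.compile(r'(\w+)@(\w+)')!
	// $1 and $2 refer to the first and second capture group.
	println(re.replace('mail: user@example', r'$2 at $1'))
}
```

Only the first match is rewritten, as stated above; text outside the match is left as it is.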

`change_stack_depth`

Sets the maximum backtracking depth for the VM (default: 1024). Increase it if a very complex pattern returns none before the search has finished.

fn (mut r Regex) change_stack_depth(depth int)
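
A possible way to use `change_stack_depth` is shown below. The value 8192 and the test input are arbitrary choices for illustration; whether a given pattern actually needs a higher limit depends on the input.

```v
import regex.pcre

fn main() {
	// The receiver is mutable, so the Regex must be declared with `mut`.
	mut re := pcre.compile(r'(a|b)*c')!
	// Raise the limit above the default of 1024 before searching long inputs.
	re.change_stack_depth(8192)

	input := 'ab'.repeat(2000) + 'c'
	if m := re.find(input) {
		println('matched ${m.text.len} bytes')
	} else {
		println('no match')
	}
}
```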

Named Groups Example

import regex.pcre

fn main() {
    r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
    m := r.find('Date: 2026-02') or { return }

    year := r.group_by_name(m, 'year')
    month := r.group_by_name(m, 'month')
    println('Year: ${year}, Month: ${month}') // Year: 2026, Month: 02
}

PCRE Compatibility Layer

To facilitate easier migration from other engines, a compatibility layer is provided:

Function	Equivalent To
`new_regex(pattern, flags)`	`compile(pattern)`
`r.match_str(text, start, flags)`	`r.find_from(text, start)`
`m.get(idx)`	Retrieves match text (`0`) or capture group (`1+`).
`m.get_all()`	Returns `[full_match, group1, group2, ...]`

Example:

import regex.pcre

r := pcre.new_regex(r'(\w+) (\w+)', 0)!
if m := r.match_str('hello world', 0, 0) {
    println(m.get(0)?) // "hello world"
    println(m.get(1)?) // "hello"
    println(m.get(2)?) // "world"
}

Performance Note

The engine implements the following optimizations:

Raw Pointer Access: The VM bypasses standard array bounds checking by using unsafe pointer arithmetic for both the instruction set and the string text, which speeds up the hot loop.
Zero-Allocation Search: The Machine struct pre-allocates the backtracking stack and capture arrays, so running a search creates no new heap allocations and adds no garbage collection pressure.
Fast ASCII Path: The code checks whether a byte is < 128 before decoding. If it is ASCII, it skips the expensive UTF-8 decoding logic entirely.
Bitmap Class Lookups: Character classes (like \w, \d, [a-z]) use a 128-bit bitset. Checking whether an ASCII character matches a class is a single O(1) bitwise operation.
Instruction Merging: The compiler groups consecutive literal characters into a single string instruction (e.g., a, b, c becomes "abc"), reducing the number of VM cycles required.
Prefix Skipping: If a pattern starts with a literal string, the engine scans ahead for that substring (Boyer-Moore style) before initializing the VM, avoiding useless execution.
Anchored Optimization: If the pattern starts with ^, the engine only attempts a match at the start of the string (or line), skipping the character-by-character scan of the rest of the text.
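
To make the Instruction Merging step concrete, here is a minimal standalone sketch. The `Op` struct and `merge_literals` function are hypothetical and heavily simplified; the module's real `Inst` type and compiler pass are not shown in this documentation and may differ.

```v
// Op is a hypothetical, simplified instruction: a literal run or any other opcode.
struct Op {
	is_lit bool
	text   string
}

// merge_literals folds adjacent literal instructions into one string block.
fn merge_literals(prog []Op) []Op {
	mut out := []Op{}
	for op in prog {
		if op.is_lit && out.len > 0 && out.last().is_lit {
			// Extend the previous literal instead of emitting a new instruction.
			prev := out.pop()
			out << Op{
				is_lit: true
				text:   prev.text + op.text
			}
		} else {
			out << op
		}
	}
	return out
}

fn main() {
	prog := [
		Op{ is_lit: true, text: 'a' },
		Op{ is_lit: true, text: 'b' },
		Op{ is_lit: true, text: 'c' },
		Op{ is_lit: false, text: '.' },
	]
	merged := merge_literals(prog)
	println(merged.len) // 2
	println(merged[0].text) // abc
}
```

With the three literals merged, the VM in this sketch would compare one three-byte string rather than dispatch three separate instructions.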

fn compile #

fn compile(pattern string) !Regex

compile transforms a regex pattern string into a Regex object.

fn new_regex #

fn new_regex(pattern string, _ int) !Regex

new_regex is a helper wrapper to compile a regex pattern.

struct Machine #

struct Machine {
mut:
	stack    []int // Backtracking stack stores: [capture_states..., string_ptr, next_pc]
	captures []int // Flat array of [start, end] byte indices for groups
}

Machine provides a workspace for VM execution. To ensure thread safety, this is created per top-level API call.

struct Match #

struct Match {
pub:
	text   string   // The full text of the match
	start  int      // Starting byte index in the source string
	end    int      // Ending byte index in the source string
	groups []string // Sub-strings captured by groups
}

Match contains the results of a successful regex match.

fn (Match) get #

fn (m Match) get(idx int) ?string

get returns the matched text for a specific group index. Index 0 returns the full match, 1..n returns capture groups.

fn (Match) get_all #

fn (m Match) get_all() []string

get_all returns a list of all captured strings, starting with the full match at index 0.
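
The two accessors can be combined as in the sketch below. The pattern and input are illustrative; the fallback to an empty string when `get` returns none is a choice made for this example, not behavior prescribed by the module.

```v
import regex.pcre

fn main() {
	re := pcre.compile(r'(\d+)x(\d+)')!
	m := re.find('size 640x480') or { return }

	// Index 0 is the full match, followed by one entry per capture group.
	println(m.get_all())

	// get returns an option, so an out-of-range index can be handled explicitly.
	width := m.get(1) or { '' }
	println(width)
}
```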

struct Regex #

struct Regex {
pub:
	pattern      string         // The original regex string
	prog         []Inst         // Compiled bytecode
	total_groups int            // Number of capture groups defined in pattern
	group_map    map[string]int // Mapping of names to indices for (?P...)
	prefix_lit   string         // Pre-calculated literal prefix for fast-skip optimization
	has_prefix   bool           // Whether a literal prefix exists
	anchored     bool           // True if pattern starts with '^' (optimization hint)
pub mut:
	max_stack_depth int // User-defined stack limit hint
}

Regex is the compiled regular expression object.

fn (Regex) new_machine #

fn (r &Regex) new_machine() Machine

new_machine allocates a new VM state machine. This isolates the runtime memory (stack/captures) from the compiled regex, allowing thread-safe usage.

fn (Regex) find #

fn (r &Regex) find(text string) ?Match

find returns the first match of the regex in text.

fn (Regex) find_from #

fn (r &Regex) find_from(text string, start_index int) ?Match

find_from returns the first match starting from start_index. Optimized with fast prefix skipping and anchor checks.
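
A small sketch of `find_from`, skipping a known prefix of the input before searching. The offset of 4 is specific to this sample string.

```v
import regex.pcre

fn main() {
	re := pcre.compile(r'[a-z]+')!
	text := 'one two three'
	// Skip the first four bytes ("one ") and search from there.
	if m := re.find_from(text, 4) {
		println(m.text)
	}
}
```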

fn (Regex) find_all #

fn (r &Regex) find_all(text string) []Match

find_all returns all non-overlapping matches in text.

fn (Regex) replace #

fn (r &Regex) replace(text string, repl string) string

replace finds the first match and replaces it using repl. Supports $1, $2 backreferences.

fn (Regex) fullmatch #

fn (r &Regex) fullmatch(text string) ?Match

fullmatch returns a Match only if the pattern matches the entire text.
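
One typical use of `fullmatch` is input validation, where a partial match should not count. The date pattern below is an illustrative example and checks only the shape of the string, not whether the date exists.

```v
import regex.pcre

fn main() {
	re := pcre.compile(r'\d{4}-\d{2}-\d{2}')!
	for s in ['2026-02-14', 'on 2026-02-14'] {
		// fullmatch yields a Match only when the whole string fits the pattern.
		if m := re.fullmatch(s) {
			println('valid: ${m.text}')
		} else {
			println('rejected: ${s}')
		}
	}
}
```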

fn (Regex) group_by_name #

fn (r &Regex) group_by_name(m Match, name string) string

group_by_name retrieves text captured by a named group.

fn (Regex) match_str #

fn (r &Regex) match_str(text string, start_index int, _ int) ?Match

match_str is a compatibility alias for find_from.
