regex.pcre #

regex.pcre Module Documentation

The regex.pcre module is a high-performance Virtual Machine (VM) based regular expression engine for V.

Key Features

  • Non-recursive VM: Safe execution that avoids stack overflows on complex patterns.
  • Zero-Allocation Search: Uses a pre-allocated Machine workspace for search operations.
  • Fast ASCII Path: Optimized path for characters < 128 to bypass heavy UTF-8 decoding.
  • Bitmap Lookups: ASCII character classes use a 128-bit bitset for O(1) matching.
  • Instruction Merging: Consecutive character matches are merged into string blocks for faster execution.
  • NFA Virtual Machine: Executes bytecode instructions to simulate pattern matching.
  • Dynamic Stack Growth: Automatically expands the backtracking stack to prevent false negatives.
  • Anchored Optimization: Patterns starting with '^' skip the scanning loop.
  • Prefix Skipping: Uses Boyer-Moore-like skipping for literal prefixes.
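
The bitmap lookup listed above can be sketched in a few lines of V. This is an illustration under assumptions, not the module's actual code: the AsciiSet type, its add and has methods, and the digit_set helper are hypothetical names, and the 128 bits are assumed to be stored as two u64 words.

```v
// AsciiSet is a hypothetical 128-bit set: one bit per ASCII byte.
struct AsciiSet {
mut:
	lo u64 // membership bits for bytes 0..63
	hi u64 // membership bits for bytes 64..127
}

// add marks an ASCII byte as a member; non-ASCII bytes are ignored.
fn (mut s AsciiSet) add(c u8) {
	if c < 64 {
		s.lo |= u64(1) << u64(c)
	} else if c < 128 {
		s.hi |= u64(1) << u64(c - 64)
	}
}

// has answers membership with one shift and one mask.
// Bytes >= 128 return false here; an engine would send them to its UTF-8 path instead.
fn (s AsciiSet) has(c u8) bool {
	if c < 64 {
		return ((s.lo >> u64(c)) & u64(1)) == u64(1)
	}
	if c < 128 {
		return ((s.hi >> u64(c - 64)) & u64(1)) == u64(1)
	}
	return false
}

// digit_set builds the set that a \d class would need.
fn digit_set() AsciiSet {
	mut s := AsciiSet{}
	for c in '0123456789' {
		s.add(c)
	}
	return s
}

fn main() {
	digits := digit_set()
	println(digits.has(u8(`7`))) // true
	println(digits.has(u8(`x`))) // false
}
```

The lookup cost does not depend on how many characters the class contains, which is what makes the check O(1).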

Supported Syntax

Feature       Syntax           Description
Literals      abc              Matches exact characters (UTF-8 supported).
Wildcard      .                Matches any character (excluding \n unless the (?s) flag is used).
Alternation   |                Matches the left or right expression (e.g., cat|dog).
Quantifiers   *, +, ?          Matches 0+, 1+, or 0-1 times.
Lazy          *?, +?, ??       Non-greedy versions of the above.
Repetition    {m,n}            Matches between m and n times; {m,} matches m or more.
Groups        (...)            Capturing group.
              (?:...)          Non-capturing group.
              (?P<name>...)    Named capturing group.
Anchors       ^, $             Start/end of string (or of line with (?m)).
              \b, \B           Word boundary and non-word boundary.
Classes       [abc], [^abc]    Character set and negated character set.
              [a-z]            Range of characters.
              \w, \W           Word / non-word character (word is [a-zA-Z0-9_]).
              \d, \D           Digit / non-digit.
              \s, \S           Whitespace / non-whitespace (space, \t, \n, \r, \v, \f).
              \a, \A           Lowercase / uppercase ASCII character class.
Flags         (?i)             Case-insensitive matching.
              (?m)             Multiline mode (^ and $ match start/end of lines).
              (?s)             Dot-all mode (. matches newlines).
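
As a small illustration of the syntax above, the following sketch exercises an inline flag and a lazy quantifier. It uses only compile and find as documented on this page; the patterns and input strings are made up for the example, and the behaviour noted in the comments is what the table implies rather than a verified run.

```v
import regex.pcre

fn main() {
	// (?i) should make the literal match regardless of case.
	ci := pcre.compile(r'(?i)hello')!
	if m := ci.find('Say HELLO there') {
		println(m.text)
	}

	// The lazy +? should stop at the first closing bracket rather than the last.
	lazy := pcre.compile(r'<.+?>')!
	if m := lazy.find('<a><b>') {
		println(m.text)
	}
}
```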

Structs

Regex

The compiled regular expression object.

pub struct Regex {
pub:
    pattern      string         // The original pattern
    prog         []Inst         // Compiled VM bytecode
    total_groups int            // Number of capture groups
    group_map    map[string]int // Map for named groups
}

Match

Represents the result of a successful search.

pub struct Match {
pub:
    text   string   // The full substring that matched
    start  int      // Byte index where match starts
    end    int      // Byte index where match ends
    groups []string // Text captured by each group
}

Core Functions

compile

Compiles a pattern into a Regex object.

fn compile(pattern string) !Regex

find

Finds the first match in the text. Returns none if no match is found.

fn (r Regex) find(text string) ?Match

find_all

Returns all non-overlapping matches in a string.

fn (r Regex) find_all(text string) []Match
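
A minimal usage sketch for find_all, assuming the signature above. The pattern and input are invented for illustration, and the exact offsets printed depend on how the module defines end, so no output is shown.

```v
import regex.pcre

fn main() {
	r := pcre.compile(r'\d+')!
	for m in r.find_all('a1 b22 c333') {
		// start and end are byte offsets into the searched text.
		println('${m.text} at ${m.start}..${m.end}')
	}
}
```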

replace

Replaces the first match in text with repl. Supports backreferences like $1, $2.

fn (r Regex) replace(text string, repl string) string
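
A short sketch of replace with backreferences, assuming $1 and $2 refer to the first and second capture groups as stated above. The replacement is written as a raw string so that V passes the $ characters through unchanged.

```v
import regex.pcre

fn main() {
	r := pcre.compile(r'(\w+) (\w+)')!
	// Swap the two captured words in the first (and here only) match.
	swapped := r.replace('hello world', r'$2 $1')
	println(swapped)
}
```

Because replace only touches the first match, a later pair of words in a longer input would be left as-is.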

change_stack_depth

Updates the maximum backtracking depth for the VM. Default is 1024. Use this if your pattern is extremely complex and returns none prematurely.

fn (mut r Regex) change_stack_depth(depth int)
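
A hedged usage sketch: the depth value 8192 and the test input are arbitrary choices for illustration, and this page does not say how depth relates to input length, so the example makes no claim about which inputs need a larger limit.

```v
import regex.pcre

fn main() {
	// The method takes a mutable receiver, so the Regex must be declared with mut.
	mut r := pcre.compile(r'(a|b)*c')!
	r.change_stack_depth(8192) // raise the limit above the documented default of 1024

	long_input := 'ab'.repeat(2000) + 'c'
	if m := r.find(long_input) {
		println('matched ${m.text.len} bytes')
	} else {
		println('no match')
	}
}
```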

Named Groups Example

import regex.pcre

fn main() {
    r := pcre.compile(r'(?P<year>\d{4})-(?P<month>\d{2})')!
    m := r.find('Date: 2026-02') or { return }

    year := r.group_by_name(m, 'year')
    month := r.group_by_name(m, 'month')
    println('Year: ${year}, Month: ${month}') // Year: 2026, Month: 02
}

PCRE Compatibility Layer

To facilitate easier migration from other engines, a compatibility layer is provided:

Function                          Equivalent to / behavior
new_regex(pattern, flags)         compile(pattern); the flags argument is ignored.
r.match_str(text, start, flags)   r.find_from(text, start); the flags argument is ignored.
m.get(idx)                        Returns the full match text (idx 0) or a capture group (idx 1+).
m.get_all()                       Returns [full_match, group1, group2, ...].

Example:

import regex.pcre

r := pcre.new_regex(r'(\w+) (\w+)', 0)!
if m := r.match_str('hello world', 0, 0) {
    println(m.get(0)?) // "hello world"
    println(m.get(1)?) // "hello"
    println(m.get(2)?) // "world"
}

Performance Note

The implementation relies on the following optimizations:

  • Raw Pointer Access: The VM bypasses standard array bounds checking by using unsafe pointer arithmetic for both the instruction set and the string text, which speeds up the hot loop.
  • Zero-Allocation Search: The Machine struct pre-allocates the backtracking stack and capture arrays, so running a search creates no new heap allocations and adds no garbage-collection pressure.
  • Fast ASCII Path: The code checks whether a byte is < 128 before decoding. If it is ASCII, it skips the expensive UTF-8 decoding logic entirely.
  • Bitmap Class Lookups: Character classes (like \w, \d, [a-z]) use a 128-bit bitset. Checking whether an ASCII character matches a class is a single O(1) bitwise operation.
  • Instruction Merging: The compiler groups consecutive literal characters into a single string instruction (e.g., a, b, c becomes "abc"), reducing the number of VM cycles required.
  • Prefix Skipping: If a pattern starts with a literal string, the engine scans ahead for that substring (Boyer-Moore style) before initializing the VM, avoiding useless execution.
  • Anchored Optimization: If the pattern starts with ^, the engine only attempts a match at the start of the string (or line), skipping the character-by-character scan of the rest of the text.
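
The instruction-merging idea described above can be sketched as a small pass over a compiled program. Everything here is hypothetical: SketchInst is a simplified stand-in (the module's real Inst type is not shown on this page), and lit and merge_literals are illustrative helper names.

```v
// SketchInst is a hypothetical, simplified instruction.
struct SketchInst {
	is_lit bool   // true when the instruction matches literal text
	lit    string // the literal text (one or more bytes)
	op     string // name of a non-literal operation, e.g. 'any'
}

// lit builds a literal instruction.
fn lit(s string) SketchInst {
	return SketchInst{
		is_lit: true
		lit:    s
	}
}

// merge_literals folds runs of adjacent literal instructions into one.
// Assumption: no jump targets the middle of a run; a real compiler must check that.
fn merge_literals(prog []SketchInst) []SketchInst {
	mut out := []SketchInst{cap: prog.len}
	for inst in prog {
		if inst.is_lit && out.len > 0 && out.last().is_lit {
			// Extend the previous literal instead of emitting a new instruction.
			prev := out.pop()
			out << lit(prev.lit + inst.lit)
		} else {
			out << inst
		}
	}
	return out
}

fn main() {
	prog := [lit('a'), lit('b'), lit('c'), SketchInst{
		op: 'any'
	}, lit('d')]
	for inst in merge_literals(prog) {
		if inst.is_lit {
			println('lit "${inst.lit}"')
		} else {
			println(inst.op)
		}
	}
}
```

For the five-instruction input in main, the pass leaves three instructions: the literal "abc", the 'any' operation, and the literal "d". The VM then compares "abc" in one step instead of three.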

fn compile #

fn compile(pattern string) !Regex

compile transforms a regex pattern string into a Regex object.

fn new_regex #

fn new_regex(pattern string, _ int) !Regex

new_regex is a helper wrapper to compile a regex pattern.

struct Machine #

struct Machine {
mut:
	stack    []int // Backtracking stack stores: [capture_states..., string_ptr, next_pc]
	captures []int // Flat array of [start, end] byte indices for groups
}

Machine provides a workspace for VM execution. To ensure thread safety, this is created per top-level API call.

struct Match #

struct Match {
pub:
	text   string   // The full text of the match
	start  int      // Starting byte index in the source string
	end    int      // Ending byte index in the source string
	groups []string // Sub-strings captured by groups
}

Match contains the results of a successful regex match.

fn (Match) get #

fn (m Match) get(idx int) ?string

get returns the matched text for a specific group index. Index 0 returns the full match, 1..n returns capture groups.

fn (Match) get_all #

fn (m Match) get_all() []string

get_all returns a list of all captured strings, starting with the full match at index 0.

struct Regex #

struct Regex {
pub:
	pattern      string         // The original regex string
	prog         []Inst         // Compiled bytecode
	total_groups int            // Number of capture groups defined in pattern
	group_map    map[string]int // Mapping of names to indices for (?P<name>...)
	prefix_lit   string         // Pre-calculated literal prefix for fast-skip optimization
	has_prefix   bool           // Whether a literal prefix exists
	anchored     bool           // True if pattern starts with '^' (optimization hint)
pub mut:
	max_stack_depth int // User-defined stack limit hint
}

Regex is the compiled regular expression object.

fn (Regex) new_machine #

fn (r &Regex) new_machine() Machine

new_machine allocates a new VM state machine. This isolates the runtime memory (stack/captures) from the compiled regex, allowing thread-safe usage.

fn (Regex) find #

fn (r &Regex) find(text string) ?Match

find returns the first match of the regex in text.

fn (Regex) find_from #

fn (r &Regex) find_from(text string, start_index int) ?Match

find_from returns the first match starting from start_index. Optimized with fast prefix skipping and anchor checks.
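
One way find_from might be used is to walk through a text match by match. The sketch below assumes that Match.end is an exclusive byte offset into the original text, which this page does not state explicitly; the guard on pos keeps the loop moving forward even if that assumption is wrong or a match is empty.

```v
import regex.pcre

fn main() {
	r := pcre.compile(r'\d+')!
	text := 'a1 b22 c333'
	mut pos := 0
	for pos < text.len {
		m := r.find_from(text, pos) or { break }
		println(m.text)
		// Assumption: m.end is an exclusive byte offset into text.
		// If it does not move past pos, step forward by one byte instead.
		pos = if m.end > pos { m.end } else { pos + 1 }
	}
}
```

For plain iteration, find_all is the simpler choice; a manual loop like this is mainly useful when the caller wants to stop early or change the start position between matches.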

fn (Regex) find_all #

fn (r &Regex) find_all(text string) []Match

find_all returns all non-overlapping matches in text.

fn (Regex) replace #

fn (r &Regex) replace(text string, repl string) string

replace finds the first match and replaces it using repl. Supports $1, $2 backreferences.

fn (Regex) fullmatch #

fn (r &Regex) fullmatch(text string) ?Match

fullmatch returns a Match only if the pattern matches the entire text.
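
A brief sketch of fullmatch used for validation, assuming the behaviour described above. The pattern and the two inputs are made up for the example: the first input consists only of the year-month text, while the second has extra leading text and so should not produce a full match.

```v
import regex.pcre

fn main() {
	r := pcre.compile(r'\d{4}-\d{2}')!
	for s in ['2026-02', 'Date: 2026-02'] {
		if _ := r.fullmatch(s) {
			println('${s}: whole string matches')
		} else {
			println('${s}: no full match')
		}
	}
}
```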

fn (Regex) group_by_name #

fn (r &Regex) group_by_name(m Match, name string) string

group_by_name retrieves text captured by a named group.

fn (Regex) match_str #

fn (r &Regex) match_str(text string, start_index int, _ int) ?Match

match_str is a compatibility alias for find_from.