6 releases (breaking)
| Version | Released |
|---|---|
| 0.5.1 | Dec 1, 2025 |
| 0.5.0 | Sep 19, 2025 |
| 0.4.0 | Sep 17, 2025 |
| 0.3.0 | Sep 16, 2025 |
| 0.1.0 | Sep 13, 2025 |
#2238 in Text processing
Used in 2 crates
470KB
11K SLoC
scribe-patterns
Advanced pattern matching and filtering for Scribe repository analysis.
Overview
scribe-patterns provides sophisticated pattern matching capabilities for file selection, filtering, and search operations. It handles glob patterns, regex matching, .gitignore semantics, and custom ignore rules with high performance and correct edge case handling.
Key Features
Glob Pattern Matching
- Standard glob syntax: `*`, `**`, `?`, `[abc]`, `{a,b,c}`
- Directory-aware matching: Handles `**/` for recursive directory traversal
- Negative patterns: `!pattern` to exclude specific files
- Case sensitivity control: Case-insensitive matching on Windows by default
Gitignore Semantics
- `.gitignore` parsing: Full compatibility with Git's ignore rules
- Directory negation: Properly handles `!` negation patterns
- Relative vs absolute paths: Distinguishes `/pattern` from `pattern`
- Trailing slashes: Directory-only patterns with `/`
- Comment support: Lines starting with `#` are ignored
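The rules above can be illustrated with a small std-only sketch that classifies a single ignore-file line. `IgnoreRule` and `parse_ignore_line` are hypothetical names for illustration, not part of the scribe-patterns API, and the sketch skips details of Git's format such as `\#` escapes and patterns that contain a slash in the middle.

```rust
/// One parsed line of an ignore file (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct IgnoreRule {
    pattern: String, // pattern text with the markers below stripped
    negated: bool,   // line started with `!`
    anchored: bool,  // line started with `/` (matches at the root only)
    dir_only: bool,  // line ended with `/` (matches directories only)
}

/// Parse a single line; returns None for blank lines and `#` comments.
fn parse_ignore_line(line: &str) -> Option<IgnoreRule> {
    let mut text = line.trim();
    if text.is_empty() || text.starts_with('#') {
        return None;
    }
    let negated = text.starts_with('!');
    if negated {
        text = &text[1..];
    }
    let anchored = text.starts_with('/');
    if anchored {
        text = &text[1..];
    }
    let dir_only = text.ends_with('/');
    if dir_only {
        text = &text[..text.len() - 1];
    }
    if text.is_empty() {
        return None; // nothing left after stripping the markers
    }
    Some(IgnoreRule {
        pattern: text.to_string(),
        negated,
        anchored,
        dir_only,
    })
}
```

A line such as `!/build/` comes back as a negated, anchored, directory-only rule for `build`, while `*.log` carries none of the three flags.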
Custom Ignore Files
- `.scribeignore`: Scribe-specific ignore patterns
- Multiple ignore files: Hierarchical ignore file processing
- Override precedence: Later patterns override earlier ones
- Inheritance: Child directories inherit parent ignore rules
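Override precedence and inheritance can be sketched as "last matching rule wins" over the parent's rules followed by the child's. The `Rule` type and `is_ignored` function below are hypothetical, and matching is reduced to a plain suffix test to keep the example short; the crate's real matching is assumed to be glob-based.

```rust
/// A simplified rule: a path suffix plus whether it re-includes (`!`) the path.
struct Rule {
    suffix: &'static str, // assumption: suffix matching stands in for real globs
    negated: bool,
}

/// Returns true if `path` ends up ignored. Rules are scanned in order and the
/// last matching rule decides, so later rules override earlier ones.
fn is_ignored<'a>(rules: impl IntoIterator<Item = &'a Rule>, path: &str) -> bool {
    let mut ignored = false;
    for rule in rules {
        if path.ends_with(rule.suffix) {
            ignored = !rule.negated;
        }
    }
    ignored
}
```

Chaining the rule lists, for example `is_ignored(parent.iter().chain(child.iter()), path)`, models a child directory that inherits its parent's rules and can still override them.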
Performance Optimizations
- Compiled pattern sets: Pre-compile globs into efficient matchers
- Aho-Corasick for literals: Fast multi-pattern matching for literal strings
- Regex caching: Compiled regex patterns are cached
- Early returns: Short-circuit evaluation for common cases
Architecture
Pattern Input → Parser   → Compiled Matcher → Match Engine
      ↓            ↓              ↓                ↓
  Glob/Regex    Validate    globset/regex    Apply to Paths
   Strings       Syntax      Compilation     Fast Matching
Core Components
PatternSet
Collection of patterns with unified matching interface:
- Globs: File name patterns like `*.rs`, `**/*.py`
- Regex: Complex patterns using regular expressions
- Literals: Exact string matches (optimized with Aho-Corasick)
- Negations: Exclude patterns that override includes
IgnoreBuilder
Constructs ignore rule sets from multiple sources:
- `.gitignore` files: Standard Git ignore semantics
- `.scribeignore` files: Scribe-specific patterns
- Custom patterns: Programmatically added rules
- Precedence handling: Correct override behavior
PathMatcher
Efficient path matching against pattern sets:
- Compiled matchers: Pre-compiled globset for performance
- Path normalization: Handles Windows vs Unix path separators
- Absolute vs relative: Correct matching for both path types
- Directory detection: Special handling for directory patterns
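Path normalization might look roughly like the following std-only sketch. `normalize_path` is a hypothetical helper, not the crate's function; it assumes backslashes are always separators (which is not true for Unix file names) and does not resolve `..` components.

```rust
/// Normalize a path string for matching: unify separators, drop empty and `.`
/// components, and keep a leading `/` so absolute paths stay absolute.
fn normalize_path(path: &str) -> String {
    let unified = path.replace('\\', "/");
    let is_absolute = unified.starts_with('/');
    let parts: Vec<&str> = unified
        .split('/')
        .filter(|part| !part.is_empty() && *part != ".")
        .collect();
    let joined = parts.join("/");
    if is_absolute {
        format!("/{}", joined)
    } else {
        joined
    }
}
```

Normalizing once up front lets the same compiled patterns be applied to `src\utils\mod.rs` and `src/utils/mod.rs` alike.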
PatternParser
Parses and validates pattern syntax:
- Glob expansion: Converts globs to regex when needed
- Escape sequence handling: Properly handles `\*`, `\?`, etc.
- Error reporting: Clear error messages for invalid patterns
- Syntax validation: Detects malformed patterns early
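Early syntax validation with escape handling can be sketched without any dependencies. `GlobSyntaxError` and `validate_glob` are hypothetical names, and the set of checks (unclosed `[`, unbalanced `{}`, trailing backslash) is an assumption about what "malformed" means, not a description of the crate's parser.

```rust
#[derive(Debug, PartialEq)]
enum GlobSyntaxError {
    UnclosedBracket(usize),       // byte offset of the unmatched `[`
    UnclosedBrace(usize),         // byte offset of the unmatched `{`
    UnmatchedClosingBrace(usize), // byte offset of a `}` with no `{`
    TrailingBackslash,            // pattern ends in an escape with nothing to escape
}

/// Check bracket and brace balance, skipping escaped characters such as `\*`.
fn validate_glob(pattern: &str) -> Result<(), GlobSyntaxError> {
    let mut chars = pattern.char_indices();
    let mut open_bracket: Option<usize> = None;
    let mut open_braces: Vec<usize> = Vec::new();
    while let Some((i, c)) = chars.next() {
        match c {
            '\\' => {
                // An escape consumes the next character, whatever it is.
                if chars.next().is_none() {
                    return Err(GlobSyntaxError::TrailingBackslash);
                }
            }
            '[' if open_bracket.is_none() => open_bracket = Some(i),
            ']' if open_bracket.is_some() => open_bracket = None,
            '{' if open_bracket.is_none() => open_braces.push(i),
            '}' if open_bracket.is_none() => {
                if open_braces.pop().is_none() {
                    return Err(GlobSyntaxError::UnmatchedClosingBrace(i));
                }
            }
            _ => {}
        }
    }
    if let Some(i) = open_bracket {
        return Err(GlobSyntaxError::UnclosedBracket(i));
    }
    if let Some(i) = open_braces.pop() {
        return Err(GlobSyntaxError::UnclosedBrace(i));
    }
    Ok(())
}
```

Reporting the byte offset of the offending character is one way to make the error message point at the exact spot in the pattern.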
Usage
Basic Glob Matching
use scribe_patterns::{PatternSet, PathMatcher};
let patterns = PatternSet::from_globs(vec![
    "**/*.rs",       // All Rust files
    "**/*.py",       // All Python files
    "!**/*_test.py", // Except test files
])?;
let matcher = PathMatcher::new(patterns);
assert!(matcher.is_match("src/main.rs"));
assert!(matcher.is_match("lib/utils.py"));
assert!(!matcher.is_match("lib/utils_test.py")); // Negated
Gitignore-Style Filtering
use scribe_patterns::IgnoreBuilder;
let mut builder = IgnoreBuilder::new("/path/to/repo");
builder.add_gitignore(".gitignore")?;
builder.add_custom("target/**")?; // Exclude Rust build directory
builder.add_custom("!target/debug/important.txt")?; // But include this file
let ignore = builder.build()?;
for entry in walkdir::WalkDir::new("/path/to/repo") {
    let entry = entry?;
    if ignore.matched(entry.path(), entry.file_type().is_dir()).is_ignore() {
        continue; // Skip ignored files
    }
    // Process file
}
Multiple Pattern Sets
use scribe_patterns::{PatternSet, Matcher};
// Include patterns
let include = PatternSet::from_globs(vec![
    "src/**/*.rs",
    "lib/**/*.rs",
])?;
// Exclude patterns
let exclude = PatternSet::from_globs(vec![
    "**/target/**",
    "**/*.bak",
])?;
let matcher = Matcher::new()
    .include(include)
    .exclude(exclude);
// File must match include AND not match exclude
if matcher.should_include("src/utils.rs") {
    // Process file
}
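The "match include AND not match exclude" rule can be shown in isolation with plain predicates. This is a std-only sketch: `Filter`, `Predicate`, and the helper functions are hypothetical stand-ins for compiled pattern sets, not the crate's `Matcher`.

```rust
/// A predicate stands in for a compiled pattern set in this sketch.
type Predicate = fn(&str) -> bool;

struct Filter {
    include: Vec<Predicate>,
    exclude: Vec<Predicate>,
}

impl Filter {
    /// A path is kept only if some include predicate accepts it and no
    /// exclude predicate does. `&&` short-circuits, so excludes are only
    /// evaluated for paths that were included in the first place.
    fn should_include(&self, path: &str) -> bool {
        self.include.iter().any(|pred| pred(path))
            && !self.exclude.iter().any(|pred| pred(path))
    }
}

fn is_rust_source(path: &str) -> bool {
    path.ends_with(".rs")
}

fn is_backup(path: &str) -> bool {
    path.ends_with(".bak")
}

fn in_target_dir(path: &str) -> bool {
    path.split('/').any(|segment| segment == "target")
}

/// Example filter: Rust sources, minus build output and backup files.
fn rust_sources_outside_target() -> Filter {
    Filter {
        include: vec![is_rust_source as Predicate],
        exclude: vec![in_target_dir as Predicate, is_backup as Predicate],
    }
}
```

In this sketch an empty include list accepts nothing; whether the crate treats an empty include set as "match everything" is not stated here.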
Regex Patterns
use scribe_patterns::PatternSet;
let patterns = PatternSet::from_regex(vec![
    r".*_test\.(rs|py)$", // Test files in Rust or Python
    r"^src/.*/mod\.rs$",  // All mod.rs files in src
])?;
assert!(patterns.is_match("src/utils/mod.rs"));
assert!(patterns.is_match("lib/parser_test.py"));
Case-Insensitive Matching
use scribe_patterns::{MatchOptions, PathMatcher, PatternSet};
let patterns = PatternSet::from_globs(vec!["*.TXT", "*.Md"])?;
let options = MatchOptions {
    case_sensitive: false,
    ..Default::default()
};
let matcher = PathMatcher::new(patterns).with_options(options);
assert!(matcher.is_match("readme.md")); // Matches *.Md
assert!(matcher.is_match("notes.txt")); // Matches *.TXT
Pattern Syntax
Glob Patterns
| Pattern | Matches | Example |
|---|---|---|
| `*` | Any string (not `/`) | `*.rs` → `main.rs`, `lib.rs` |
| `**` | Any path segment | `**/*.py` → `a/b/c.py` |
| `?` | Single character | `?.txt` → `a.txt`, `1.txt` |
| `[abc]` | Character set | `[abc].rs` → `a.rs`, `b.rs` |
| `{a,b}` | Alternatives | `*.{rs,py}` → `main.rs`, `util.py` |
| `!pattern` | Negation | `!test*.py` → exclude test files |
Gitignore Rules
| Pattern | Behavior |
|---|---|
| `pattern` | Matches in any directory |
| `/pattern` | Matches only at root |
| `dir/` | Matches directory only |
| `!pattern` | Negates previous patterns |
| `#comment` | Ignored line |
Special Cases
- Empty patterns: Ignored (no effect)
- Whitespace: Leading/trailing whitespace is trimmed
- Backslash escapes: `\*` matches literal `*`
- Unicode: Full UTF-8 support for paths and patterns
Performance
Benchmarks
Pattern compilation and matching are highly optimized:
- Glob compilation: <1ms for typical pattern sets (10-50 patterns)
- Path matching: <1μs per path for compiled matchers
- Literal matching: <100ns using Aho-Corasick for large literal sets
- Regex matching: ~1-10μs depending on pattern complexity
Optimizations
- Lazy compilation: Patterns compiled only when first used
- Caching: Compiled matchers cached in `OnceCell`
- Fast paths: Literal string matching before expensive regex
- Set operations: Boolean algebra simplification for pattern sets
- Aho-Corasick: Multi-pattern matching for literals in O(n) time
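Lazy compilation with a cached result can be sketched using the standard library's `std::cell::OnceCell`. Whether the crate uses that type or the `once_cell` crate is not stated above, so treat this as an assumption. `LazyMatcher` is a hypothetical type, and "compilation" here is only lowercasing a list of extensions.

```rust
use std::cell::{Cell, OnceCell};

/// Holds raw patterns and compiles them on first use only.
struct LazyMatcher {
    raw_extensions: Vec<String>,
    compiled: OnceCell<Vec<String>>, // stand-in for a compiled matcher
    compile_count: Cell<usize>,      // instrumentation for the example
}

impl LazyMatcher {
    fn new(raw_extensions: Vec<String>) -> Self {
        LazyMatcher {
            raw_extensions,
            compiled: OnceCell::new(),
            compile_count: Cell::new(0),
        }
    }

    /// Returns the cached compiled form, building it on the first call.
    fn compiled(&self) -> &Vec<String> {
        self.compiled.get_or_init(|| {
            self.compile_count.set(self.compile_count.get() + 1);
            self.raw_extensions
                .iter()
                .map(|ext| ext.to_lowercase())
                .collect()
        })
    }

    /// Case-insensitive extension match against the compiled form.
    fn is_match(&self, path: &str) -> bool {
        let lower = path.to_lowercase();
        self.compiled().iter().any(|ext| lower.ends_with(ext.as_str()))
    }
}
```

Constructing a `LazyMatcher` costs nothing beyond storing the strings; the compile step runs once, on the first `is_match` call, and later calls reuse the cached value.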
Configuration
MatchOptions
| Field | Type | Default | Description |
|---|---|---|---|
| `case_sensitive` | `bool` | Platform | Match case-sensitively |
| `require_literal_separator` | `bool` | `true` | `*` doesn't match `/` |
| `require_literal_leading_dot` | `bool` | `true` | `*` doesn't match `.hidden` |
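A platform-dependent default could be expressed as in the sketch below. The struct is a local stand-in that mirrors the table, not the crate's definition. It assumes "Platform" means case-insensitive on Windows and case-sensitive elsewhere, based on the feature list; `wildcard_may_match` is a hypothetical helper showing what `require_literal_leading_dot` implies.

```rust
/// Local stand-in mirroring the MatchOptions table (not the crate's type).
#[derive(Debug, Clone, PartialEq)]
struct MatchOptions {
    case_sensitive: bool,
    require_literal_separator: bool,
    require_literal_leading_dot: bool,
}

impl Default for MatchOptions {
    fn default() -> Self {
        MatchOptions {
            // Assumption: only Windows defaults to case-insensitive matching.
            case_sensitive: !cfg!(windows),
            require_literal_separator: true,
            require_literal_leading_dot: true,
        }
    }
}

/// Whether a leading wildcard such as `*` is allowed to match `file_name`.
/// With `require_literal_leading_dot`, names like `.hidden` need an explicit
/// `.` in the pattern.
fn wildcard_may_match(file_name: &str, options: &MatchOptions) -> bool {
    !(options.require_literal_leading_dot && file_name.starts_with('.'))
}
```

Struct update syntax keeps overrides short: `MatchOptions { require_literal_leading_dot: false, ..Default::default() }` relaxes one rule and leaves the rest at their defaults.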
IgnoreOptions
| Field | Type | Default | Description |
|---|---|---|---|
| `hidden` | `bool` | `true` | Ignore hidden files (`.file`) |
| `parents` | `bool` | `true` | Check parent `.gitignore` files |
| `git_global` | `bool` | `false` | Use Git global ignore |
| `git_exclude` | `bool` | `false` | Use `.git/info/exclude` |
Error Handling
All pattern operations return Result<T, PatternError>:
pub enum PatternError {
    InvalidGlob(String),  // Malformed glob syntax
    InvalidRegex(String), // Malformed regex pattern
    IoError(io::Error),   // File read errors
    EmptyPatternSet,      // No patterns provided
}
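An error type of this shape is typically paired with `Display`, `std::error::Error`, and a `From<io::Error>` conversion so that `?` works on file reads. The sketch below redeclares the enum locally to stay self-contained. The trait implementations, the message texts, and the helpers `patterns_from_text` and `load_patterns` are assumptions for illustration, not confirmed parts of the crate.

```rust
use std::{fmt, fs, io};

/// Local copy of the enum above so the sketch compiles on its own.
#[derive(Debug)]
pub enum PatternError {
    InvalidGlob(String),
    InvalidRegex(String),
    IoError(io::Error),
    EmptyPatternSet,
}

impl fmt::Display for PatternError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PatternError::InvalidGlob(p) => write!(f, "invalid glob: {}", p),
            PatternError::InvalidRegex(p) => write!(f, "invalid regex: {}", p),
            PatternError::IoError(e) => write!(f, "I/O error: {}", e),
            PatternError::EmptyPatternSet => write!(f, "no patterns provided"),
        }
    }
}

impl std::error::Error for PatternError {}

impl From<io::Error> for PatternError {
    fn from(err: io::Error) -> Self {
        PatternError::IoError(err)
    }
}

/// Collect non-empty, non-comment lines; an empty result is an error.
fn patterns_from_text(text: &str) -> Result<Vec<String>, PatternError> {
    let patterns: Vec<String> = text
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(String::from)
        .collect();
    if patterns.is_empty() {
        return Err(PatternError::EmptyPatternSet);
    }
    Ok(patterns)
}

/// Read patterns from a file; `?` converts io::Error through the From impl.
fn load_patterns(path: &str) -> Result<Vec<String>, PatternError> {
    let text = fs::read_to_string(path)?;
    patterns_from_text(&text)
}
```

Callers can then match on the variant to decide whether to abort (`IoError`), report a bad pattern to the user (`InvalidGlob`, `InvalidRegex`), or fall back to a default set (`EmptyPatternSet`).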
Integration
scribe-patterns is used throughout Scribe:
- scribe-scanner: Filters files during repository traversal
- scribe-analysis: Selects files for AST parsing
- scribe-selection: Applies include/exclude rules to selection
- CLI: Processes `--include` and `--exclude` flags
See Also
- `scribe-scanner`: Repository scanning and filtering
- `scribe-selection`: File selection using patterns
- `scribe-core`: Shared types and configuration
- globset documentation: Underlying glob implementation
Dependencies
~59MB
~1.5M SLoC