Vexy Glob fast file finding
vexy_glob - Path Accelerated Finding in Rust
vexy_glob is a high-performance Python extension for file system traversal and content searching, built with Rust. It provides a faster and more feature-rich alternative to Python's built-in glob (up to 6x faster) and pathlib (up to 12x faster) modules.
TL;DR
Installation:
pip install vexy_glob
Quick Start:
Find all Python files in the current directory and its subdirectories:
import vexy_glob
for path in vexy_glob.find("**/*.py"):
    print(path)
Find all files containing the text "import asyncio":
for match in vexy_glob.find("**/*.py", content="import asyncio"):
    print(f"{match.path}:{match.line_number}: {match.line_text}")
What is vexy_glob?
vexy_glob is a Python library that provides a powerful and efficient way to find files and search for content within them. It's built on top of the excellent Rust crates ignore (for file traversal) and grep-searcher (for content searching), which are the same engines powering tools like fd and ripgrep.
This means you get the speed and efficiency of Rust, with the convenience and ease of use of Python.
Architecture Overview
┌─────────────────────┐
│ Python API Layer │ ← Your Python code calls vexy_glob.find()
├─────────────────────┤
│ PyO3 Bindings │ ← Zero-copy conversions between Python/Rust
├─────────────────────┤
│ Rust Core Engine │ ← GIL released for true parallelism
│ ┌───────────────┐ │
│ │ ignore crate │ │ ← Parallel directory traversal
│ │ (from fd) │ │ Respects .gitignore files
│ └───────────────┘ │
│ ┌───────────────┐ │
│ │ grep-searcher │ │ ← High-speed content search
│ │ (from ripgrep)│ │ SIMD-accelerated regex
│ └───────────────┘ │
├─────────────────────┤
│ Streaming Channel │ ← Results yielded as found
│ (crossbeam-channel) │ No memory accumulation
└─────────────────────┘
Key Features
- 🚀 Blazing Fast: 10-100x faster than Python's glob and pathlib for many use cases.
- ⚡ Streaming Results: Get the first results in milliseconds, without waiting for the entire file system scan to complete.
- 💾 Memory Efficient: vexy_glob uses constant memory, regardless of the number of files or results.
- 🔥 Parallel Execution: Utilizes all your CPU cores to get the job done as quickly as possible.
- 🔍 Content Searching: Ripgrep-style content searching with regex support.
- 🎯 Rich Filtering: Filter files by size, modification time, and more.
- 🧠 Smart Defaults: Automatically respects .gitignore files and skips hidden files and directories.
- 🌍 Cross-Platform: Works on Linux, macOS, and Windows.
Feature Comparison
| Feature | glob.glob() | pathlib | vexy_glob |
|---|---|---|---|
| Pattern matching | ✅ Basic | ✅ Basic | ✅ Advanced |
| Recursive search | ✅ Slow | ✅ Slow | ✅ Fast |
| Streaming results | ❌ | ❌ | ✅ |
| Content search | ❌ | ❌ | ✅ |
| .gitignore respect | ❌ | ❌ | ✅ |
| Parallel execution | ❌ | ❌ | ✅ |
| Size filtering | ❌ | ❌ | ✅ |
| Time filtering | ❌ | ❌ | ✅ |
| Memory efficiency | ❌ | ❌ | ✅ |
How it Works
vexy_glob uses a Rust-powered backend to perform the heavy lifting of file system traversal and content searching. The Rust extension releases Python's Global Interpreter Lock (GIL), allowing for true parallelism and a significant performance boost.
Results are streamed back to Python as they are found, using a producer-consumer architecture with crossbeam channels. This means you can start processing results immediately, without having to wait for the entire search to finish.
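The producer-consumer idea can be illustrated in pure Python. This is only a sketch of the pattern, not the Rust implementation: the function name stream_files, the buffer_size parameter, and the use of os.walk with a suffix check are all assumptions made for the illustration.

```python
import os
import queue
import threading
from typing import Iterator

_DONE = object()  # sentinel that marks the end of the stream


def stream_files(root: str, suffix: str, buffer_size: int = 64) -> Iterator[str]:
    """Yield matching paths while a background thread is still walking.

    Hypothetical helper: it mimics the streaming pattern with a bounded
    queue, it is not how vexy_glob is implemented internally.
    """
    results: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        try:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    if name.endswith(suffix):
                        # Blocks when the queue is full, so memory stays bounded
                        results.put(os.path.join(dirpath, name))
        finally:
            results.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    # Consumer side: hand each path to the caller as soon as it arrives
    while True:
        item = results.get()
        if item is _DONE:
            return
        yield item
```

The bounded queue is what keeps memory flat: the walker pauses whenever the consumer falls behind, and the first path is available as soon as the walker finds it.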
Why use vexy_glob?
If you find yourself writing scripts that need to find files based on patterns, or search for content within files, vexy_glob can be a game-changer. It's particularly useful for:
- Large codebases: Quickly find files or code snippets in large projects.
- Log file analysis: Search through gigabytes of logs in seconds.
- Data processing pipelines: Efficiently find and process files based on various criteria.
- Build systems: Fast dependency scanning and file collection.
- Data science: Quickly locate and process data files.
- DevOps: Log analysis, configuration management, deployment scripts.
- Testing: Find test files, fixtures, and coverage reports.
- Anywhere you need to find files fast!
When to Use vexy_glob vs Alternatives
| Use Case | Best Tool | Why |
|---|---|---|
| Simple pattern in small directory | glob.glob() | Built-in, no dependencies |
| Large directory, need first result fast | vexy_glob | Streaming results |
| Search file contents | vexy_glob | Integrated content search |
| Complex filtering (size, time, etc.) | vexy_glob | Rich filtering API |
| Cross-platform scripts | vexy_glob | Consistent behavior |
| Git-aware file finding | vexy_glob | Respects .gitignore |
| Memory-constrained environment | vexy_glob | Constant memory usage |
Installation and Usage
Python Library
Install vexy_glob using pip:
pip install vexy_glob
Then use it in your Python code:
import vexy_glob
# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)
Command-Line Interface
vexy_glob also provides a powerful command-line interface for finding files and searching content directly from your terminal.
Finding Files
Use vexy_glob find to locate files matching glob patterns:
# Find all Python files
vexy_glob find "**/*.py"
# Find all markdown files larger than 10KB
vexy_glob find "**/*.md" --min-size 10k
# Find all log files modified in the last 2 days
vexy_glob find "*.log" --mtime-after -2d
# Find only directories
vexy_glob find "*" --type d
# Include hidden files
vexy_glob find "*" --hidden
# Limit search depth
vexy_glob find "**/*.txt" --depth 2
Searching Content
Use vexy_glob search to find content within files:
# Search for "import asyncio" in Python files
vexy_glob search "**/*.py" "import asyncio"
# Search for function definitions using regex
vexy_glob search "src/**/*.rs" "fn\\s+\\w+"
# Search without color output (for piping)
vexy_glob search "**/*.md" "TODO|FIXME" --no-color
# Case-sensitive search
vexy_glob search "*.txt" "Error" --case-sensitive
# Search with size filters
vexy_glob search "**/*.log" "ERROR" --min-size 1M --max-size 100M
# Search recent files only
vexy_glob search "**/*.py" "TODO" --mtime-after -7d
# Complex search with multiple filters
vexy_glob search "src/**/*.{py,js}" "console\.log|print\(" \
--exclude "*test*" \
--mtime-after -30d \
--max-size 50k
Command-Line Options Reference
Common options for both find and search:
| Option | Type | Description | Example |
|---|---|---|---|
| --root | PATH | Root directory to start search | --root /home/user/projects |
| --min-size | SIZE | Minimum file size | --min-size 10k |
| --max-size | SIZE | Maximum file size | --max-size 5M |
| --mtime-after | TIME | Modified after this time | --mtime-after -7d |
| --mtime-before | TIME | Modified before this time | --mtime-before 2024-01-01 |
| --atime-after | TIME | Accessed after this time | --atime-after -1h |
| --atime-before | TIME | Accessed before this time | --atime-before -30d |
| --ctime-after | TIME | Created after this time | --ctime-after -1w |
| --ctime-before | TIME | Created before this time | --ctime-before -1y |
| --no-gitignore | FLAG | Don't respect .gitignore | --no-gitignore |
| --hidden | FLAG | Include hidden files | --hidden |
| --case-sensitive | FLAG | Force case sensitivity | --case-sensitive |
| --type | CHAR | File type (f/d/l) | --type f |
| --extension | STR | File extension(s) | --extension py |
| --exclude | PATTERN | Exclude patterns | --exclude "*test*" |
| --depth | INT | Maximum directory depth | --depth 3 |
| --follow-symlinks | FLAG | Follow symbolic links | --follow-symlinks |
Additional options for search:
| Option | Type | Description | Example |
|---|---|---|---|
| --no-color | FLAG | Disable colored output | --no-color |
Size format examples:
- Bytes: 1024 or "1024"
- Kilobytes: 10k, 10K, 10kb, 10KB
- Megabytes: 5m, 5M, 5mb, 5MB
- Gigabytes: 2g, 2G, 2gb, 2GB
- With decimals: 1.5M, 2.7G, 0.5K
Time format examples:
- Relative: -30s, -5m, -2h, -7d, -2w, -1mo, -1y
- ISO date: 2024-01-01, 2024-01-01T10:30:00
- Natural: yesterday, today (converted to ISO dates)
Unix Pipeline Integration
vexy_glob works seamlessly with Unix pipelines:
# Count Python files
vexy_glob find "**/*.py" | wc -l
# Find Python files containing "async" and edit them
vexy_glob search "**/*.py" "async" --no-color | cut -d: -f1 | sort -u | xargs $EDITOR
# Find large log files and show their sizes
vexy_glob find "*.log" --min-size 100M | xargs ls -lh
# Search for TODOs and format as tasks
vexy_glob search "**/*.py" "TODO" --no-color | awk -F: '{print "- [ ] " $1 ":" $2 ": " $3}'
# Find duplicate file names
vexy_glob find "**/*" --type f | xargs -n1 basename | sort | uniq -d
# Create archive of recent changes
vexy_glob find "**/*" --mtime-after -7d --type f | tar -czf recent_changes.tar.gz -T -
# Find and replace across files
vexy_glob search "**/*.py" "OldClassName" --no-color | cut -d: -f1 | sort -u | xargs sed -i 's/OldClassName/NewClassName/g'
# Generate ctags for Python files
vexy_glob find "**/*.py" | ctags -L -
# Find empty directories
vexy_glob find "**" --type d | while read dir; do [ -z "$(ls -A "$dir")" ] && echo "$dir"; done
# Calculate total size of Python files
# (GNU coreutils shown; on macOS/BSD use "stat -f%z", and note that numfmt is a GNU tool)
vexy_glob find "**/*.py" --type f | xargs stat -c%s | awk '{s+=$1} END {print s}' | numfmt --to=iec
Advanced CLI Patterns
# Monitor for file changes (poor man's watch)
while true; do
    clear
    echo "Files modified in last minute:"
    vexy_glob find "**/*" --mtime-after -1m --type f
    sleep 10
done
# Parallel processing with GNU parallel
vexy_glob find "**/*.jpg" | parallel -j4 convert {} {.}_thumb.jpg
# Create a file manifest with checksums
vexy_glob find "**/*" --type f | while read -r file; do
    echo "$(sha256sum "$file" | cut -d' ' -f1) $file"
done > manifest.txt
# Find files by content and show context
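vexy_glob search "**/*.py" "class.*Error" --no-color | while IFS=: read -r file line rest; do
    # printf handles the newline portably; plain echo does not expand "\n" in bash
    printf '\n=== %s:%s ===\n' "$file" "$line"
    # Clamp the start so matches on lines 1-2 do not produce an invalid sed range
    start=$((line > 2 ? line - 2 : 1))
    sed -n "${start},$((line + 2))p" "$file"
done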
Detailed Python API Reference
Core Functions
vexy_glob.find()
The main function for finding files and searching content.
Basic Syntax
def find(
    pattern: str = "*",
    root: Union[str, Path] = ".",
    *,
    content: Optional[str] = None,
    file_type: Optional[str] = None,
    extension: Optional[Union[str, List[str]]] = None,
    max_depth: Optional[int] = None,
    min_depth: int = 0,
    min_size: Optional[int] = None,
    max_size: Optional[int] = None,
    mtime_after: Optional[Union[float, int, str, datetime]] = None,
    mtime_before: Optional[Union[float, int, str, datetime]] = None,
    atime_after: Optional[Union[float, int, str, datetime]] = None,
    atime_before: Optional[Union[float, int, str, datetime]] = None,
    ctime_after: Optional[Union[float, int, str, datetime]] = None,
    ctime_before: Optional[Union[float, int, str, datetime]] = None,
    hidden: bool = False,
    ignore_git: bool = False,
    case_sensitive: Optional[bool] = None,
    follow_symlinks: bool = False,
    threads: Optional[int] = None,
    as_path: bool = False,
    as_list: bool = False,
    exclude: Optional[Union[str, List[str]]] = None,
) -> Union[Iterator[Union[str, Path, SearchResult]], List[Union[str, Path, SearchResult]]]:
    """Find files matching pattern with optional content search.
    Args:
        pattern: Glob pattern to match files (e.g., "**/*.py", "src/*.js")
        root: Root directory to start search from
        content: Regex pattern to search within files
        file_type: Filter by type - 'f' (file), 'd' (directory), 'l' (symlink)
        extension: File extension(s) to filter by (e.g., "py" or ["py", "pyi"])
        max_depth: Maximum directory depth to search
        min_depth: Minimum directory depth to search
        min_size: Minimum file size in bytes (or use parse_size())
        max_size: Maximum file size in bytes
        mtime_after: Files modified after this time
        mtime_before: Files modified before this time
        atime_after: Files accessed after this time
        atime_before: Files accessed before this time
        ctime_after: Files created after this time
        ctime_before: Files created before this time
        hidden: Include hidden files and directories
        ignore_git: Don't respect .gitignore files
        case_sensitive: Case sensitivity (None = smart case)
        follow_symlinks: Follow symbolic links
        threads: Number of threads (None = auto)
        as_path: Return Path objects instead of strings
        as_list: Return list instead of iterator
        exclude: Patterns to exclude from results
    Returns:
        Iterator or list of file paths (or SearchResult if content is specified)
    """
Basic Examples
import vexy_glob
# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)
# Find all files in the 'src' directory
for path in vexy_glob.find("src/**/*"):
    print(path)
# Get results as a list instead of iterator
python_files = vexy_glob.find("**/*.py", as_list=True)
print(f"Found {len(python_files)} Python files")
# Get results as Path objects
for path in vexy_glob.find("**/*.md", as_path=True):
    print(path.stem)  # Path object methods available
Content Searching
To search for content within files, use the content parameter. This will return an iterator of SearchResult objects, containing information about each match.
import vexy_glob
for match in vexy_glob.find("*.py", content="import requests"):
    print(f"Found a match in {match.path} on line {match.line_number}:")
    print(f"  {match.line_text.strip()}")
SearchResult Object
The SearchResult object has the following attributes:
- path: The path to the file containing the match.
- line_number: The line number of the match (1-indexed).
- line_text: The text of the line containing the match.
- matches: A list of matched strings on the line.
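To make the shape of these attributes concrete, here is a small pure-Python stand-in. SearchResultSketch and search_text are hypothetical names invented for this illustration, and the sketch uses Python's re module, whose syntax differs in places from the Rust regex engine that vexy_glob uses.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SearchResultSketch:
    """Stand-in carrying the four documented attributes (not the library class)."""
    path: str
    line_number: int  # 1-indexed, as documented
    line_text: str
    matches: List[str]


def search_text(path: str, text: str, pattern: str) -> Iterator[SearchResultSketch]:
    """Yield one result per line that contains at least one regex match."""
    regex = re.compile(pattern)
    for line_number, line in enumerate(text.splitlines(), start=1):
        found = [m.group(0) for m in regex.finditer(line)]
        if found:
            yield SearchResultSketch(path, line_number, line, found)
```

A line with several hits produces a single result whose matches list holds every matched string, which is why matches is a list rather than a single value.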
Content Search Examples
# Simple text search
for match in vexy_glob.find("**/*.py", content="TODO"):
    print(f"{match.path}:{match.line_number}: {match.line_text.strip()}")
# Regex pattern search
for match in vexy_glob.find("**/*.py", content=r"def\s+\w+\(.*\):"):
    print(f"Function at {match.path}:{match.line_number}")
# Case-insensitive search
for match in vexy_glob.find("**/*.md", content="python", case_sensitive=False):
    print(match.path)
# Multiple pattern search with OR
for match in vexy_glob.find("**/*.py", content="import (os|sys|pathlib)"):
    print(f"{match.path}: imports {match.matches}")
Filtering Options
Size Filtering
vexy_glob supports human-readable size formats:
import vexy_glob
# Using parse_size() for readable formats
min_size = vexy_glob.parse_size("10K") # 10 kilobytes
max_size = vexy_glob.parse_size("5.5M") # 5.5 megabytes
for path in vexy_glob.find("**/*", min_size=min_size, max_size=max_size):
    print(path)
# Supported formats:
# - Bytes: "1024" or 1024
# - Kilobytes: "10K", "10KB", "10k", "10kb"
# - Megabytes: "5M", "5MB", "5m", "5mb"
# - Gigabytes: "2G", "2GB", "2g", "2gb"
# - Decimal: "1.5M", "2.7G"
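For readers curious how such strings map to byte counts, the following is a rough sketch of a parser for the formats listed above. It is not the library's parse_size(): the name parse_size_sketch is hypothetical, and the choice of binary units (1K = 1024 bytes) is an assumption based on the 1024 * 1024 = 1MB convention used elsewhere in this README.

```python
import re

# Assumption: binary multiples (1K = 1024 bytes)
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([kmg]?)b?\s*$", re.IGNORECASE)


def parse_size_sketch(value) -> int:
    """Convert 1024, "1024", "10k", "5MB", "1.5M" and similar to bytes."""
    if isinstance(value, int):
        return value  # already a byte count
    match = _SIZE_RE.match(str(value))
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit.lower()])
```

For example, parse_size_sketch("10k") gives 10240 under the binary-unit assumption.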
Time Filtering
vexy_glob accepts multiple time formats:
import time
import vexy_glob
from datetime import datetime, timedelta
# 1. Relative time formats
for path in vexy_glob.find("**/*.log", mtime_after="-1d"):  # Last 24 hours
    print(path)
# Supported relative formats:
# - Seconds: "-30s" or "-30"
# - Minutes: "-5m"
# - Hours: "-2h"
# - Days: "-7d"
# - Weeks: "-2w"
# - Months: "-1mo" (30 days)
# - Years: "-1y" (365 days)
# 2. ISO date formats
for path in vexy_glob.find("**/*", mtime_after="2024-01-01"):
    print(path)
# Supported ISO formats:
# - Date: "2024-01-01"
# - DateTime: "2024-01-01T10:30:00"
# - With timezone: "2024-01-01T10:30:00Z"
# 3. Python datetime objects
week_ago = datetime.now() - timedelta(weeks=1)
for path in vexy_glob.find("**/*", mtime_after=week_ago):
    print(path)
# 4. Unix timestamps
hour_ago = time.time() - 3600
for path in vexy_glob.find("**/*", mtime_after=hour_ago):
    print(path)
# Combining time filters
for path in vexy_glob.find(
    "**/*.py",
    mtime_after="-30d",  # Modified within 30 days
    mtime_before="-1d"  # But not in the last 24 hours
):
    print(path)
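The relative formats boil down to subtracting a number of seconds from the current time. A minimal sketch of that conversion follows, using the unit lengths listed above (30-day months, 365-day years). The function name relative_to_timestamp and its now parameter are hypothetical and exist only to make the arithmetic testable.

```python
import re
import time
from typing import Optional

# Seconds per unit, following the conventions listed above
_SECONDS = {
    "": 1, "s": 1, "m": 60, "h": 3600, "d": 86400,
    "w": 7 * 86400, "mo": 30 * 86400, "y": 365 * 86400,
}
# "mo" is listed before "m" so that months are not read as minutes
_REL_RE = re.compile(r"^-(\d+)(s|mo|m|h|d|w|y)?$")


def relative_to_timestamp(spec: str, now: Optional[float] = None) -> float:
    """Turn "-7d", "-2h", "-30" and similar into an absolute Unix timestamp."""
    match = _REL_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognised relative time: {spec!r}")
    amount, unit = match.groups()
    base = time.time() if now is None else now
    return base - int(amount) * _SECONDS[unit or ""]
```

With this reading, mtime_after="-7d" means "modified at or after now minus 604800 seconds".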
Type and Extension Filtering
import vexy_glob
# Filter by file type
for path in vexy_glob.find("**/*", file_type="d"):  # Directories only
    print(f"Directory: {path}")
# File types:
# - "f": Regular files
# - "d": Directories
# - "l": Symbolic links
# Filter by extension
for path in vexy_glob.find("**/*", extension="py"):
    print(path)
# Multiple extensions
for path in vexy_glob.find("**/*", extension=["py", "pyi", "pyx"]):
    print(path)
Exclusion Patterns
import vexy_glob
# Exclude single pattern
for path in vexy_glob.find("**/*.py", exclude="*test*"):
    print(path)
# Exclude multiple patterns
exclusions = [
    "**/__pycache__/**",
    "**/node_modules/**",
    "**/.git/**",
    "**/build/**",
    "**/dist/**"
]
for path in vexy_glob.find("**/*", exclude=exclusions):
    print(path)
# Exclude specific files
for path in vexy_glob.find(
    "**/*.py",
    exclude=["setup.py", "**/conftest.py", "**/*_test.py"]
):
    print(path)
Pattern Matching Guide
Glob Pattern Syntax
| Pattern | Matches | Example |
|---|---|---|
| * | Any characters (except /) | *.py matches test.py |
| ** | Any characters including / | **/*.py matches src/lib/test.py |
| ? | Single character | test?.py matches test1.py |
| [seq] | Character in sequence | test[123].py matches test2.py |
| [!seq] | Character not in sequence | test[!0].py matches test1.py |
| {a,b} | Either pattern a or b | *.{py,js} matches .py and .js files |
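One way to think about the {a,b} alternation is as shorthand for several plain patterns. The sketch below expands non-nested brace groups into that list. The helper name expand_braces is hypothetical, and this is only a mental model: it is not a description of how vexy_glob compiles patterns internally.

```python
import re
from typing import List

# Matches one brace group that contains no further braces
_BRACE_RE = re.compile(r"\{([^{}]*)\}")


def expand_braces(pattern: str) -> List[str]:
    """Expand non-nested {a,b} groups into the equivalent list of patterns."""
    match = _BRACE_RE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded: List[str] = []
    for option in match.group(1).split(","):
        # Recurse so that later groups in the same pattern are expanded too
        expanded.extend(expand_braces(head + option + tail))
    return expanded
```

Under this model, searching for "*.{py,js}" is equivalent to searching for "*.py" and "*.js" in a single pass.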
Smart Case Detection
By default, vexy_glob uses smart case detection:
- If pattern contains uppercase → case-sensitive
- If pattern is all lowercase → case-insensitive
# Case-insensitive (finds README.md, readme.md, etc.)
vexy_glob.find("readme.md")
# Case-sensitive (only finds README.md)
vexy_glob.find("README.md")
# Force case sensitivity
vexy_glob.find("readme.md", case_sensitive=True)
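The smart-case rule is small enough to express in a few lines. This sketch mirrors the rule as described above; the function name effective_case_sensitivity is hypothetical, and the library's actual check may differ in edge cases such as escaped characters.

```python
from typing import Optional


def effective_case_sensitivity(pattern: str, case_sensitive: Optional[bool] = None) -> bool:
    """Return True when matching should be case-sensitive."""
    if case_sensitive is not None:
        return case_sensitive  # an explicit setting always wins
    # Smart case: any uppercase character makes the match case-sensitive
    return any(ch.isupper() for ch in pattern)
```

So "readme.md" is matched case-insensitively, "README.md" case-sensitively, and case_sensitive=True or False overrides the detection either way.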
Drop-in Replacements
vexy_glob provides drop-in replacements for standard library functions:
# Replace glob.glob()
import vexy_glob
files = vexy_glob.glob("**/*.py", recursive=True)
# Replace glob.iglob()
for path in vexy_glob.iglob("**/*.py", recursive=True):
    print(path)
# Migration from standard library
# OLD:
import glob
files = glob.glob("**/*.py", recursive=True)
# NEW: Just change the import!
import vexy_glob as glob
files = glob.glob("**/*.py", recursive=True) # 10-100x faster!
Performance
Benchmark Results
Benchmarks on a directory with 100,000 files:
| Operation | glob.glob() | pathlib | vexy_glob | Speedup |
|---|---|---|---|---|
| Find all .py files | 15.2s | 18.1s | 0.2s | 76x |
| Time to first result | 15.2s | 18.1s | 0.005s | 3040x |
| Memory usage | 1.2GB | 1.5GB | 45MB | 27x less |
| With .gitignore | N/A | N/A | 0.15s | N/A |
Performance Characteristics
- Linear scaling: Performance scales linearly with file count
- I/O bound: SSD vs HDD makes a significant difference
- Cache friendly: Repeated searches benefit from OS file cache
- Memory constant: Uses ~45MB regardless of result count
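Because results depend heavily on disk type and cache state, it is worth measuring on your own tree. The helper below is a minimal sketch for timing any iterable-producing call; measure_stream is a hypothetical name and is not part of vexy_glob or its benchmark suite.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(make_iterable: Callable[[], Iterable]) -> Tuple[float, float, int]:
    """Return (seconds to first item, total seconds, item count)."""
    start = time.perf_counter()
    first_latency = None
    count = 0
    for _ in make_iterable():
        if first_latency is None:
            first_latency = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream has no first item, so report the total instead
    return (first_latency if first_latency is not None else total), total, count
```

Pass a zero-argument callable so the work starts inside the timer, for example measure_stream(lambda: glob.iglob("**/*.py", recursive=True)) from the standard library, then compare against the same call with vexy_glob.find. Run each measurement more than once, since the second pass usually benefits from the OS file cache.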
Performance Tips
- Use specific patterns: src/**/*.py is faster than **/*.py
- Limit depth: Use max_depth when you know the structure
- Exclude early: Use exclude patterns to skip large directories
- Leverage .gitignore: Default behavior skips ignored files
Cookbook - Real-World Examples
Working with Git Repositories
import vexy_glob
# Find all Python files, respecting .gitignore (default behavior)
for path in vexy_glob.find("**/*.py"):
    print(path)
# Include files that are gitignored
for path in vexy_glob.find("**/*.py", ignore_git=True):
    print(path)
Finding Large Log Files
import os
import vexy_glob
# Find log files larger than 100MB
for path in vexy_glob.find("**/*.log", min_size=vexy_glob.parse_size("100M")):
    size_mb = os.path.getsize(path) / 1024 / 1024
    print(f"{path}: {size_mb:.1f}MB")
# Find log files between 10MB and 1GB
for path in vexy_glob.find(
    "**/*.log",
    min_size=vexy_glob.parse_size("10M"),
    max_size=vexy_glob.parse_size("1G")
):
    print(path)
Finding Recently Modified Files
import vexy_glob
from datetime import datetime, timedelta
# Files modified in the last 24 hours
for path in vexy_glob.find("**/*", mtime_after="-1d"):
    print(path)
# Files modified between 1 and 7 days ago
for path in vexy_glob.find(
    "**/*",
    mtime_after="-7d",
    mtime_before="-1d"
):
    print(path)
# Files modified after a specific date
for path in vexy_glob.find("**/*", mtime_after="2024-01-01"):
    print(path)
Code Search - Finding TODOs and FIXMEs
import vexy_glob
# Find all TODO comments in Python files
for match in vexy_glob.find("**/*.py", content=r"TODO|FIXME"):
    print(f"{match.path}:{match.line_number}: {match.line_text.strip()}")
# Find specific function definitions
for match in vexy_glob.find("**/*.py", content=r"def\s+process_data"):
    print(f"Found function at {match.path}:{match.line_number}")
Finding Duplicate Files by Size
import os
import vexy_glob
from collections import defaultdict
# Group files by size to find potential duplicates
size_groups = defaultdict(list)
for path in vexy_glob.find("**/*", file_type="f"):
    size = os.path.getsize(path)
    if size > 0:  # Skip empty files
        size_groups[size].append(path)
# Print potential duplicates
for size, paths in size_groups.items():
    if len(paths) > 1:
        print(f"\nPotential duplicates ({size} bytes):")
        for path in paths:
            print(f"  {path}")
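Equal size only suggests a duplicate. A common follow-up is to hash the candidates and keep the groups whose contents really match. The sketch below uses only the standard library; group_by_digest is a hypothetical helper, and SHA-256 is one reasonable choice among several.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_digest(paths: Iterable[str], chunk_size: int = 1 << 20) -> Dict[str, List[str]]:
    """Return {sha256 hex digest: [paths]} for digests shared by 2+ files."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in chunks so large files do not have to fit in memory
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        groups[digest.hexdigest()].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Feeding it each same-size group from the loop above limits hashing to files that could plausibly be identical.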
Cleaning Build Artifacts
import vexy_glob
import os
import shutil
# Find and remove Python cache files
cache_patterns = [
    "**/__pycache__/**",
    "**/*.pyc",
    "**/*.pyo",
    "**/.pytest_cache/**",
    "**/.mypy_cache/**"
]
for pattern in cache_patterns:
    for path in vexy_glob.find(pattern, hidden=True):
        if os.path.isfile(path):
            os.remove(path)
            print(f"Removed: {path}")
        elif os.path.isdir(path):
            shutil.rmtree(path)
            print(f"Removed directory: {path}")
Project Statistics
import vexy_glob
from collections import Counter
import os
# Count files by extension
extension_counts = Counter()
for path in vexy_glob.find("**/*", file_type="f"):
    ext = os.path.splitext(path)[1].lower()
    if ext:
        extension_counts[ext] += 1
# Print top 10 file types
print("Top 10 file types in project:")
for ext, count in extension_counts.most_common(10):
    print(f"  {ext}: {count} files")
# Advanced statistics
total_size = 0
file_count = 0
largest_file = None
largest_size = 0
for path in vexy_glob.find("**/*", file_type="f"):
    size = os.path.getsize(path)
    total_size += size
    file_count += 1
    if size > largest_size:
        largest_size = size
        largest_file = path
print("\nProject Statistics:")
print(f"Total files: {file_count:,}")
print(f"Total size: {total_size / 1024 / 1024:.1f} MB")
if file_count:  # Avoid dividing by zero when no files were found
    print(f"Average file size: {total_size / file_count / 1024:.1f} KB")
    print(f"Largest file: {largest_file} ({largest_size / 1024 / 1024:.1f} MB)")
Integration with pandas
import vexy_glob
import pandas as pd
import os
# Create a DataFrame of all Python files with metadata
file_data = []
for path in vexy_glob.find("**/*.py"):
    stat = os.stat(path)
    # Count lines inside a with-block so the file handle is closed promptly
    with open(path, 'r', errors='ignore') as f:
        line_count = sum(1 for _ in f)
    file_data.append({
        'path': path,
        'size': stat.st_size,
        'modified': pd.Timestamp(stat.st_mtime, unit='s'),
        'lines': line_count
    })
df = pd.DataFrame(file_data)
# Analyze the data
print(f"Total Python files: {len(df)}")
print(f"Total lines of code: {df['lines'].sum():,}")
print(f"Average file size: {df['size'].mean():.0f} bytes")
print(f"\nLargest files:")
print(df.nlargest(5, 'size')[['path', 'size', 'lines']])
Parallel Processing Found Files
import vexy_glob
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    """Process a single file (e.g., count lines)"""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return path, sum(1 for _ in f)
    except (OSError, UnicodeDecodeError):
        # Unreadable or non-UTF-8 files are counted as zero lines
        return path, 0

# The __main__ guard is needed where worker processes are spawned
# (the default on Windows and macOS)
if __name__ == "__main__":
    # Get all files as a list
    files = vexy_glob.find("**/*.py", as_list=True)
    # Process all Python files in parallel
    with ProcessPoolExecutor() as executor:
        results = executor.map(process_file, files)
        # Collect results
        total_lines = 0
        for path, lines in results:
            total_lines += lines
            if lines > 1000:
                print(f"Large file: {path} ({lines} lines)")
    print(f"\nTotal lines of code: {total_lines:,}")
Migration Guide
Migrating from glob
# OLD: Using glob
import glob
import os
# Find all Python files
files = glob.glob("**/*.py", recursive=True)
# Filter by size manually
large_files = []
for f in files:
    if os.path.getsize(f) > 1024 * 1024:  # 1MB
        large_files.append(f)
# NEW: Using vexy_glob
import vexy_glob
# Find large Python files directly
large_files = vexy_glob.find("**/*.py", min_size=1024*1024, as_list=True)
Migrating from pathlib
# OLD: Using pathlib
from pathlib import Path
# Find all Python files
files = list(Path(".").rglob("*.py"))
# Filter by modification time manually
import datetime
recent = []
for f in files:
    if f.stat().st_mtime > (datetime.datetime.now() - datetime.timedelta(days=7)).timestamp():
        recent.append(f)
# NEW: Using vexy_glob
import vexy_glob
# Find recent Python files directly
recent = vexy_glob.find("**/*.py", mtime_after="-7d", as_path=True, as_list=True)
Migrating from os.walk
# OLD: Using os.walk
import os
# Find all .txt files
txt_files = []
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".txt"):
            txt_files.append(os.path.join(root, file))
# NEW: Using vexy_glob
import vexy_glob
# Much simpler and faster!
txt_files = vexy_glob.find("**/*.txt", as_list=True)
Development
This project is built with maturin - a tool for building and publishing Rust-based Python extensions.
Prerequisites
- Python 3.8 or later
- Rust toolchain (install from rustup.rs)
- uv for fast Python package management (optional but recommended)
Setting Up Development Environment
# Clone the repository
git clone https://github.com/vexyart/vexy-glob.git
cd vexy-glob
# Set up a virtual environment (using uv for faster installation)
pip install uv
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install development dependencies
uv sync
# Build the Rust extension in development mode
python sync_version.py # Sync version from git tags to Cargo.toml
maturin develop
# Run tests
pytest tests/
# Run benchmarks
pytest tests/test_benchmarks.py -v --benchmark-only
Building Release Artifacts
The project uses a streamlined build system with automatic versioning from git tags.
Quick Build
# Build both wheel and source distribution
./build.sh
This script will:
- Sync the version from git tags to Cargo.toml
- Build an optimized wheel for your platform
- Build a source distribution (sdist)
- Place all artifacts in the dist/ directory
Manual Build
# Ensure you have the latest tags
git fetch --tags
# Sync version to Cargo.toml
python sync_version.py
# Build wheel (platform-specific)
python -m maturin build --release -o dist/
# Build source distribution
python -m maturin sdist -o dist/
Build System Details
The project uses:
- maturin as the build backend for creating Python wheels from Rust code
- setuptools-scm for automatic versioning based on git tags
- sync_version.py to synchronize versions between git tags and Cargo.toml
Key files:
- pyproject.toml - Python project configuration with maturin as build backend
- Cargo.toml - Rust project configuration
- sync_version.py - Version synchronization script
- build.sh - Convenience build script
Versioning
Versions are managed through git tags:
# Create a new version tag
git tag v1.0.4
git push origin v1.0.4
# Build with the new version
./build.sh
The version will be automatically detected and used for both the Python package and Rust crate.
Project Structure
vexy-glob/
├── src/ # Rust source code
│ ├── lib.rs # Main Rust library with PyO3 bindings
│ └── ...
├── vexy_glob/ # Python package
│ ├── __init__.py # Python API wrapper
│ ├── __main__.py # CLI implementation
│ └── ...
├── tests/ # Python tests
│ ├── test_*.py # Unit and integration tests
│ └── test_benchmarks.py # Performance benchmarks
├── Cargo.toml # Rust project configuration
├── pyproject.toml # Python project configuration
├── sync_version.py # Version synchronization script
└── build.sh # Build automation script
CI/CD
The project uses GitHub Actions for continuous integration:
- Testing on Linux, macOS, and Windows
- Python versions 3.8 through 3.12
- Automatic wheel building for releases
- Cross-platform compatibility testing
Exceptions and Error Handling
Exception Hierarchy
VexyGlobError(Exception)
├── PatternError(VexyGlobError, ValueError)
│ └── Raised for invalid glob patterns
├── SearchError(VexyGlobError, IOError)
│ └── Raised for I/O or permission errors
└── TraversalNotSupportedError(VexyGlobError, NotImplementedError)
└── Raised for unsupported operations
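Because each exception also derives from a built-in type, existing handlers for ValueError, IOError (an alias of OSError in Python 3) or NotImplementedError keep working. The classes below are local stand-ins that mirror the hierarchy shown above so the behaviour can be checked without the library installed; caught_as is a hypothetical helper.

```python
class VexyGlobError(Exception):
    """Local stand-in for the documented base class."""


class PatternError(VexyGlobError, ValueError):
    """Stand-in: invalid glob patterns."""


class SearchError(VexyGlobError, IOError):
    """Stand-in: I/O or permission errors."""


class TraversalNotSupportedError(VexyGlobError, NotImplementedError):
    """Stand-in: unsupported operations."""


def caught_as(exc: Exception, builtin: type) -> bool:
    """Report whether an `except builtin:` clause would catch exc."""
    try:
        raise exc
    except builtin:
        return True
    except Exception:
        return False
```

In practice this means code that already wraps file operations in `except OSError:` would also catch a SearchError, while `except VexyGlobError:` catches all three.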
Error Handling Examples
import vexy_glob
from vexy_glob import VexyGlobError, PatternError, SearchError
try:
    # Invalid pattern
    for path in vexy_glob.find("[invalid"):
        print(path)
except PatternError as e:
    print(f"Invalid pattern: {e}")
try:
    # Permission denied or I/O error
    for path in vexy_glob.find("**/*", root="/root"):
        print(path)
except SearchError as e:
    print(f"Search failed: {e}")
# Handle any vexy_glob error
try:
    results = vexy_glob.find("**/*.py", content="[invalid regex")
except VexyGlobError as e:
    print(f"Operation failed: {e}")
Platform-Specific Considerations
Windows
- Use forward slashes / in patterns (automatically converted)
- Hidden files: Files with hidden attribute are included with hidden=True
- Case sensitivity: Windows is case-insensitive by default
# Windows-specific examples
import vexy_glob
# These are equivalent on Windows
vexy_glob.find("C:/Users/*/Documents/*.docx")
vexy_glob.find("C:\\Users\\*\\Documents\\*.docx") # Also works
# Find hidden files on Windows
for path in vexy_glob.find("**/*", hidden=True):
    print(path)
macOS
- .DS_Store files are excluded by default (via .gitignore)
- Case sensitivity depends on file system (usually case-insensitive)
# macOS-specific examples
import vexy_glob
# Exclude .DS_Store and other macOS metadata
for path in vexy_glob.find("**/*", exclude=["**/.DS_Store", "**/.Spotlight-V100", "**/.Trashes"]):
    print(path)
Linux
- Always case-sensitive
- Hidden files start with .
- Respects standard Unix permissions
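# Linux-specific examples
import os
import vexy_glob
# Find files in the home directory's config folder.
# "~" is a shell convention, so expand it explicitly and pass it as root.
config_root = os.path.expanduser("~/.config")
for path in vexy_glob.find("**/*.conf", root=config_root, hidden=True):
    print(path)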
Troubleshooting
Common Issues
1. No results found
# Check if you need hidden files
results = list(vexy_glob.find("*"))
if not results:
    # Try with hidden files
    results = list(vexy_glob.find("*", hidden=True))
# Check if .gitignore is excluding files
results = list(vexy_glob.find("**/*.py", ignore_git=True))
2. Pattern not matching expected files
# Debug pattern matching
import vexy_glob
# Too specific?
print(list(vexy_glob.find("src/lib/test.py"))) # Only exact match
# Use wildcards
print(list(vexy_glob.find("src/**/test.py"))) # Any depth
print(list(vexy_glob.find("src/*/test.py"))) # One level only
3. Content search not finding matches
# Check regex syntax
import vexy_glob
# Wrong: {a,b} is glob brace syntax, not regex alternation
results = vexy_glob.find("**/*.py", content=r"import\s+{re,os}")
# Correct: Standard regex
results = vexy_glob.find("**/*.py", content=r"import\s+(re|os)")
# Case sensitivity
results = vexy_glob.find("**/*.py", content="TODO", case_sensitive=False)
4. Performance issues
# Optimize your search
import vexy_glob
# Slow: Searching everything
for match in vexy_glob.find("**/*.py", content="import"):
    print(match.path)
# Fast: Limit scope
for match in vexy_glob.find("src/**/*.py", content="import", max_depth=3):
    print(match.path)
# Use exclusions
for path in vexy_glob.find(
    "**/*.py",
    exclude=["**/node_modules/**", "**/.venv/**", "**/build/**"]
):
    print(path)
Build Issues
If you encounter build issues:
- Rust not found: Install Rust from rustup.rs
- maturin not found: Run pip install maturin
- Version mismatch: Run python sync_version.py to sync versions
- Import errors: Ensure you've run maturin develop after changes
- Build fails: Check that you have the latest Rust stable toolchain
Debug Mode
import vexy_glob
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
# This will show internal operations
for path in vexy_glob.find("**/*.py"):
    print(path)
FAQ
Q: Why is vexy_glob so much faster than glob?
A: vexy_glob uses Rust's parallel directory traversal, releases Python's GIL, and streams results as they're found instead of collecting everything first.
Q: Does vexy_glob follow symbolic links?
A: By default, no. Use follow_symlinks=True to enable. Loop detection is built-in.
Q: Can I use vexy_glob with async/await?
A: Yes! Use it with asyncio.to_thread() (available in Python 3.9 and later):
import asyncio
import vexy_glob
async def find_files():
    return await asyncio.to_thread(
        vexy_glob.find, "**/*.py", as_list=True
    )
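The same idea extends to several searches at once. The sketch below runs one blocking search per root in worker threads and gathers the results. To stay self-contained it uses a hypothetical stand-in, find_py, built on os.walk. In real use you would replace its body with a vexy_glob.find(..., as_list=True) call, as in the answer above.

```python
import asyncio
import os
import tempfile


def find_py(root):
    """Blocking stand-in for vexy_glob.find("**/*.py", root=root, as_list=True)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        hits.extend(
            os.path.join(dirpath, name)
            for name in filenames
            if name.endswith(".py")
        )
    return sorted(hits)


async def find_many(roots):
    """Run one blocking search per root in worker threads and gather the results."""
    per_root = await asyncio.gather(
        *(asyncio.to_thread(find_py, root) for root in roots)
    )
    return dict(zip(roots, per_root))


# Demo on a throwaway tree with two roots.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("src/a.py", "src/b.txt", "tests/test_a.py"):
        path = os.path.join(tmp, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    roots = [os.path.join(tmp, "src"), os.path.join(tmp, "tests")]
    found = asyncio.run(find_many(roots))
    counts = [len(found[root]) for root in roots]

print(counts)
```

asyncio.to_thread requires Python 3.9 or newer. Passing as_list=True matters here because the whole search then completes inside the worker thread instead of being iterated lazily on the event loop.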
Q: How do I search in multiple directories?
A: Call find() multiple times or use a common parent:
# Option 1: Multiple calls
results = []
for root in ["src", "tests", "docs"]:
    results.extend(vexy_glob.find("**/*.py", root=root, as_list=True))
# Option 2: Common parent with specific patterns
results = vexy_glob.find("{src,tests,docs}/**/*.py", as_list=True)
Q: Is the content search as powerful as ripgrep?
A: The matching engine is the same: vexy_glob uses the grep-searcher crate that powers ripgrep, including its SIMD optimizations.
Advanced Configuration
Custom Ignore Files
import vexy_glob
# By default, respects .gitignore
for path in vexy_glob.find("**/*.py"):
    print(path)
# Also respects .ignore and .fdignore files
# Create .ignore in your project root:
# echo "test_*.py" > .ignore
# Now test files will be excluded
for path in vexy_glob.find("**/*.py"):
    print(path) # test_*.py files excluded
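To preview what a simple ignore pattern such as test_*.py would exclude, a rough approximation with the standard library can help. The sketch below matches patterns against file names only, using fnmatch. This is an assumption made for illustration: real ignore files follow gitignore rules (negation, directory anchoring, nested ignore files) that this hypothetical is_ignored helper does not implement.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path, patterns):
    """Approximate check: does the file name match any basename pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


patterns = ["test_*.py"]  # same pattern as the .ignore example above
candidates = [
    "src/app.py",
    "tests/test_app.py",
    "src/test_utils.py",
    "src/contest_a.py",  # contains "test_" but does not start with it
]
kept = [p for p in candidates if not is_ignored(p, patterns)]

print(kept)
```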
Thread Configuration
import vexy_glob
import os
# Auto-detect (default)
for path in vexy_glob.find("**/*.py"):
    pass
# Limit threads for CPU-bound operations
for match in vexy_glob.find("**/*.py", content="TODO", threads=2):
    pass
# Max parallelism for I/O-bound operations
cpu_count = os.cpu_count() or 4
for path in vexy_glob.find("**/*", threads=cpu_count * 2):
    pass
Contributing
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch (git checkout -b feature-name)
- Make your changes
- Run tests (pytest tests/)
- Format code (cargo fmt for Rust, ruff format for Python)
- Commit with descriptive messages
- Push and open a pull request
Before submitting:
- Ensure all tests pass
- Add tests for new functionality
- Update documentation as needed
- Follow existing code style
Running the Full Test Suite
# Python tests
pytest tests/ -v
# Python tests with coverage
pytest tests/ --cov=vexy_glob --cov-report=html
# Rust tests
cargo test
# Benchmarks
pytest tests/test_benchmarks.py -v --benchmark-only
# Linting
cargo clippy -- -D warnings
ruff check .
API Stability and Versioning
vexy_glob follows Semantic Versioning:
- Major version (1.x.x): Breaking API changes
- Minor version (x.1.x): New features, backwards compatible
- Patch version (x.x.1): Bug fixes only
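Under these rules, any newer release within the same major version should be safe to upgrade to. A minimal sketch of that check follows. It assumes plain MAJOR.MINOR.PATCH strings with no pre-release or build suffixes, and the helper names are hypothetical.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible_upgrade(installed, candidate):
    """True if candidate is the same or newer release within the same major version."""
    old = parse_version(installed)
    new = parse_version(candidate)
    return new[0] == old[0] and new >= old


print(is_compatible_upgrade("1.0.9", "1.2.0"))  # minor bump: new features only
print(is_compatible_upgrade("1.0.9", "2.0.0"))  # major bump: may break the API
```

The equivalent dependency pin would be a range such as vexy_glob>=1.0.9,<2.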
Stable API Guarantees
The following are guaranteed stable in 1.x:
- find() function signature and basic parameters
- glob() and iglob() compatibility functions
- SearchResult object attributes
- Exception hierarchy
- CLI command structure
Experimental Features
Features marked experimental may change:
- Thread count optimization algorithms
- Internal buffer size tuning
- Specific error message text
Performance Tuning Guide
For Maximum Speed
import vexy_glob
# 1. Be specific with patterns
# Slow:
vexy_glob.find("**/*.py")
# Fast:
vexy_glob.find("src/**/*.py")
# 2. Use depth limits when possible
vexy_glob.find("**/*.py", max_depth=3)
# 3. Exclude unnecessary directories
vexy_glob.find(
    "**/*.py",
    exclude=["**/venv/**", "**/node_modules/**", "**/.git/**"]
)
# 4. Use file type filters
vexy_glob.find("**/*.py", file_type="f") # Skip directories
For Memory Efficiency
# Stream results instead of collecting
# Memory efficient:
for path in vexy_glob.find("**/*"):
    process(path) # Process one at a time
# Memory intensive:
all_files = vexy_glob.find("**/*", as_list=True) # Loads all in memory
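Streaming also makes it cheap to take only the first few results. The sketch below uses itertools.islice on a hypothetical stand-in generator, fake_find, to show that only the requested items are ever produced. With vexy_glob you would pass the iterator returned by find() instead. Whether the underlying traversal stops promptly once iteration ends depends on the library and is not shown here.

```python
from itertools import islice

produced = []  # records every item the generator actually yields


def fake_find(total=1_000_000):
    """Hypothetical stand-in for a streaming search such as vexy_glob.find('**/*')."""
    for index in range(total):
        produced.append(index)
        yield f"file_{index}.txt"


def first_n(results, n):
    """Consume at most n items from any iterator of results."""
    return list(islice(results, n))


sample = first_n(fake_find(), 3)

print(sample, len(produced))
```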
For I/O Optimization
# Optimize thread count based on storage type
import vexy_glob
# SSD: More threads help
for path in vexy_glob.find("**/*", threads=8):
    pass
# HDD: Fewer threads to avoid seek thrashing
for path in vexy_glob.find("**/*", threads=2):
    pass
# Network storage: Single thread might be best
for path in vexy_glob.find("**/*", threads=1):
    pass
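The thread counts above are starting points, so it is usually worth measuring on the actual storage. The sketch below times a search callable at several thread counts. The names time_search and fake_search are hypothetical, and fake_search is a stand-in so the block runs anywhere. In real use you would pass something like lambda n: vexy_glob.find("**/*", threads=n).

```python
import time


def time_search(search, thread_counts):
    """Time one full pass of search(n) for each thread count n.

    Returns {n: (seconds, result_count)}.
    """
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        count = sum(1 for _ in search(n))  # drain the iterator completely
        timings[n] = (time.perf_counter() - start, count)
    return timings


def fake_search(threads):
    """Hypothetical stand-in that yields 100 results regardless of thread count."""
    return iter(range(100))


timings = time_search(fake_search, [1, 2, 8])
best = min(timings, key=lambda n: timings[n][0])

for n, (seconds, count) in timings.items():
    print(f"threads={n}: {count} results in {seconds:.6f}s")
print("fastest:", best)
```

The first run often warms the operating system's file cache, so repeat the measurement a few times and compare the later runs.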
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgments
- Built on the excellent Rust crates:
  - ignore - Fast directory traversal
  - grep-searcher - High-performance text search
  - globset - Efficient glob matching
- Inspired by tools like fd and ripgrep
- Thanks to the PyO3 team for excellent Python-Rust bindings
Related Projects
- fd - A simple, fast alternative to find
- ripgrep - Recursively search directories for a regex pattern
- walkdir - Python's built-in directory traversal
- scandir - Better directory iteration for Python
Happy fast file finding! 🚀
If you find vexy_glob useful, please consider giving it a star on GitHub!