Vexy Glob fast file finding

These details have not been verified by PyPI

Project links

Project description

vexy_glob - Path Accelerated Finding in Rust

vexy_glob is a high-performance Python extension for file system traversal and content searching, built with Rust. It provides a faster and more feature-rich alternative to Python's built-in glob (up to 6x faster) and pathlib (up to 12x faster) modules.

TL;DR

Installation:

pip install vexy_glob

Quick Start:

Find all Python files in the current directory and its subdirectories:

import vexy_glob

for path in vexy_glob.find("**/*.py"):
    print(path)

Find all files containing the text "import asyncio":

for match in vexy_glob.find("**/*.py", content="import asyncio"):
    print(f"{match.path}:{match.line_number}: {match.line_text}")

What is `vexy_glob`?

vexy_glob is a Python library that provides a powerful and efficient way to find files and search for content within them. It's built on top of the excellent Rust crates ignore (for file traversal) and grep-searcher (for content searching), which are the same engines powering tools like fd and ripgrep.

This means you get the speed and efficiency of Rust, with the convenience and ease of use of Python.

Architecture Overview

┌─────────────────────┐
│   Python API Layer  │  ← Your Python code calls vexy_glob.find()
├─────────────────────┤
│    PyO3 Bindings    │  ← Zero-copy conversions between Python/Rust
├─────────────────────┤
│  Rust Core Engine   │  ← GIL released for true parallelism
│  ┌───────────────┐  │
│  │ ignore crate  │  │  ← Parallel directory traversal
│  │ (from fd)     │  │     Respects .gitignore files
│  └───────────────┘  │
│  ┌───────────────┐  │
│  │ grep-searcher │  │  ← High-speed content search
│  │ (from ripgrep)│  │     SIMD-accelerated regex
│  └───────────────┘  │
├─────────────────────┤
│ Streaming Channel   │  ← Results yielded as found
│ (crossbeam-channel) │     No memory accumulation
└─────────────────────┘

Key Features

🚀 Blazing Fast: 10-100x faster than Python's glob and pathlib for many use cases.
⚡ Streaming Results: Get the first results in milliseconds, without waiting for the entire file system scan to complete.
💾 Memory Efficient: vexy_glob uses constant memory, regardless of the number of files or results.
🔥 Parallel Execution: Utilizes all your CPU cores to get the job done as quickly as possible.
🔍 Content Searching: Ripgrep-style content searching with regex support.
🎯 Rich Filtering: Filter files by size, modification time, and more.
🧠 Smart Defaults: Automatically respects .gitignore files and skips hidden files and directories.
🌍 Cross-Platform: Works on Linux, macOS, and Windows.

Feature Comparison

Feature	`glob.glob()`	`pathlib`	`vexy_glob`
Pattern matching	✅ Basic	✅ Basic	✅ Advanced
Recursive search	✅ Slow	✅ Slow	✅ Fast
Streaming results	❌	❌	✅
Content search	❌	❌	✅
.gitignore respect	❌	❌	✅
Parallel execution	❌	❌	✅
Size filtering	❌	❌	✅
Time filtering	❌	❌	✅
Memory efficiency	❌	❌	✅

How it Works

vexy_glob uses a Rust-powered backend to perform the heavy lifting of file system traversal and content searching. The Rust extension releases Python's Global Interpreter Lock (GIL), allowing for true parallelism and a significant performance boost.

Results are streamed back to Python as they are found, using a producer-consumer architecture with crossbeam channels. This means you can start processing results immediately, without having to wait for the entire search to finish.

Why use `vexy_glob`?

If you find yourself writing scripts that need to find files based on patterns, or search for content within files, vexy_glob can be a game-changer. It's particularly useful for:

Large codebases: Quickly find files or code snippets in large projects.
Log file analysis: Search through gigabytes of logs in seconds.
Data processing pipelines: Efficiently find and process files based on various criteria.
Build systems: Fast dependency scanning and file collection.
Data science: Quickly locate and process data files.
DevOps: Log analysis, configuration management, deployment scripts.
Testing: Find test files, fixtures, and coverage reports.
Anywhere you need to find files fast!

When to Use vexy_glob vs Alternatives

Use Case	Best Tool	Why
Simple pattern in small directory	`glob.glob()`	Built-in, no dependencies
Large directory, need first result fast	`vexy_glob`	Streaming results
Search file contents	`vexy_glob`	Integrated content search
Complex filtering (size, time, etc.)	`vexy_glob`	Rich filtering API
Cross-platform scripts	`vexy_glob`	Consistent behavior
Git-aware file finding	`vexy_glob`	Respects .gitignore
Memory-constrained environment	`vexy_glob`	Constant memory usage

Installation and Usage

Python Library

Install vexy_glob using pip:

pip install vexy_glob

Then use it in your Python code:

import vexy_glob

# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)

Command-Line Interface

vexy_glob also provides a powerful command-line interface for finding files and searching content directly from your terminal.

Finding Files

Use vexy_glob find to locate files matching glob patterns:

# Find all Python files
vexy_glob find "**/*.py"

# Find all markdown files larger than 10KB
vexy_glob find "**/*.md" --min-size 10k

# Find all log files modified in the last 2 days
vexy_glob find "*.log" --mtime-after -2d

# Find only directories
vexy_glob find "*" --type d

# Include hidden files
vexy_glob find "*" --hidden

# Limit search depth
vexy_glob find "**/*.txt" --depth 2

Searching Content

Use vexy_glob search to find content within files:

# Search for "import asyncio" in Python files
vexy_glob search "**/*.py" "import asyncio"

# Search for function definitions using regex
vexy_glob search "src/**/*.rs" "fn\\s+\\w+"

# Search without color output (for piping)
vexy_glob search "**/*.md" "TODO|FIXME" --no-color

# Case-sensitive search
vexy_glob search "*.txt" "Error" --case-sensitive

# Search with size filters
vexy_glob search "**/*.log" "ERROR" --min-size 1M --max-size 100M

# Search recent files only
vexy_glob search "**/*.py" "TODO" --mtime-after -7d

# Complex search with multiple filters
vexy_glob search "src/**/*.{py,js}" "console\.log|print\(" \
    --exclude "*test*" \
    --mtime-after -30d \
    --max-size 50k

Command-Line Options Reference

Common options for both find and search:

Option	Type	Description	Example
`--root`	PATH	Root directory to start search	`--root /home/user/projects`
`--min-size`	SIZE	Minimum file size	`--min-size 10k`
`--max-size`	SIZE	Maximum file size	`--max-size 5M`
`--mtime-after`	TIME	Modified after this time	`--mtime-after -7d`
`--mtime-before`	TIME	Modified before this time	`--mtime-before 2024-01-01`
`--atime-after`	TIME	Accessed after this time	`--atime-after -1h`
`--atime-before`	TIME	Accessed before this time	`--atime-before -30d`
`--ctime-after`	TIME	Created after this time	`--ctime-after -1w`
`--ctime-before`	TIME	Created before this time	`--ctime-before -1y`
`--no-gitignore`	FLAG	Don't respect .gitignore	`--no-gitignore`
`--hidden`	FLAG	Include hidden files	`--hidden`
`--case-sensitive`	FLAG	Force case sensitivity	`--case-sensitive`
`--type`	CHAR	File type (f/d/l)	`--type f`
`--extension`	STR	File extension(s)	`--extension py`
`--exclude`	PATTERN	Exclude patterns	`--exclude "test"`
`--depth`	INT	Maximum directory depth	`--depth 3`
`--follow-symlinks`	FLAG	Follow symbolic links	`--follow-symlinks`

Additional options for search:

Option	Type	Description	Example
`--no-color`	FLAG	Disable colored output	`--no-color`

Size format examples:

Bytes: 1024 or "1024"
Kilobytes: 10k, 10K, 10kb, 10KB
Megabytes: 5m, 5M, 5mb, 5MB
Gigabytes: 2g, 2G, 2gb, 2GB
With decimals: 1.5M, 2.7G, 0.5K

Time format examples:

Relative: -30s, -5m, -2h, -7d, -2w, -1mo, -1y
ISO date: 2024-01-01, 2024-01-01T10:30:00
Natural: yesterday, today (converted to ISO dates)

Unix Pipeline Integration

vexy_glob works seamlessly with Unix pipelines:

# Count Python files
vexy_glob find "**/*.py" | wc -l

# Find Python files containing "async" and edit them
vexy_glob search "**/*.py" "async" --no-color | cut -d: -f1 | sort -u | xargs $EDITOR

# Find large log files and show their sizes
vexy_glob find "*.log" --min-size 100M | xargs ls -lh

# Search for TODOs and format as tasks
vexy_glob search "**/*.py" "TODO" --no-color | awk -F: '{print "- [ ] " $1 ":" $2 ": " $3}'

# Find duplicate file names
vexy_glob find "**/*" --type f | xargs -n1 basename | sort | uniq -d

# Create archive of recent changes
vexy_glob find "**/*" --mtime-after -7d --type f | tar -czf recent_changes.tar.gz -T -

# Find and replace across files
vexy_glob search "**/*.py" "OldClassName" --no-color | cut -d: -f1 | sort -u | xargs sed -i 's/OldClassName/NewClassName/g'

# Generate ctags for Python files
vexy_glob find "**/*.py" | ctags -L -

# Find empty directories
vexy_glob find "**" --type d | while read dir; do [ -z "$(ls -A "$dir")" ] && echo "$dir"; done

# Calculate total size of Python files
vexy_glob find "**/*.py" --type f | xargs stat -f%z | awk '{s+=$1} END {print s}' | numfmt --to=iec

Advanced CLI Patterns

# Monitor for file changes (poor man's watch)
while true; do
    clear
    echo "Files modified in last minute:"
    vexy_glob find "**/*" --mtime-after -1m --type f
    sleep 10
done

# Parallel processing with GNU parallel
vexy_glob find "**/*.jpg" | parallel -j4 convert {} {.}_thumb.jpg

# Create a file manifest with checksums
vexy_glob find "**/*" --type f | while read -r file; do
    echo "$(sha256sum "$file" | cut -d' ' -f1) $file"
done > manifest.txt

# Find files by content and show context
vexy_glob search "**/*.py" "class.*Error" --no-color | while IFS=: read -r file line rest; do
    echo "\n=== $file:$line ==="
    sed -n "$((line-2)),$((line+2))p" "$file"
done

Detailed Python API Reference

Core Functions

`vexy_glob.find()`

The main function for finding files and searching content.

Basic Syntax

def find(
    pattern: str = "*",
    root: Union[str, Path] = ".",
    *,
    content: Optional[str] = None,
    file_type: Optional[str] = None,
    extension: Optional[Union[str, List[str]]] = None,
    max_depth: Optional[int] = None,
    min_depth: int = 0,
    min_size: Optional[int] = None,
    max_size: Optional[int] = None,
    mtime_after: Optional[Union[float, int, str, datetime]] = None,
    mtime_before: Optional[Union[float, int, str, datetime]] = None,
    atime_after: Optional[Union[float, int, str, datetime]] = None,
    atime_before: Optional[Union[float, int, str, datetime]] = None,
    ctime_after: Optional[Union[float, int, str, datetime]] = None,
    ctime_before: Optional[Union[float, int, str, datetime]] = None,
    hidden: bool = False,
    ignore_git: bool = False,
    case_sensitive: Optional[bool] = None,
    follow_symlinks: bool = False,
    threads: Optional[int] = None,
    as_path: bool = False,
    as_list: bool = False,
    exclude: Optional[Union[str, List[str]]] = None,
) -> Union[Iterator[Union[str, Path, SearchResult]], List[Union[str, Path, SearchResult]]]:
    """Find files matching pattern with optional content search.
    
    Args:
        pattern: Glob pattern to match files (e.g., "**/*.py", "src/*.js")
        root: Root directory to start search from
        content: Regex pattern to search within files
        file_type: Filter by type - 'f' (file), 'd' (directory), 'l' (symlink)
        extension: File extension(s) to filter by (e.g., "py" or ["py", "pyi"])
        max_depth: Maximum directory depth to search
        min_depth: Minimum directory depth to search
        min_size: Minimum file size in bytes (or use parse_size())
        max_size: Maximum file size in bytes
        mtime_after: Files modified after this time
        mtime_before: Files modified before this time
        atime_after: Files accessed after this time
        atime_before: Files accessed before this time
        ctime_after: Files created after this time
        ctime_before: Files created before this time
        hidden: Include hidden files and directories
        ignore_git: Don't respect .gitignore files
        case_sensitive: Case sensitivity (None = smart case)
        follow_symlinks: Follow symbolic links
        threads: Number of threads (None = auto)
        as_path: Return Path objects instead of strings
        as_list: Return list instead of iterator
        exclude: Patterns to exclude from results
    
    Returns:
        Iterator or list of file paths (or SearchResult if content is specified)
    """

Basic Examples

import vexy_glob

# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)

# Find all files in the 'src' directory
for path in vexy_glob.find("src/**/*"):
    print(path)

# Get results as a list instead of iterator
python_files = vexy_glob.find("**/*.py", as_list=True)
print(f"Found {len(python_files)} Python files")

# Get results as Path objects
from pathlib import Path
for path in vexy_glob.find("**/*.md", as_path=True):
    print(path.stem)  # Path object methods available

Content Searching

To search for content within files, use the content parameter. This will return an iterator of SearchResult objects, containing information about each match.

import vexy_glob

for match in vexy_glob.find("*.py", content="import requests"):
    print(f"Found a match in {match.path} on line {match.line_number}:")
    print(f"  {match.line_text.strip()}")

SearchResult Object

The SearchResult object has the following attributes:

path: The path to the file containing the match.
line_number: The line number of the match (1-indexed).
line_text: The text of the line containing the match.
matches: A list of matched strings on the line.

Content Search Examples

# Simple text search
for match in vexy_glob.find("**/*.py", content="TODO"):
    print(f"{match.path}:{match.line_number}: {match.line_text.strip()}")

# Regex pattern search
for match in vexy_glob.find("**/*.py", content=r"def\s+\w+\(.*\):"):
    print(f"Function at {match.path}:{match.line_number}")

# Case-insensitive search
for match in vexy_glob.find("**/*.md", content="python", case_sensitive=False):
    print(match.path)

# Multiple pattern search with OR
for match in vexy_glob.find("**/*.py", content="import (os|sys|pathlib)"):
    print(f"{match.path}: imports {match.matches}")

Filtering Options

Size Filtering

vexy_glob supports human-readable size formats:

import vexy_glob

# Using parse_size() for readable formats
min_size = vexy_glob.parse_size("10K")   # 10 kilobytes
max_size = vexy_glob.parse_size("5.5M")  # 5.5 megabytes

for path in vexy_glob.find("**/*", min_size=min_size, max_size=max_size):
    print(path)

# Supported formats:
# - Bytes: "1024" or 1024
# - Kilobytes: "10K", "10KB", "10k", "10kb"
# - Megabytes: "5M", "5MB", "5m", "5mb"
# - Gigabytes: "2G", "2GB", "2g", "2gb"
# - Decimal: "1.5M", "2.7G"

Time Filtering

vexy_glob accepts multiple time formats:

import vexy_glob
from datetime import datetime, timedelta

# 1. Relative time formats
for path in vexy_glob.find("**/*.log", mtime_after="-1d"):     # Last 24 hours
    print(path)

# Supported relative formats:
# - Seconds: "-30s" or "-30"
# - Minutes: "-5m"
# - Hours: "-2h"
# - Days: "-7d"
# - Weeks: "-2w"
# - Months: "-1mo" (30 days)
# - Years: "-1y" (365 days)

# 2. ISO date formats
for path in vexy_glob.find("**/*", mtime_after="2024-01-01"):
    print(path)

# Supported ISO formats:
# - Date: "2024-01-01"
# - DateTime: "2024-01-01T10:30:00"
# - With timezone: "2024-01-01T10:30:00Z"

# 3. Python datetime objects
week_ago = datetime.now() - timedelta(weeks=1)
for path in vexy_glob.find("**/*", mtime_after=week_ago):
    print(path)

# 4. Unix timestamps
import time
hour_ago = time.time() - 3600
for path in vexy_glob.find("**/*", mtime_after=hour_ago):
    print(path)

# Combining time filters
for path in vexy_glob.find(
    "**/*.py",
    mtime_after="-30d",      # Modified within 30 days
    mtime_before="-1d"       # But not in the last 24 hours
):
    print(path)

Type and Extension Filtering

import vexy_glob

# Filter by file type
for path in vexy_glob.find("**/*", file_type="d"):  # Directories only
    print(f"Directory: {path}")

# File types:
# - "f": Regular files
# - "d": Directories
# - "l": Symbolic links

# Filter by extension
for path in vexy_glob.find("**/*", extension="py"):
    print(path)

# Multiple extensions
for path in vexy_glob.find("**/*", extension=["py", "pyi", "pyx"]):
    print(path)

Exclusion Patterns

import vexy_glob

# Exclude single pattern
for path in vexy_glob.find("**/*.py", exclude="*test*"):
    print(path)

# Exclude multiple patterns
exclusions = [
    "**/__pycache__/**",
    "**/node_modules/**",
    "**/.git/**",
    "**/build/**",
    "**/dist/**"
]
for path in vexy_glob.find("**/*", exclude=exclusions):
    print(path)

# Exclude specific files
for path in vexy_glob.find(
    "**/*.py",
    exclude=["setup.py", "**/conftest.py", "**/*_test.py"]
):
    print(path)

Pattern Matching Guide

Glob Pattern Syntax

Pattern	Matches	Example
`*`	Any characters (except `/`)	`*.py` matches `test.py`
`**`	Any characters including `/`	`*/.py` matches `src/lib/test.py`
`?`	Single character	`test?.py` matches `test1.py`
`[seq]`	Character in sequence	`test[123].py` matches `test2.py`
`[!seq]`	Character not in sequence	`test[!0].py` matches `test1.py`
`{a,b}`	Either pattern a or b	`*.{py,js}` matches `.py` and `.js` files

Smart Case Detection

By default, vexy_glob uses smart case detection:

If pattern contains uppercase → case-sensitive
If pattern is all lowercase → case-insensitive

# Case-insensitive (finds README.md, readme.md, etc.)
vexy_glob.find("readme.md")

# Case-sensitive (only finds README.md)
vexy_glob.find("README.md")

# Force case sensitivity
vexy_glob.find("readme.md", case_sensitive=True)

Drop-in Replacements

vexy_glob provides drop-in replacements for standard library functions:

# Replace glob.glob()
import vexy_glob
files = vexy_glob.glob("**/*.py", recursive=True)

# Replace glob.iglob()
for path in vexy_glob.iglob("**/*.py", recursive=True):
    print(path)

# Migration from standard library
# OLD:
import glob
files = glob.glob("**/*.py", recursive=True)

# NEW: Just change the import!
import vexy_glob as glob
files = glob.glob("**/*.py", recursive=True)  # 10-100x faster!

Performance

Benchmark Results

Benchmarks on a directory with 100,000 files:

Operation	`glob.glob()`	`pathlib`	`vexy_glob`	Speedup
Find all `.py` files	15.2s	18.1s	0.2s	76x
Time to first result	15.2s	18.1s	0.005s	3040x
Memory usage	1.2GB	1.5GB	45MB	27x less
With .gitignore	N/A	N/A	0.15s	N/A

Performance Characteristics

Linear scaling: Performance scales linearly with file count
I/O bound: SSD vs HDD makes a significant difference
Cache friendly: Repeated searches benefit from OS file cache
Memory constant: Uses ~45MB regardless of result count

Performance Tips

Use specific patterns: src/**/*.py is faster than **/*.py
Limit depth: Use max_depth when you know the structure
Exclude early: Use exclude patterns to skip large directories
Leverage .gitignore: Default behavior skips ignored files

Cookbook - Real-World Examples

Working with Git Repositories

import vexy_glob

# Find all Python files, respecting .gitignore (default behavior)
for path in vexy_glob.find("**/*.py"):
    print(path)

# Include files that are gitignored
for path in vexy_glob.find("**/*.py", ignore_git=True):
    print(path)

Finding Large Log Files

import vexy_glob

# Find log files larger than 100MB
for path in vexy_glob.find("**/*.log", min_size=vexy_glob.parse_size("100M")):
    size_mb = os.path.getsize(path) / 1024 / 1024
    print(f"{path}: {size_mb:.1f}MB")

# Find log files between 10MB and 1GB
for path in vexy_glob.find(
    "**/*.log",
    min_size=vexy_glob.parse_size("10M"),
    max_size=vexy_glob.parse_size("1G")
):
    print(path)

Finding Recently Modified Files

import vexy_glob
from datetime import datetime, timedelta

# Files modified in the last 24 hours
for path in vexy_glob.find("**/*", mtime_after="-1d"):
    print(path)

# Files modified between 1 and 7 days ago
for path in vexy_glob.find(
    "**/*",
    mtime_after="-7d",
    mtime_before="-1d"
):
    print(path)

# Files modified after a specific date
for path in vexy_glob.find("**/*", mtime_after="2024-01-01"):
    print(path)

Code Search - Finding TODOs and FIXMEs

import vexy_glob

# Find all TODO comments in Python files
for match in vexy_glob.find("**/*.py", content=r"TODO|FIXME"):
    print(f"{match.path}:{match.line_number}: {match.line_text.strip()}")

# Find specific function definitions
for match in vexy_glob.find("**/*.py", content=r"def\s+process_data"):
    print(f"Found function at {match.path}:{match.line_number}")

Finding Duplicate Files by Size

import vexy_glob
from collections import defaultdict

# Group files by size to find potential duplicates
size_groups = defaultdict(list)

for path in vexy_glob.find("**/*", file_type="f"):
    size = os.path.getsize(path)
    if size > 0:  # Skip empty files
        size_groups[size].append(path)

# Print potential duplicates
for size, paths in size_groups.items():
    if len(paths) > 1:
        print(f"\nPotential duplicates ({size} bytes):")
        for path in paths:
            print(f"  {path}")

Cleaning Build Artifacts

import vexy_glob
import os

# Find and remove Python cache files
cache_patterns = [
    "**/__pycache__/**",
    "**/*.pyc",
    "**/*.pyo",
    "**/.pytest_cache/**",
    "**/.mypy_cache/**"
]

for pattern in cache_patterns:
    for path in vexy_glob.find(pattern, hidden=True):
        if os.path.isfile(path):
            os.remove(path)
            print(f"Removed: {path}")
        elif os.path.isdir(path):
            shutil.rmtree(path)
            print(f"Removed directory: {path}")

Project Statistics

import vexy_glob
from collections import Counter
import os

# Count files by extension
extension_counts = Counter()

for path in vexy_glob.find("**/*", file_type="f"):
    ext = os.path.splitext(path)[1].lower()
    if ext:
        extension_counts[ext] += 1

# Print top 10 file types
print("Top 10 file types in project:")
for ext, count in extension_counts.most_common(10):
    print(f"  {ext}: {count} files")

# Advanced statistics
total_size = 0
file_count = 0
largest_file = None
largest_size = 0

for path in vexy_glob.find("**/*", file_type="f"):
    size = os.path.getsize(path)
    total_size += size
    file_count += 1
    if size > largest_size:
        largest_size = size
        largest_file = path

print(f"\nProject Statistics:")
print(f"Total files: {file_count:,}")
print(f"Total size: {total_size / 1024 / 1024:.1f} MB")
print(f"Average file size: {total_size / file_count / 1024:.1f} KB")
print(f"Largest file: {largest_file} ({largest_size / 1024 / 1024:.1f} MB)")

Integration with pandas

import vexy_glob
import pandas as pd
import os

# Create a DataFrame of all Python files with metadata
file_data = []

for path in vexy_glob.find("**/*.py"):
    stat = os.stat(path)
    file_data.append({
        'path': path,
        'size': stat.st_size,
        'modified': pd.Timestamp(stat.st_mtime, unit='s'),
        'lines': sum(1 for _ in open(path, 'r', errors='ignore'))
    })

df = pd.DataFrame(file_data)

# Analyze the data
print(f"Total Python files: {len(df)}")
print(f"Total lines of code: {df['lines'].sum():,}")
print(f"Average file size: {df['size'].mean():.0f} bytes")
print(f"\nLargest files:")
print(df.nlargest(5, 'size')[['path', 'size', 'lines']])

Parallel Processing Found Files

import vexy_glob
from concurrent.futures import ProcessPoolExecutor
import os

def process_file(path):
    """Process a single file (e.g., count lines)"""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return path, sum(1 for _ in f)
    except:
        return path, 0

# Process all Python files in parallel
with ProcessPoolExecutor() as executor:
    # Get all files as a list
    files = vexy_glob.find("**/*.py", as_list=True)
    
    # Process in parallel
    results = executor.map(process_file, files)
    
    # Collect results
    total_lines = 0
    for path, lines in results:
        total_lines += lines
        if lines > 1000:
            print(f"Large file: {path} ({lines} lines)")
    
    print(f"\nTotal lines of code: {total_lines:,}")

Migration Guide

Migrating from `glob`

# OLD: Using glob
import glob
import os

# Find all Python files
files = glob.glob("**/*.py", recursive=True)

# Filter by size manually
large_files = []
for f in files:
    if os.path.getsize(f) > 1024 * 1024:  # 1MB
        large_files.append(f)

# NEW: Using vexy_glob
import vexy_glob

# Find large Python files directly
large_files = vexy_glob.find("**/*.py", min_size=1024*1024, as_list=True)

Migrating from `pathlib`

# OLD: Using pathlib
from pathlib import Path

# Find all Python files
files = list(Path(".").rglob("*.py"))

# Filter by modification time manually
import datetime
recent = []
for f in files:
    if f.stat().st_mtime > (datetime.datetime.now() - datetime.timedelta(days=7)).timestamp():
        recent.append(f)

# NEW: Using vexy_glob
import vexy_glob

# Find recent Python files directly
recent = vexy_glob.find("**/*.py", mtime_after="-7d", as_path=True, as_list=True)

Migrating from `os.walk`

# OLD: Using os.walk
import os

# Find all .txt files
txt_files = []
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".txt"):
            txt_files.append(os.path.join(root, file))

# NEW: Using vexy_glob
import vexy_glob

# Much simpler and faster!
txt_files = vexy_glob.find("**/*.txt", as_list=True)

Development

This project is built with maturin - a tool for building and publishing Rust-based Python extensions.

Prerequisites

Python 3.8 or later
Rust toolchain (install from rustup.rs)
uv for fast Python package management (optional but recommended)

Setting Up Development Environment

# Clone the repository
git clone https://github.com/vexyart/vexy-glob.git
cd vexy-glob

# Set up a virtual environment (using uv for faster installation)
pip install uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
uv sync

# Build the Rust extension in development mode
python sync_version.py  # Sync version from git tags to Cargo.toml
maturin develop

# Run tests
pytest tests/

# Run benchmarks
pytest tests/test_benchmarks.py -v --benchmark-only

Building Release Artifacts

The project uses a streamlined build system with automatic versioning from git tags.

Quick Build

# Build both wheel and source distribution
./build.sh

This script will:

Sync the version from git tags to Cargo.toml
Build an optimized wheel for your platform
Build a source distribution (sdist)
Place all artifacts in the dist/ directory

Manual Build

# Ensure you have the latest tags
git fetch --tags

# Sync version to Cargo.toml
python sync_version.py

# Build wheel (platform-specific)
python -m maturin build --release -o dist/

# Build source distribution
python -m maturin sdist -o dist/

Build System Details

The project uses:

maturin as the build backend for creating Python wheels from Rust code
setuptools-scm for automatic versioning based on git tags
sync_version.py to synchronize versions between git tags and Cargo.toml

Key files:

pyproject.toml - Python project configuration with maturin as build backend
Cargo.toml - Rust project configuration
sync_version.py - Version synchronization script
build.sh - Convenience build script

Versioning

Versions are managed through git tags:

# Create a new version tag
git tag v1.0.4
git push origin v1.0.4

# Build with the new version
./build.sh

The version will be automatically detected and used for both the Python package and Rust crate.

Project Structure

vexy-glob/
├── src/                    # Rust source code
│   ├── lib.rs             # Main Rust library with PyO3 bindings
│   └── ...
├── vexy_glob/             # Python package
│   ├── __init__.py        # Python API wrapper
│   ├── __main__.py        # CLI implementation
│   └── ...
├── tests/                 # Python tests
│   ├── test_*.py          # Unit and integration tests
│   └── test_benchmarks.py # Performance benchmarks
├── Cargo.toml             # Rust project configuration
├── pyproject.toml         # Python project configuration
├── sync_version.py        # Version synchronization script
└── build.sh               # Build automation script

CI/CD

The project uses GitHub Actions for continuous integration:

Testing on Linux, macOS, and Windows
Python versions 3.8 through 3.12
Automatic wheel building for releases
Cross-platform compatibility testing

Exceptions and Error Handling

Exception Hierarchy

VexyGlobError(Exception)
├── PatternError(VexyGlobError, ValueError)
│   └── Raised for invalid glob patterns
├── SearchError(VexyGlobError, IOError)  
│   └── Raised for I/O or permission errors
└── TraversalNotSupportedError(VexyGlobError, NotImplementedError)
    └── Raised for unsupported operations

Error Handling Examples

import vexy_glob
from vexy_glob import VexyGlobError, PatternError, SearchError

try:
    # Invalid pattern
    for path in vexy_glob.find("[invalid"):
        print(path)
except PatternError as e:
    print(f"Invalid pattern: {e}")

try:
    # Permission denied or I/O error
    for path in vexy_glob.find("**/*", root="/root"):
        print(path)
except SearchError as e:
    print(f"Search failed: {e}")

# Handle any vexy_glob error
try:
    results = vexy_glob.find("**/*.py", content="[invalid regex")
except VexyGlobError as e:
    print(f"Operation failed: {e}")

Platform-Specific Considerations

Windows

Use forward slashes / in patterns (automatically converted)
Hidden files: Files with hidden attribute are included with hidden=True
Case sensitivity: Windows is case-insensitive by default

# Windows-specific examples
import vexy_glob

# These are equivalent on Windows
vexy_glob.find("C:/Users/*/Documents/*.docx")
vexy_glob.find("C:\\Users\\*\\Documents\\*.docx")  # Also works

# Find hidden files on Windows
for path in vexy_glob.find("**/*", hidden=True):
    print(path)

macOS

.DS_Store files are excluded by default (via .gitignore)
Case sensitivity depends on file system (usually case-insensitive)

# macOS-specific examples
import vexy_glob

# Exclude .DS_Store and other macOS metadata
for path in vexy_glob.find("**/*", exclude=["**/.DS_Store", "**/.Spotlight-V100", "**/.Trashes"]):
    print(path)

Linux

Always case-sensitive
Hidden files start with .
Respects standard Unix permissions

# Linux-specific examples
import vexy_glob

# Find files in home directory config
for path in vexy_glob.find("~/.config/**/*.conf", hidden=True):
    print(path)

Troubleshooting

Common Issues

1. No results found

# Check if you need hidden files
results = list(vexy_glob.find("*"))
if not results:
    # Try with hidden files
    results = list(vexy_glob.find("*", hidden=True))

# Check if .gitignore is excluding files
results = list(vexy_glob.find("**/*.py", ignore_git=True))

2. Pattern not matching expected files

# Debug pattern matching
import vexy_glob

# Too specific?
print(list(vexy_glob.find("src/lib/test.py")))  # Only exact match

# Use wildcards
print(list(vexy_glob.find("src/**/test.py")))   # Any depth
print(list(vexy_glob.find("src/*/test.py")))    # One level only

3. Content search not finding matches

# Check regex syntax
import vexy_glob

# Wrong: Python regex syntax
results = vexy_glob.find("**/*.py", content=r"import\s+{re,os}")

# Correct: Standard regex
results = vexy_glob.find("**/*.py", content=r"import\s+(re|os)")

# Case sensitivity
results = vexy_glob.find("**/*.py", content="TODO", case_sensitive=False)

4. Performance issues

# Optimize your search
import vexy_glob

# Slow: Searching everything
for path in vexy_glob.find("**/*.py", content="import"):
    print(path)

# Fast: Limit scope
for path in vexy_glob.find("src/**/*.py", content="import", max_depth=3):
    print(path)

# Use exclusions
for path in vexy_glob.find(
    "**/*.py",
    exclude=["**/node_modules/**", "**/.venv/**", "**/build/**"]
):
    print(path)

Build Issues

If you encounter build issues:

Rust not found: Install Rust from rustup.rs
maturin not found: Run pip install maturin
Version mismatch: Run python sync_version.py to sync versions
Import errors: Ensure you've run maturin develop after changes
Build fails: Check that you have the latest Rust stable toolchain

Debug Mode

import vexy_glob
import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

# This will show internal operations
for path in vexy_glob.find("**/*.py"):
    print(path)

FAQ

Q: Why is vexy_glob so much faster than glob?

A: vexy_glob uses Rust's parallel directory traversal, releases Python's GIL, and streams results as they're found instead of collecting everything first.

Q: Does vexy_glob follow symbolic links?

A: By default, no. Use follow_symlinks=True to enable. Loop detection is built-in.

Q: Can I use vexy_glob with async/await?

A: Yes! Use it with asyncio.to_thread():

import asyncio
import vexy_glob

async def find_files():
    return await asyncio.to_thread(
        vexy_glob.find, "**/*.py", as_list=True
    )

Q: How do I search in multiple directories?

A: Call find() multiple times or use a common parent:

# Option 1: Multiple calls
results = []
for root in ["src", "tests", "docs"]:
    results.extend(vexy_glob.find("**/*.py", root=root, as_list=True))

# Option 2: Common parent with specific patterns
results = vexy_glob.find("{src,tests,docs}/**/*.py", as_list=True)

Q: Is the content search as powerful as ripgrep?

A: Yes! It uses the same grep-searcher crate that powers ripgrep, including SIMD optimizations.

Advanced Configuration

Custom Ignore Files

import vexy_glob

# By default, respects .gitignore
for path in vexy_glob.find("**/*.py"):
    print(path)

# Also respects .ignore and .fdignore files
# Create .ignore in your project root:
# echo "test_*.py" > .ignore

# Now test files will be excluded
for path in vexy_glob.find("**/*.py"):
    print(path)  # test_*.py files excluded

Thread Configuration

import vexy_glob
import os

# Auto-detect (default)
for path in vexy_glob.find("**/*.py"):
    pass

# Limit threads for CPU-bound operations
for match in vexy_glob.find("**/*.py", content="TODO", threads=2):
    pass

# Max parallelism for I/O-bound operations
cpu_count = os.cpu_count() or 4
for path in vexy_glob.find("**/*", threads=cpu_count * 2):
    pass

Contributing

We welcome contributions! Here's how to get started:

Fork the repository
Create a feature branch (git checkout -b feature-name)
Make your changes
Run tests (pytest tests/)
Format code (cargo fmt for Rust, ruff format for Python)
Commit with descriptive messages
Push and open a pull request

Before submitting:

Ensure all tests pass
Add tests for new functionality
Update documentation as needed
Follow existing code style

Running the Full Test Suite

# Python tests
pytest tests/ -v

# Python tests with coverage
pytest tests/ --cov=vexy_glob --cov-report=html

# Rust tests
cargo test

# Benchmarks
pytest tests/test_benchmarks.py -v --benchmark-only

# Linting
cargo clippy -- -D warnings
ruff check .

API Stability and Versioning

vexy_glob follows Semantic Versioning:

Major version (1.x.x): Breaking API changes
Minor version (x.1.x): New features, backwards compatible
Patch version (x.x.1): Bug fixes only

Stable API Guarantees

The following are guaranteed stable in 1.x:

find() function signature and basic parameters
glob() and iglob() compatibility functions
SearchResult object attributes
Exception hierarchy
CLI command structure

Experimental Features

Features marked experimental may change:

Thread count optimization algorithms
Internal buffer size tuning
Specific error message text

Performance Tuning Guide

For Maximum Speed

import vexy_glob

# 1. Be specific with patterns
# Slow:
vexy_glob.find("**/*.py")
# Fast:
vexy_glob.find("src/**/*.py")

# 2. Use depth limits when possible
vexy_glob.find("**/*.py", max_depth=3)

# 3. Exclude unnecessary directories
vexy_glob.find(
    "**/*.py",
    exclude=["**/venv/**", "**/node_modules/**", "**/.git/**"]
)

# 4. Use file type filters
vexy_glob.find("**/*.py", file_type="f")  # Skip directories

For Memory Efficiency

# Stream results instead of collecting
# Memory efficient:
for path in vexy_glob.find("**/*"):
    process(path)  # Process one at a time

# Memory intensive:
all_files = vexy_glob.find("**/*", as_list=True)  # Loads all in memory

For I/O Optimization

# Optimize thread count based on storage type
import vexy_glob

# SSD: More threads help
for path in vexy_glob.find("**/*", threads=8):
    pass

# HDD: Fewer threads to avoid seek thrashing
for path in vexy_glob.find("**/*", threads=2):
    pass

# Network storage: Single thread might be best
for path in vexy_glob.find("**/*", threads=1):
    pass

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Built on the excellent Rust crates:
- ignore - Fast directory traversal
- grep-searcher - High-performance text search
- globset - Efficient glob matching
Inspired by tools like fd and ripgrep
Thanks to the PyO3 team for excellent Python-Rust bindings

Related Projects

fd - A simple, fast alternative to find
ripgrep - Recursively search directories for a regex pattern
walkdir - Python's built-in directory traversal
scandir - Better directory iteration for Python

Happy fast file finding! 🚀

If you find vexy_glob useful, please consider giving it a star on GitHub!

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.9

Aug 4, 2025

1.0.3

Aug 3, 2025

0.1.0

Aug 3, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vexy_glob-1.0.9.tar.gz (268.8 kB view details)

Uploaded Aug 4, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl (1.1 MB view details)

Uploaded Aug 4, 2025 CPython 3.8+macOS 10.12+ x86-64

File details

Details for the file vexy_glob-1.0.9.tar.gz.

File metadata

Download URL: vexy_glob-1.0.9.tar.gz
Upload date: Aug 4, 2025
Size: 268.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for vexy_glob-1.0.9.tar.gz
Algorithm	Hash digest
SHA256	`e334f8fb78d0e269768c4b8537f699821611c76b2ab62cbdb3c2298715071a08`
MD5	`53bb38887335ff4d866f5383e520e427`
BLAKE2b-256	`25abbe754b19c7acea5ad55aa5311f4935ce96d38fb9b10b07ec799efefe6597`

See more details on using hashes here.

File details

Details for the file vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl.

File metadata

Download URL: vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl
Upload date: Aug 4, 2025
Size: 1.1 MB
Tags: CPython 3.8+, macOS 10.12+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: python-httpx/0.28.1

File hashes

Hashes for vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm	Hash digest
SHA256	`fafc08862efcea87b309525ba0a47f1c827b969ef55ba7f5bdb9a91b59a9a324`
MD5	`4141c35cae85e8712e18b666c684902f`
BLAKE2b-256	`52a25d3c74a12fa93f6567e4c3d69c255b7cdcba58e6971f25c7b7672a288f53`

See more details on using hashes here.

vexy-glob 1.0.9

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

vexy_glob - Path Accelerated Finding in Rust

TL;DR

What is vexy_glob?

Architecture Overview

Key Features

Feature Comparison

How it Works

Why use vexy_glob?

When to Use vexy_glob vs Alternatives

Installation and Usage

Python Library

Command-Line Interface

Finding Files

Searching Content

Command-Line Options Reference

Unix Pipeline Integration

Advanced CLI Patterns

Detailed Python API Reference

Core Functions

Core Functions

vexy_glob.find()

Basic Syntax

Basic Examples

Content Searching

SearchResult Object

Content Search Examples

Filtering Options

Size Filtering

Time Filtering

Type and Extension Filtering

Exclusion Patterns

Pattern Matching Guide

Glob Pattern Syntax

Smart Case Detection

Drop-in Replacements

Performance

Benchmark Results

Performance Characteristics

Performance Tips

Cookbook - Real-World Examples

Working with Git Repositories

Finding Large Log Files

Finding Recently Modified Files

Code Search - Finding TODOs and FIXMEs

Finding Duplicate Files by Size

Cleaning Build Artifacts

Project Statistics

Integration with pandas

Parallel Processing Found Files

Migration Guide

Migrating from glob

Migrating from pathlib

Migrating from os.walk

Development

Prerequisites

Setting Up Development Environment

Building Release Artifacts

Quick Build

Manual Build

Build System Details

Versioning

Project Structure

CI/CD

Exceptions and Error Handling

Exception Hierarchy

Error Handling Examples

Platform-Specific Considerations

Windows

macOS

Linux

Troubleshooting

What is `vexy_glob`?

Why use `vexy_glob`?

`vexy_glob.find()`

Migrating from `glob`

Migrating from `pathlib`

Migrating from `os.walk`