Skip to main content

Vexy Glob fast file finding

Project description

vexy_glob - Path Accelerated Finding in Rust

PyPI version CI codecov

vexy_glob is a high-performance Python extension for file system traversal and content searching, built with Rust. It provides a faster and more feature-rich alternative to Python's built-in glob (up to 6x faster) and pathlib (up to 12x faster) modules.

TL;DR

Installation:

pip install vexy_glob

Quick Start:

Find all Python files in the current directory and its subdirectories:

import vexy_glob

for path in vexy_glob.find("**/*.py"):
    print(path)

Find all files containing the text "import asyncio":

for match in vexy_glob.find("**/*.py", content="import asyncio"):
    print(f"{match.path}:{match.line_number}: {match.line_text}")

What is vexy_glob?

vexy_glob is a Python library that provides a powerful and efficient way to find files and search for content within them. It's built on top of the excellent Rust crates ignore (for file traversal) and grep-searcher (for content searching), which are the same engines powering tools like fd and ripgrep.

This means you get the speed and efficiency of Rust, with the convenience and ease of use of Python.

Architecture Overview

┌─────────────────────┐
│   Python API Layer  │  ← Your Python code calls vexy_glob.find()
├─────────────────────┤
│    PyO3 Bindings    │  ← Zero-copy conversions between Python/Rust
├─────────────────────┤
│  Rust Core Engine   │  ← GIL released for true parallelism
│  ┌───────────────┐  │
│  │ ignore crate  │  │  ← Parallel directory traversal
│  │ (from fd)     │  │     Respects .gitignore files
│  └───────────────┘  │
│  ┌───────────────┐  │
│  │ grep-searcher │  │  ← High-speed content search
│  │ (from ripgrep)│  │     SIMD-accelerated regex
│  └───────────────┘  │
├─────────────────────┤
│ Streaming Channel   │  ← Results yielded as found
│ (crossbeam-channel) │     No memory accumulation
└─────────────────────┘

Key Features

  • 🚀 Blazing Fast: 10-100x faster than Python's glob and pathlib for many use cases.
  • ⚡ Streaming Results: Get the first results in milliseconds, without waiting for the entire file system scan to complete.
  • 💾 Memory Efficient: vexy_glob uses constant memory, regardless of the number of files or results.
  • 🔥 Parallel Execution: Utilizes all your CPU cores to get the job done as quickly as possible.
  • 🔍 Content Searching: Ripgrep-style content searching with regex support.
  • 🎯 Rich Filtering: Filter files by size, modification time, and more.
  • 🧠 Smart Defaults: Automatically respects .gitignore files and skips hidden files and directories.
  • 🌍 Cross-Platform: Works on Linux, macOS, and Windows.

Feature Comparison

Feature glob.glob() pathlib vexy_glob
Pattern matching ✅ Basic ✅ Basic ✅ Advanced
Recursive search ✅ Slow ✅ Slow ✅ Fast
Streaming results
Content search
.gitignore respect
Parallel execution
Size filtering
Time filtering
Memory efficiency

How it Works

vexy_glob uses a Rust-powered backend to perform the heavy lifting of file system traversal and content searching. The Rust extension releases Python's Global Interpreter Lock (GIL), allowing for true parallelism and a significant performance boost.

Results are streamed back to Python as they are found, using a producer-consumer architecture with crossbeam channels. This means you can start processing results immediately, without having to wait for the entire search to finish.

Why use vexy_glob?

If you find yourself writing scripts that need to find files based on patterns, or search for content within files, vexy_glob can be a game-changer. It's particularly useful for:

  • Large codebases: Quickly find files or code snippets in large projects.
  • Log file analysis: Search through gigabytes of logs in seconds.
  • Data processing pipelines: Efficiently find and process files based on various criteria.
  • Build systems: Fast dependency scanning and file collection.
  • Data science: Quickly locate and process data files.
  • DevOps: Log analysis, configuration management, deployment scripts.
  • Testing: Find test files, fixtures, and coverage reports.
  • Anywhere you need to find files fast!

When to Use vexy_glob vs Alternatives

Use Case Best Tool Why
Simple pattern in small directory glob.glob() Built-in, no dependencies
Large directory, need first result fast vexy_glob Streaming results
Search file contents vexy_glob Integrated content search
Complex filtering (size, time, etc.) vexy_glob Rich filtering API
Cross-platform scripts vexy_glob Consistent behavior
Git-aware file finding vexy_glob Respects .gitignore
Memory-constrained environment vexy_glob Constant memory usage

Installation and Usage

Python Library

Install vexy_glob using pip:

pip install vexy_glob

Then use it in your Python code:

import vexy_glob

# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)

Command-Line Interface

vexy_glob also provides a powerful command-line interface for finding files and searching content directly from your terminal.

Finding Files

Use vexy_glob find to locate files matching glob patterns:

# Find all Python files
vexy_glob find "**/*.py"

# Find all markdown files larger than 10KB
vexy_glob find "**/*.md" --min-size 10k

# Find all log files modified in the last 2 days
vexy_glob find "*.log" --mtime-after -2d

# Find only directories
vexy_glob find "*" --type d

# Include hidden files
vexy_glob find "*" --hidden

# Limit search depth
vexy_glob find "**/*.txt" --depth 2

Searching Content

Use vexy_glob search to find content within files:

# Search for "import asyncio" in Python files
vexy_glob search "**/*.py" "import asyncio"

# Search for function definitions using regex
vexy_glob search "src/**/*.rs" "fn\\s+\\w+"

# Search without color output (for piping)
vexy_glob search "**/*.md" "TODO|FIXME" --no-color

# Case-sensitive search
vexy_glob search "*.txt" "Error" --case-sensitive

# Search with size filters
vexy_glob search "**/*.log" "ERROR" --min-size 1M --max-size 100M

# Search recent files only
vexy_glob search "**/*.py" "TODO" --mtime-after -7d

# Complex search with multiple filters
vexy_glob search "src/**/*.{py,js}" "console\.log|print\(" \
    --exclude "*test*" \
    --mtime-after -30d \
    --max-size 50k

Command-Line Options Reference

Common options for both find and search:

Option Type Description Example
--root PATH Root directory to start search --root /home/user/projects
--min-size SIZE Minimum file size --min-size 10k
--max-size SIZE Maximum file size --max-size 5M
--mtime-after TIME Modified after this time --mtime-after -7d
--mtime-before TIME Modified before this time --mtime-before 2024-01-01
--atime-after TIME Accessed after this time --atime-after -1h
--atime-before TIME Accessed before this time --atime-before -30d
--ctime-after TIME Created after this time --ctime-after -1w
--ctime-before TIME Created before this time --ctime-before -1y
--no-gitignore FLAG Don't respect .gitignore --no-gitignore
--hidden FLAG Include hidden files --hidden
--case-sensitive FLAG Force case sensitivity --case-sensitive
--type CHAR File type (f/d/l) --type f
--extension STR File extension(s) --extension py
--exclude PATTERN Exclude patterns --exclude "*test*"
--depth INT Maximum directory depth --depth 3
--follow-symlinks FLAG Follow symbolic links --follow-symlinks

Additional options for search:

Option Type Description Example
--no-color FLAG Disable colored output --no-color

Size format examples:

  • Bytes: 1024 or "1024"
  • Kilobytes: 10k, 10K, 10kb, 10KB
  • Megabytes: 5m, 5M, 5mb, 5MB
  • Gigabytes: 2g, 2G, 2gb, 2GB
  • With decimals: 1.5M, 2.7G, 0.5K

Time format examples:

  • Relative: -30s, -5m, -2h, -7d, -2w, -1mo, -1y
  • ISO date: 2024-01-01, 2024-01-01T10:30:00
  • Natural: yesterday, today (converted to ISO dates)

Unix Pipeline Integration

vexy_glob works seamlessly with Unix pipelines:

# Count Python files
vexy_glob find "**/*.py" | wc -l

# Find Python files containing "async" and edit them
vexy_glob search "**/*.py" "async" --no-color | cut -d: -f1 | sort -u | xargs $EDITOR

# Find large log files and show their sizes
vexy_glob find "*.log" --min-size 100M | xargs ls -lh

# Search for TODOs and format as tasks
vexy_glob search "**/*.py" "TODO" --no-color | awk -F: '{print "- [ ] " $1 ":" $2 ": " $3}'

# Find duplicate file names
vexy_glob find "**/*" --type f | xargs -n1 basename | sort | uniq -d

# Create archive of recent changes
vexy_glob find "**/*" --mtime-after -7d --type f | tar -czf recent_changes.tar.gz -T -

# Find and replace across files
vexy_glob search "**/*.py" "OldClassName" --no-color | cut -d: -f1 | sort -u | xargs sed -i 's/OldClassName/NewClassName/g'

# Generate ctags for Python files
vexy_glob find "**/*.py" | ctags -L -

# Find empty directories
vexy_glob find "**" --type d | while read dir; do [ -z "$(ls -A "$dir")" ] && echo "$dir"; done

# Calculate total size of Python files
vexy_glob find "**/*.py" --type f | xargs stat -f%z | awk '{s+=$1} END {print s}' | numfmt --to=iec

Advanced CLI Patterns

# Monitor for file changes (poor man's watch)
while true; do
    clear
    echo "Files modified in last minute:"
    vexy_glob find "**/*" --mtime-after -1m --type f
    sleep 10
done

# Parallel processing with GNU parallel
vexy_glob find "**/*.jpg" | parallel -j4 convert {} {.}_thumb.jpg

# Create a file manifest with checksums
vexy_glob find "**/*" --type f | while read -r file; do
    echo "$(sha256sum "$file" | cut -d' ' -f1) $file"
done > manifest.txt

# Find files by content and show context
vexy_glob search "**/*.py" "class.*Error" --no-color | while IFS=: read -r file line rest; do
    echo "\n=== $file:$line ==="
    sed -n "$((line-2)),$((line+2))p" "$file"
done

Detailed Python API Reference

Core Functions

Core Functions

vexy_glob.find()

The main function for finding files and searching content.

Basic Syntax
def find(
    pattern: str = "*",
    root: Union[str, Path] = ".",
    *,
    content: Optional[str] = None,
    file_type: Optional[str] = None,
    extension: Optional[Union[str, List[str]]] = None,
    max_depth: Optional[int] = None,
    min_depth: int = 0,
    min_size: Optional[int] = None,
    max_size: Optional[int] = None,
    mtime_after: Optional[Union[float, int, str, datetime]] = None,
    mtime_before: Optional[Union[float, int, str, datetime]] = None,
    atime_after: Optional[Union[float, int, str, datetime]] = None,
    atime_before: Optional[Union[float, int, str, datetime]] = None,
    ctime_after: Optional[Union[float, int, str, datetime]] = None,
    ctime_before: Optional[Union[float, int, str, datetime]] = None,
    hidden: bool = False,
    ignore_git: bool = False,
    case_sensitive: Optional[bool] = None,
    follow_symlinks: bool = False,
    threads: Optional[int] = None,
    as_path: bool = False,
    as_list: bool = False,
    exclude: Optional[Union[str, List[str]]] = None,
) -> Union[Iterator[Union[str, Path, SearchResult]], List[Union[str, Path, SearchResult]]]:
    """Find files matching pattern with optional content search.
    
    Args:
        pattern: Glob pattern to match files (e.g., "**/*.py", "src/*.js")
        root: Root directory to start search from
        content: Regex pattern to search within files
        file_type: Filter by type - 'f' (file), 'd' (directory), 'l' (symlink)
        extension: File extension(s) to filter by (e.g., "py" or ["py", "pyi"])
        max_depth: Maximum directory depth to search
        min_depth: Minimum directory depth to search
        min_size: Minimum file size in bytes (or use parse_size())
        max_size: Maximum file size in bytes
        mtime_after: Files modified after this time
        mtime_before: Files modified before this time
        atime_after: Files accessed after this time
        atime_before: Files accessed before this time
        ctime_after: Files created after this time
        ctime_before: Files created before this time
        hidden: Include hidden files and directories
        ignore_git: Don't respect .gitignore files
        case_sensitive: Case sensitivity (None = smart case)
        follow_symlinks: Follow symbolic links
        threads: Number of threads (None = auto)
        as_path: Return Path objects instead of strings
        as_list: Return list instead of iterator
        exclude: Patterns to exclude from results
    
    Returns:
        Iterator or list of file paths (or SearchResult if content is specified)
    """
Basic Examples
import vexy_glob

# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)

# Find all files in the 'src' directory
for path in vexy_glob.find("src/**/*"):
    print(path)

# Get results as a list instead of iterator
python_files = vexy_glob.find("**/*.py", as_list=True)
print(f"Found {len(python_files)} Python files")

# Get results as Path objects
from pathlib import Path
for path in vexy_glob.find("**/*.md", as_path=True):
    print(path.stem)  # Path object methods available

Content Searching

To search for content within files, use the content parameter. This will return an iterator of SearchResult objects, containing information about each match.

import vexy_glob

for match in vexy_glob.find("*.py", content="import requests"):
    print(f"Found a match in {match.path} on line {match.line_number}:")
    print(f"  {match.line_text.strip()}")

SearchResult Object

The SearchResult object has the following attributes:

  • path: The path to the file containing the match.
  • line_number: The line number of the match (1-indexed).
  • line_text: The text of the line containing the match.
  • matches: A list of matched strings on the line.

Content Search Examples

# Simple text search
for match in vexy_glob.find("**/*.py", content="TODO"):
    print(f"{match.path}:{match.line_number}: {match.line_text.strip()}")

# Regex pattern search
for match in vexy_glob.find("**/*.py", content=r"def\s+\w+\(.*\):"):
    print(f"Function at {match.path}:{match.line_number}")

# Case-insensitive search
for match in vexy_glob.find("**/*.md", content="python", case_sensitive=False):
    print(match.path)

# Multiple pattern search with OR
for match in vexy_glob.find("**/*.py", content="import (os|sys|pathlib)"):
    print(f"{match.path}: imports {match.matches}")

Filtering Options

Size Filtering

vexy_glob supports human-readable size formats:

import vexy_glob

# Using parse_size() for readable formats
min_size = vexy_glob.parse_size("10K")   # 10 kilobytes
max_size = vexy_glob.parse_size("5.5M")  # 5.5 megabytes

for path in vexy_glob.find("**/*", min_size=min_size, max_size=max_size):
    print(path)

# Supported formats:
# - Bytes: "1024" or 1024
# - Kilobytes: "10K", "10KB", "10k", "10kb"
# - Megabytes: "5M", "5MB", "5m", "5mb"
# - Gigabytes: "2G", "2GB", "2g", "2gb"
# - Decimal: "1.5M", "2.7G"

Time Filtering

vexy_glob accepts multiple time formats:

import vexy_glob
from datetime import datetime, timedelta

# 1. Relative time formats
for path in vexy_glob.find("**/*.log", mtime_after="-1d"):     # Last 24 hours
    print(path)

# Supported relative formats:
# - Seconds: "-30s" or "-30"
# - Minutes: "-5m"
# - Hours: "-2h"
# - Days: "-7d"
# - Weeks: "-2w"
# - Months: "-1mo" (30 days)
# - Years: "-1y" (365 days)

# 2. ISO date formats
for path in vexy_glob.find("**/*", mtime_after="2024-01-01"):
    print(path)

# Supported ISO formats:
# - Date: "2024-01-01"
# - DateTime: "2024-01-01T10:30:00"
# - With timezone: "2024-01-01T10:30:00Z"

# 3. Python datetime objects
week_ago = datetime.now() - timedelta(weeks=1)
for path in vexy_glob.find("**/*", mtime_after=week_ago):
    print(path)

# 4. Unix timestamps
import time
hour_ago = time.time() - 3600
for path in vexy_glob.find("**/*", mtime_after=hour_ago):
    print(path)

# Combining time filters
for path in vexy_glob.find(
    "**/*.py",
    mtime_after="-30d",      # Modified within 30 days
    mtime_before="-1d"       # But not in the last 24 hours
):
    print(path)

Type and Extension Filtering

import vexy_glob

# Filter by file type
for path in vexy_glob.find("**/*", file_type="d"):  # Directories only
    print(f"Directory: {path}")

# File types:
# - "f": Regular files
# - "d": Directories
# - "l": Symbolic links

# Filter by extension
for path in vexy_glob.find("**/*", extension="py"):
    print(path)

# Multiple extensions
for path in vexy_glob.find("**/*", extension=["py", "pyi", "pyx"]):
    print(path)

Exclusion Patterns

import vexy_glob

# Exclude single pattern
for path in vexy_glob.find("**/*.py", exclude="*test*"):
    print(path)

# Exclude multiple patterns
exclusions = [
    "**/__pycache__/**",
    "**/node_modules/**",
    "**/.git/**",
    "**/build/**",
    "**/dist/**"
]
for path in vexy_glob.find("**/*", exclude=exclusions):
    print(path)

# Exclude specific files
for path in vexy_glob.find(
    "**/*.py",
    exclude=["setup.py", "**/conftest.py", "**/*_test.py"]
):
    print(path)

Pattern Matching Guide

Glob Pattern Syntax

Pattern Matches Example
* Any characters (except /) *.py matches test.py
** Any characters including / **/*.py matches src/lib/test.py
? Single character test?.py matches test1.py
[seq] Character in sequence test[123].py matches test2.py
[!seq] Character not in sequence test[!0].py matches test1.py
{a,b} Either pattern a or b *.{py,js} matches .py and .js files

Smart Case Detection

By default, vexy_glob uses smart case detection:

  • If pattern contains uppercase → case-sensitive
  • If pattern is all lowercase → case-insensitive
# Case-insensitive (finds README.md, readme.md, etc.)
vexy_glob.find("readme.md")

# Case-sensitive (only finds README.md)
vexy_glob.find("README.md")

# Force case sensitivity
vexy_glob.find("readme.md", case_sensitive=True)

Drop-in Replacements

vexy_glob provides drop-in replacements for standard library functions:

# Replace glob.glob()
import vexy_glob
files = vexy_glob.glob("**/*.py", recursive=True)

# Replace glob.iglob()
for path in vexy_glob.iglob("**/*.py", recursive=True):
    print(path)

# Migration from standard library
# OLD:
import glob
files = glob.glob("**/*.py", recursive=True)

# NEW: Just change the import!
import vexy_glob as glob
files = glob.glob("**/*.py", recursive=True)  # 10-100x faster!

Performance

Benchmark Results

Benchmarks on a directory with 100,000 files:

Operation glob.glob() pathlib vexy_glob Speedup
Find all .py files 15.2s 18.1s 0.2s 76x
Time to first result 15.2s 18.1s 0.005s 3040x
Memory usage 1.2GB 1.5GB 45MB 27x less
With .gitignore N/A N/A 0.15s N/A

Performance Characteristics

  • Linear scaling: Performance scales linearly with file count
  • I/O bound: SSD vs HDD makes a significant difference
  • Cache friendly: Repeated searches benefit from OS file cache
  • Memory constant: Uses ~45MB regardless of result count

Performance Tips

  1. Use specific patterns: src/**/*.py is faster than **/*.py
  2. Limit depth: Use max_depth when you know the structure
  3. Exclude early: Use exclude patterns to skip large directories
  4. Leverage .gitignore: Default behavior skips ignored files

Cookbook - Real-World Examples

Working with Git Repositories

import vexy_glob

# Find all Python files, respecting .gitignore (default behavior)
for path in vexy_glob.find("**/*.py"):
    print(path)

# Include files that are gitignored
for path in vexy_glob.find("**/*.py", ignore_git=True):
    print(path)

Finding Large Log Files

import vexy_glob

# Find log files larger than 100MB
for path in vexy_glob.find("**/*.log", min_size=vexy_glob.parse_size("100M")):
    size_mb = os.path.getsize(path) / 1024 / 1024
    print(f"{path}: {size_mb:.1f}MB")

# Find log files between 10MB and 1GB
for path in vexy_glob.find(
    "**/*.log",
    min_size=vexy_glob.parse_size("10M"),
    max_size=vexy_glob.parse_size("1G")
):
    print(path)

Finding Recently Modified Files

import vexy_glob
from datetime import datetime, timedelta

# Files modified in the last 24 hours
for path in vexy_glob.find("**/*", mtime_after="-1d"):
    print(path)

# Files modified between 1 and 7 days ago
for path in vexy_glob.find(
    "**/*",
    mtime_after="-7d",
    mtime_before="-1d"
):
    print(path)

# Files modified after a specific date
for path in vexy_glob.find("**/*", mtime_after="2024-01-01"):
    print(path)

Code Search - Finding TODOs and FIXMEs

import vexy_glob

# Find all TODO comments in Python files
for match in vexy_glob.find("**/*.py", content=r"TODO|FIXME"):
    print(f"{match.path}:{match.line_number}: {match.line_text.strip()}")

# Find specific function definitions
for match in vexy_glob.find("**/*.py", content=r"def\s+process_data"):
    print(f"Found function at {match.path}:{match.line_number}")

Finding Duplicate Files by Size

import vexy_glob
from collections import defaultdict

# Group files by size to find potential duplicates
size_groups = defaultdict(list)

for path in vexy_glob.find("**/*", file_type="f"):
    size = os.path.getsize(path)
    if size > 0:  # Skip empty files
        size_groups[size].append(path)

# Print potential duplicates
for size, paths in size_groups.items():
    if len(paths) > 1:
        print(f"\nPotential duplicates ({size} bytes):")
        for path in paths:
            print(f"  {path}")

Cleaning Build Artifacts

import vexy_glob
import os

# Find and remove Python cache files
cache_patterns = [
    "**/__pycache__/**",
    "**/*.pyc",
    "**/*.pyo",
    "**/.pytest_cache/**",
    "**/.mypy_cache/**"
]

for pattern in cache_patterns:
    for path in vexy_glob.find(pattern, hidden=True):
        if os.path.isfile(path):
            os.remove(path)
            print(f"Removed: {path}")
        elif os.path.isdir(path):
            shutil.rmtree(path)
            print(f"Removed directory: {path}")

Project Statistics

import vexy_glob
from collections import Counter
import os

# Count files by extension
extension_counts = Counter()

for path in vexy_glob.find("**/*", file_type="f"):
    ext = os.path.splitext(path)[1].lower()
    if ext:
        extension_counts[ext] += 1

# Print top 10 file types
print("Top 10 file types in project:")
for ext, count in extension_counts.most_common(10):
    print(f"  {ext}: {count} files")

# Advanced statistics
total_size = 0
file_count = 0
largest_file = None
largest_size = 0

for path in vexy_glob.find("**/*", file_type="f"):
    size = os.path.getsize(path)
    total_size += size
    file_count += 1
    if size > largest_size:
        largest_size = size
        largest_file = path

print(f"\nProject Statistics:")
print(f"Total files: {file_count:,}")
print(f"Total size: {total_size / 1024 / 1024:.1f} MB")
print(f"Average file size: {total_size / file_count / 1024:.1f} KB")
print(f"Largest file: {largest_file} ({largest_size / 1024 / 1024:.1f} MB)")

Integration with pandas

import vexy_glob
import pandas as pd
import os

# Create a DataFrame of all Python files with metadata
file_data = []

for path in vexy_glob.find("**/*.py"):
    stat = os.stat(path)
    file_data.append({
        'path': path,
        'size': stat.st_size,
        'modified': pd.Timestamp(stat.st_mtime, unit='s'),
        'lines': sum(1 for _ in open(path, 'r', errors='ignore'))
    })

df = pd.DataFrame(file_data)

# Analyze the data
print(f"Total Python files: {len(df)}")
print(f"Total lines of code: {df['lines'].sum():,}")
print(f"Average file size: {df['size'].mean():.0f} bytes")
print(f"\nLargest files:")
print(df.nlargest(5, 'size')[['path', 'size', 'lines']])

Parallel Processing Found Files

import vexy_glob
from concurrent.futures import ProcessPoolExecutor
import os

def process_file(path):
    """Process a single file (e.g., count lines)"""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return path, sum(1 for _ in f)
    except:
        return path, 0

# Process all Python files in parallel
with ProcessPoolExecutor() as executor:
    # Get all files as a list
    files = vexy_glob.find("**/*.py", as_list=True)
    
    # Process in parallel
    results = executor.map(process_file, files)
    
    # Collect results
    total_lines = 0
    for path, lines in results:
        total_lines += lines
        if lines > 1000:
            print(f"Large file: {path} ({lines} lines)")
    
    print(f"\nTotal lines of code: {total_lines:,}")

Migration Guide

Migrating from glob

# OLD: Using glob
import glob
import os

# Find all Python files
files = glob.glob("**/*.py", recursive=True)

# Filter by size manually
large_files = []
for f in files:
    if os.path.getsize(f) > 1024 * 1024:  # 1MB
        large_files.append(f)

# NEW: Using vexy_glob
import vexy_glob

# Find large Python files directly
large_files = vexy_glob.find("**/*.py", min_size=1024*1024, as_list=True)

Migrating from pathlib

# OLD: Using pathlib
from pathlib import Path

# Find all Python files
files = list(Path(".").rglob("*.py"))

# Filter by modification time manually
import datetime
recent = []
for f in files:
    if f.stat().st_mtime > (datetime.datetime.now() - datetime.timedelta(days=7)).timestamp():
        recent.append(f)

# NEW: Using vexy_glob
import vexy_glob

# Find recent Python files directly
recent = vexy_glob.find("**/*.py", mtime_after="-7d", as_path=True, as_list=True)

Migrating from os.walk

# OLD: Using os.walk
import os

# Find all .txt files
txt_files = []
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith(".txt"):
            txt_files.append(os.path.join(root, file))

# NEW: Using vexy_glob
import vexy_glob

# Much simpler and faster!
txt_files = vexy_glob.find("**/*.txt", as_list=True)

Development

This project is built with maturin - a tool for building and publishing Rust-based Python extensions.

Prerequisites

  • Python 3.8 or later
  • Rust toolchain (install from rustup.rs)
  • uv for fast Python package management (optional but recommended)

Setting Up Development Environment

# Clone the repository
git clone https://github.com/vexyart/vexy-glob.git
cd vexy-glob

# Set up a virtual environment (using uv for faster installation)
pip install uv
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
uv sync

# Build the Rust extension in development mode
python sync_version.py  # Sync version from git tags to Cargo.toml
maturin develop

# Run tests
pytest tests/

# Run benchmarks
pytest tests/test_benchmarks.py -v --benchmark-only

Building Release Artifacts

The project uses a streamlined build system with automatic versioning from git tags.

Quick Build

# Build both wheel and source distribution
./build.sh

This script will:

  1. Sync the version from git tags to Cargo.toml
  2. Build an optimized wheel for your platform
  3. Build a source distribution (sdist)
  4. Place all artifacts in the dist/ directory

Manual Build

# Ensure you have the latest tags
git fetch --tags

# Sync version to Cargo.toml
python sync_version.py

# Build wheel (platform-specific)
python -m maturin build --release -o dist/

# Build source distribution
python -m maturin sdist -o dist/

Build System Details

The project uses:

  • maturin as the build backend for creating Python wheels from Rust code
  • setuptools-scm for automatic versioning based on git tags
  • sync_version.py to synchronize versions between git tags and Cargo.toml

Key files:

  • pyproject.toml - Python project configuration with maturin as build backend
  • Cargo.toml - Rust project configuration
  • sync_version.py - Version synchronization script
  • build.sh - Convenience build script

Versioning

Versions are managed through git tags:

# Create a new version tag
git tag v1.0.4
git push origin v1.0.4

# Build with the new version
./build.sh

The version will be automatically detected and used for both the Python package and Rust crate.

Project Structure

vexy-glob/
├── src/                    # Rust source code
│   ├── lib.rs             # Main Rust library with PyO3 bindings
│   └── ...
├── vexy_glob/             # Python package
│   ├── __init__.py        # Python API wrapper
│   ├── __main__.py        # CLI implementation
│   └── ...
├── tests/                 # Python tests
│   ├── test_*.py          # Unit and integration tests
│   └── test_benchmarks.py # Performance benchmarks
├── Cargo.toml             # Rust project configuration
├── pyproject.toml         # Python project configuration
├── sync_version.py        # Version synchronization script
└── build.sh               # Build automation script

CI/CD

The project uses GitHub Actions for continuous integration:

  • Testing on Linux, macOS, and Windows
  • Python versions 3.8 through 3.12
  • Automatic wheel building for releases
  • Cross-platform compatibility testing

Exceptions and Error Handling

Exception Hierarchy

VexyGlobError(Exception)
├── PatternError(VexyGlobError, ValueError)
   └── Raised for invalid glob patterns
├── SearchError(VexyGlobError, IOError)  
   └── Raised for I/O or permission errors
└── TraversalNotSupportedError(VexyGlobError, NotImplementedError)
    └── Raised for unsupported operations

Error Handling Examples

import vexy_glob
from vexy_glob import VexyGlobError, PatternError, SearchError

try:
    # Invalid pattern
    for path in vexy_glob.find("[invalid"):
        print(path)
except PatternError as e:
    print(f"Invalid pattern: {e}")

try:
    # Permission denied or I/O error
    for path in vexy_glob.find("**/*", root="/root"):
        print(path)
except SearchError as e:
    print(f"Search failed: {e}")

# Handle any vexy_glob error
try:
    results = vexy_glob.find("**/*.py", content="[invalid regex")
except VexyGlobError as e:
    print(f"Operation failed: {e}")

Platform-Specific Considerations

Windows

  • Use forward slashes / in patterns (automatically converted)
  • Hidden files: Files with hidden attribute are included with hidden=True
  • Case sensitivity: Windows is case-insensitive by default
# Windows-specific examples
import vexy_glob

# These are equivalent on Windows
vexy_glob.find("C:/Users/*/Documents/*.docx")
vexy_glob.find("C:\\Users\\*\\Documents\\*.docx")  # Also works

# Find hidden files on Windows
for path in vexy_glob.find("**/*", hidden=True):
    print(path)

macOS

  • .DS_Store files are excluded by default (via .gitignore)
  • Case sensitivity depends on file system (usually case-insensitive)
# macOS-specific examples
import vexy_glob

# Exclude .DS_Store and other macOS metadata
for path in vexy_glob.find("**/*", exclude=["**/.DS_Store", "**/.Spotlight-V100", "**/.Trashes"]):
    print(path)

Linux

  • Always case-sensitive
  • Hidden files start with .
  • Respects standard Unix permissions
# Linux-specific examples
import vexy_glob

# Find files in home directory config
for path in vexy_glob.find("~/.config/**/*.conf", hidden=True):
    print(path)

Troubleshooting

Common Issues

1. No results found

# Check if you need hidden files
results = list(vexy_glob.find("*"))
if not results:
    # Try with hidden files
    results = list(vexy_glob.find("*", hidden=True))

# Check if .gitignore is excluding files
results = list(vexy_glob.find("**/*.py", ignore_git=True))

2. Pattern not matching expected files

# Debug pattern matching
import vexy_glob

# Too specific?
print(list(vexy_glob.find("src/lib/test.py")))  # Only exact match

# Use wildcards
print(list(vexy_glob.find("src/**/test.py")))   # Any depth
print(list(vexy_glob.find("src/*/test.py")))    # One level only

3. Content search not finding matches

# Check regex syntax
import vexy_glob

# Wrong: Python regex syntax
results = vexy_glob.find("**/*.py", content=r"import\s+{re,os}")

# Correct: Standard regex
results = vexy_glob.find("**/*.py", content=r"import\s+(re|os)")

# Case sensitivity
results = vexy_glob.find("**/*.py", content="TODO", case_sensitive=False)

4. Performance issues

# Optimize your search
import vexy_glob

# Slow: Searching everything
for path in vexy_glob.find("**/*.py", content="import"):
    print(path)

# Fast: Limit scope
for path in vexy_glob.find("src/**/*.py", content="import", max_depth=3):
    print(path)

# Use exclusions
for path in vexy_glob.find(
    "**/*.py",
    exclude=["**/node_modules/**", "**/.venv/**", "**/build/**"]
):
    print(path)

Build Issues

If you encounter build issues:

  1. Rust not found: Install Rust from rustup.rs
  2. maturin not found: Run pip install maturin
  3. Version mismatch: Run python sync_version.py to sync versions
  4. Import errors: Ensure you've run maturin develop after changes
  5. Build fails: Check that you have the latest Rust stable toolchain

Debug Mode

import vexy_glob
import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

# This will show internal operations
for path in vexy_glob.find("**/*.py"):
    print(path)

FAQ

Q: Why is vexy_glob so much faster than glob?

A: vexy_glob uses Rust's parallel directory traversal, releases Python's GIL, and streams results as they're found instead of collecting everything first.

Q: Does vexy_glob follow symbolic links?

A: By default, no. Use follow_symlinks=True to enable. Loop detection is built-in.

Q: Can I use vexy_glob with async/await?

A: Yes! Use it with asyncio.to_thread():

import asyncio
import vexy_glob

async def find_files():
    return await asyncio.to_thread(
        vexy_glob.find, "**/*.py", as_list=True
    )

Q: How do I search in multiple directories?

A: Call find() multiple times or use a common parent:

# Option 1: Multiple calls
results = []
for root in ["src", "tests", "docs"]:
    results.extend(vexy_glob.find("**/*.py", root=root, as_list=True))

# Option 2: Common parent with specific patterns
results = vexy_glob.find("{src,tests,docs}/**/*.py", as_list=True)

Q: Is the content search as powerful as ripgrep?

A: Yes! It uses the same grep-searcher crate that powers ripgrep, including SIMD optimizations.

Advanced Configuration

Custom Ignore Files

import vexy_glob

# By default, respects .gitignore
for path in vexy_glob.find("**/*.py"):
    print(path)

# Also respects .ignore and .fdignore files
# Create .ignore in your project root:
# echo "test_*.py" > .ignore

# Now test files will be excluded
for path in vexy_glob.find("**/*.py"):
    print(path)  # test_*.py files excluded

Thread Configuration

import vexy_glob
import os

# Auto-detect (default)
for path in vexy_glob.find("**/*.py"):
    pass

# Limit threads for CPU-bound operations
for match in vexy_glob.find("**/*.py", content="TODO", threads=2):
    pass

# Max parallelism for I/O-bound operations
cpu_count = os.cpu_count() or 4
for path in vexy_glob.find("**/*", threads=cpu_count * 2):
    pass

Contributing

We welcome contributions! Here's how to get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature-name)
  3. Make your changes
  4. Run tests (pytest tests/)
  5. Format code (cargo fmt for Rust, ruff format for Python)
  6. Commit with descriptive messages
  7. Push and open a pull request

Before submitting:

  • Ensure all tests pass
  • Add tests for new functionality
  • Update documentation as needed
  • Follow existing code style

Running the Full Test Suite

# Python tests
pytest tests/ -v

# Python tests with coverage
pytest tests/ --cov=vexy_glob --cov-report=html

# Rust tests
cargo test

# Benchmarks
pytest tests/test_benchmarks.py -v --benchmark-only

# Linting
cargo clippy -- -D warnings
ruff check .

API Stability and Versioning

vexy_glob follows Semantic Versioning:

  • Major version (1.x.x): Breaking API changes
  • Minor version (x.1.x): New features, backwards compatible
  • Patch version (x.x.1): Bug fixes only

Stable API Guarantees

The following are guaranteed stable in 1.x:

  • find() function signature and basic parameters
  • glob() and iglob() compatibility functions
  • SearchResult object attributes
  • Exception hierarchy
  • CLI command structure

Experimental Features

Features marked experimental may change:

  • Thread count optimization algorithms
  • Internal buffer size tuning
  • Specific error message text

Performance Tuning Guide

For Maximum Speed

import vexy_glob

# 1. Be specific with patterns
# Slow:
vexy_glob.find("**/*.py")
# Fast:
vexy_glob.find("src/**/*.py")

# 2. Use depth limits when possible
vexy_glob.find("**/*.py", max_depth=3)

# 3. Exclude unnecessary directories
vexy_glob.find(
    "**/*.py",
    exclude=["**/venv/**", "**/node_modules/**", "**/.git/**"]
)

# 4. Use file type filters
vexy_glob.find("**/*.py", file_type="f")  # Skip directories

For Memory Efficiency

# Stream results instead of collecting
# Memory efficient:
for path in vexy_glob.find("**/*"):
    process(path)  # Process one at a time

# Memory intensive:
all_files = vexy_glob.find("**/*", as_list=True)  # Loads all in memory

For I/O Optimization

# Optimize thread count based on storage type
import vexy_glob

# SSD: More threads help
for path in vexy_glob.find("**/*", threads=8):
    pass

# HDD: Fewer threads to avoid seek thrashing
for path in vexy_glob.find("**/*", threads=2):
    pass

# Network storage: Single thread might be best
for path in vexy_glob.find("**/*", threads=1):
    pass

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Built on the excellent Rust crates:
  • Inspired by tools like fd and ripgrep
  • Thanks to the PyO3 team for excellent Python-Rust bindings

Related Projects

  • fd - A simple, fast alternative to find
  • ripgrep - Recursively search directories for a regex pattern
  • walkdir - Python's built-in directory traversal
  • scandir - Better directory iteration for Python

Happy fast file finding! 🚀

If you find vexy_glob useful, please consider giving it a star on GitHub!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vexy_glob-1.0.9.tar.gz (268.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.8+macOS 10.12+ x86-64

File details

Details for the file vexy_glob-1.0.9.tar.gz.

File metadata

  • Download URL: vexy_glob-1.0.9.tar.gz
  • Upload date:
  • Size: 268.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.28.1

File hashes

Hashes for vexy_glob-1.0.9.tar.gz
Algorithm Hash digest
SHA256 e334f8fb78d0e269768c4b8537f699821611c76b2ab62cbdb3c2298715071a08
MD5 53bb38887335ff4d866f5383e520e427
BLAKE2b-256 25abbe754b19c7acea5ad55aa5311f4935ce96d38fb9b10b07ec799efefe6597

See more details on using hashes here.

File details

Details for the file vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for vexy_glob-1.0.9-cp38-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 fafc08862efcea87b309525ba0a47f1c827b969ef55ba7f5bdb9a91b59a9a324
MD5 4141c35cae85e8712e18b666c684902f
BLAKE2b-256 52a25d3c74a12fa93f6567e4c3d69c255b7cdcba58e6971f25c7b7672a288f53

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page