Vexy Glob fast file finding
vexy_glob - Path Accelerated Finding in Rust
vexy_glob is a high-performance Python extension for file system traversal and content searching, built with Rust. It provides a faster and more feature-rich alternative to Python's built-in glob (up to 6x faster) and pathlib (up to 12x faster) modules.
TL;DR
Installation:
pip install vexy_glob
Quick Start:
Find all Python files in the current directory and its subdirectories:
import vexy_glob
for path in vexy_glob.find("**/*.py"):
    print(path)
Find all files containing the text "import asyncio":
for match in vexy_glob.find("**/*.py", content="import asyncio"):
    print(f"{match.path}:{match.line_number}: {match.line_text}")
What is vexy_glob?
vexy_glob is a Python library that provides a powerful and efficient way to find files and search for content within them. It's built on top of the excellent Rust crates ignore (for file traversal) and grep-searcher (for content searching), which are the same engines powering tools like fd and ripgrep.
This means you get the speed and efficiency of Rust, with the convenience and ease of use of Python.
Key Features
- Blazing Fast: 10-100x faster than Python's glob and pathlib for many use cases.
- Streaming Results: Get the first results in milliseconds, without waiting for the entire file system scan to complete.
- Memory Efficient: vexy_glob uses constant memory, regardless of the number of files or results.
- Parallel Execution: Utilizes all your CPU cores to get the job done as quickly as possible.
- Content Searching: Ripgrep-style content searching with regex support.
- Rich Filtering: Filter files by size, modification time, and more.
- Smart Defaults: Automatically respects .gitignore files and skips hidden files and directories.
- Cross-Platform: Works on Linux, macOS, and Windows.
How it Works
vexy_glob uses a Rust-powered backend to perform the heavy lifting of file system traversal and content searching. The Rust extension releases Python's Global Interpreter Lock (GIL), allowing for true parallelism and a significant performance boost.
Results are streamed back to Python as they are found, using a producer-consumer architecture with crossbeam channels. This means you can start processing results immediately, without having to wait for the entire search to finish.
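The streaming idea can be illustrated in pure Python. The following is a minimal sketch, not vexy_glob's implementation: a background thread and a bounded queue.Queue stand in for the Rust worker threads and the crossbeam channel, and stream_files and its parameters are hypothetical names chosen for illustration.

```python
import fnmatch
import os
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stream_files(root, pattern, buffer_size=64):
    """Yield matching paths while a background thread is still walking."""
    results = queue.Queue(maxsize=buffer_size)

    def producer():
        # Walk the tree and push each match as soon as it is found.
        try:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    if fnmatch.fnmatch(name, pattern):
                        results.put(os.path.join(dirpath, name))
        finally:
            results.put(_DONE)

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the interpreter alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = results.get()
        if item is _DONE:
            return
        yield item
```

The bounded queue is what keeps memory flat in this sketch: the producer blocks once buffer_size results are waiting, so the number of buffered paths never grows with the size of the tree.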
Why use vexy_glob?
If you find yourself writing scripts that need to find files based on patterns, or search for content within files, vexy_glob can be a game-changer. It's particularly useful for:
- Large codebases: Quickly find files or code snippets in large projects.
- Log file analysis: Search through gigabytes of logs in seconds.
- Data processing pipelines: Efficiently find and process files based on various criteria.
- Anywhere you need to find files fast!
Installation and Usage
Python Library
Install vexy_glob using pip:
pip install vexy_glob
Then use it in your Python code:
import vexy_glob
# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)
Command-Line Interface
vexy_glob also provides a powerful command-line interface for finding files and searching content directly from your terminal.
Finding Files
Use vexy_glob find to locate files matching glob patterns:
# Find all Python files
vexy_glob find "**/*.py"
# Find all markdown files larger than 10KB
vexy_glob find "**/*.md" --min-size 10k
# Find all log files modified in the last 2 days
vexy_glob find "*.log" --mtime-after -2d
# Find only directories
vexy_glob find "*" --type d
# Include hidden files
vexy_glob find "*" --hidden
# Limit search depth
vexy_glob find "**/*.txt" --depth 2
Searching Content
Use vexy_glob search to find content within files:
# Search for "import asyncio" in Python files
vexy_glob search "**/*.py" "import asyncio"
# Search for function definitions using regex
vexy_glob search "src/**/*.rs" "fn\\s+\\w+"
# Search without color output (for piping)
vexy_glob search "**/*.md" "TODO|FIXME" --no-color
# Case-sensitive search
vexy_glob search "*.txt" "Error" --case-sensitive
Command-Line Options
Common options for both find and search:
- --root: Root directory to start search (default: current directory)
- --min-size: Minimum file size (e.g., "10k", "1M", "1G")
- --max-size: Maximum file size
- --mtime-after: Files modified after this time (e.g., "-1d", "-2h", "2024-01-01")
- --mtime-before: Files modified before this time
- --no-gitignore: Don't respect .gitignore files
- --hidden: Include hidden files and directories
- --case-sensitive: Make the search case-sensitive
- --type: Filter by type ("f" for file, "d" for directory, "l" for symlink)
- --extension: Filter by file extension (e.g., "py", "md")
- --depth: Maximum search depth
Additional options for search:
- --no-color: Disable colored output
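The time options accept both relative values such as "-1d" and absolute dates such as "2024-01-01". The sketch below shows one way such a value might be turned into a datetime. It is an illustration only: parse_time_spec is a hypothetical helper, and the set of unit letters (s, m, h, d, w) is an assumption, since the source only shows "d" and "h".

```python
import re
from datetime import datetime, timedelta

# Assumed unit letters; only "d" and "h" appear in the documented examples.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_time_spec(spec, now=None):
    """Convert '-2h' style offsets or ISO dates into a datetime."""
    now = now or datetime.now()
    relative = re.fullmatch(r"-(\d+)([smhdw])", spec)
    if relative:
        amount, unit = int(relative.group(1)), relative.group(2)
        return now - timedelta(**{_UNITS[unit]: amount})
    # Anything else is treated as an ISO 8601 date or datetime.
    return datetime.fromisoformat(spec)
```

Passing now explicitly makes the helper deterministic, which is convenient in tests and when several filters should share the same reference time.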
Unix Pipeline Integration
vexy_glob works seamlessly with Unix pipelines:
# Count Python files
vexy_glob find "**/*.py" | wc -l
# Find Python files containing "async" and edit them
vexy_glob search "**/*.py" "async" --no-color | cut -d: -f1 | sort -u | xargs $EDITOR
# Find large log files and show their sizes
vexy_glob find "*.log" --min-size 100M | xargs ls -lh
# Search for TODOs and format as tasks
vexy_glob search "**/*.py" "TODO" --no-color | awk -F: '{print "- [ ] " $1 ":" $2 ": " $3}'
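One caveat with the awk example: splitting on every colon drops anything after a colon inside the matched line, so a line like "# TODO: fix this" is cut short. When post-processing in Python, limiting the split avoids that. This sketch assumes the output format is path:line_number:line_text, which the cut and awk examples imply but the source does not state outright. It would also need adjusting for Windows paths that contain a drive-letter colon. The function names are hypothetical.

```python
def parse_match_line(line):
    """Split 'path:line_number:text' without breaking colons in the text."""
    path, line_number, text = line.rstrip("\n").split(":", 2)
    return path, int(line_number), text


def to_task(line):
    """Format one search hit as a Markdown-style task item."""
    path, line_number, text = parse_match_line(line)
    return f"- [ ] {path}:{line_number}: {text.strip()}"
```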
Detailed Python API
Finding Files
The main entry point is the vexy_glob.find() function. It returns an iterator that yields file paths as strings.
import vexy_glob
# Find all markdown files
for path in vexy_glob.find("**/*.md"):
    print(path)
# Find all files in the 'src' directory
for path in vexy_glob.find("src/**/*"):
    print(path)
Content Searching
To search for content within files, use the content parameter. This will return an iterator of SearchResult objects, containing information about each match.
import vexy_glob
for match in vexy_glob.find("*.py", content="import requests"):
    print(f"Found a match in {match.path} on line {match.line_number}:")
    print(f"  {match.line_text.strip()}")
The SearchResult object has the following attributes:
- path: The path to the file containing the match.
- line_number: The line number of the match.
- line_text: The text of the line containing the match.
- matches: A list of matched strings on the line.
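To make the shape of these results concrete, here is a stdlib-only stand-in. It is not vexy_glob's SearchResult class: LineMatch and search_file are hypothetical names, the regex engine is Python's re rather than the Rust one, and 1-based line numbering is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class LineMatch:
    """Hypothetical record mirroring the four documented attributes."""
    path: str
    line_number: int
    line_text: str
    matches: List[str]


def search_file(path: str, pattern: str) -> Iterator[LineMatch]:
    """Yield one LineMatch per line that contains at least one regex hit."""
    regex = re.compile(pattern)
    with open(path, encoding="utf-8", errors="replace") as handle:
        # Line numbers start at 1 here (an assumption about the real API).
        for line_number, line_text in enumerate(handle, start=1):
            found = [m.group(0) for m in regex.finditer(line_text)]
            if found:
                yield LineMatch(path, line_number, line_text.rstrip("\n"), found)
```

Note that a line with several hits produces a single record whose matches list holds every hit, which is how the attribute description above reads.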
Filtering
vexy_glob supports a variety of filtering options:
- File size: min_size and max_size (in bytes, or use vexy_glob.parse_size() for human-readable formats)
- Modification time: mtime_after and mtime_before (accepts relative times like "-1d", ISO dates, datetime objects, and Unix timestamps)
- Access time: atime_after and atime_before
- Creation time: ctime_after and ctime_before
- File type: file_type ("f" for files, "d" for directories, "l" for symlinks)
- Extensions: extension (string or list of strings)
- Exclusions: exclude (glob patterns to exclude)
- Symlinks: follow_symlinks (whether to follow symbolic links)
import vexy_glob
from datetime import datetime, timedelta
# Find all log files larger than 1MB modified in the last 24 hours
one_day_ago = datetime.now() - timedelta(days=1)
for path in vexy_glob.find(
    "*.log",
    min_size=1024*1024,  # 1MB in bytes
    mtime_after=one_day_ago
):
    print(path)
# Exclude certain patterns
for path in vexy_glob.find("**/*.py", exclude=["*test*", "*__pycache__*"]):
    print(path)
# Find only directories
for path in vexy_glob.find("**/*", file_type="d"):
    print(path)
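Since min_size and max_size take plain byte counts, a small converter for suffixed values such as "10k" or "1M" can be handy. The source mentions vexy_glob.parse_size() but does not show its signature, so the sketch below is a stand-alone, hypothetical size_to_bytes. It assumes binary multiples (1k = 1024 bytes), which is consistent with the 1024*1024 used for 1MB above but is otherwise an assumption.

```python
import re

# Assumed binary multipliers; the real library may interpret suffixes differently.
_MULTIPLIERS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def size_to_bytes(text):
    """Convert strings like '10k', '1M', '1.5G' or '512' to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmg]?)b?\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * _MULTIPLIERS[suffix])
```

The result can be passed straight to min_size or max_size, for example min_size=size_to_bytes("1M").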
Drop-in Replacements
vexy_glob provides drop-in replacements for standard library functions:
# Replace glob.glob()
import vexy_glob
files = vexy_glob.glob("**/*.py", recursive=True)
# Replace glob.iglob()
for path in vexy_glob.iglob("**/*.py", recursive=True):
    print(path)
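Because the function names and the recursive argument match the standard library, one possible pattern is to treat vexy_glob as an optional accelerator and fall back to glob when it is not installed. This is a sketch under that assumption; find_python_files is a hypothetical helper, and results may differ between backends where their defaults differ (for example around hidden or ignored files).

```python
import os

try:
    import vexy_glob as glob_impl  # fast path, if the extension is installed
except ImportError:
    import glob as glob_impl  # standard library fallback


def find_python_files(root):
    """Return .py files under root using whichever backend was imported."""
    pattern = os.path.join(root, "**", "*.py")
    return sorted(glob_impl.glob(pattern, recursive=True))
```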
Performance
Benchmarks on a directory with 100,000 files:
| Operation | glob.glob() | vexy_glob | Speedup |
|---|---|---|---|
| Find all .py files | 15.2s | 0.2s | 76x |
| Time to first result | 15.2s | 0.005s | 3040x |
| Memory usage | 1.2GB | 45MB | 27x less |
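Figures like these depend heavily on hardware, filesystem, and cache state, so it is worth measuring on your own tree. The sketch below times any iterator factory and reports time to first result, total time, and result count. measure is a hypothetical helper, not part of vexy_glob.

```python
import time


def measure(make_iterator):
    """Return (seconds_to_first_result, total_seconds, result_count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in make_iterator():
        if first is None:
            # Record the latency of the very first yielded item.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count
```

For the standard library baseline, call measure(lambda: glob.iglob("**/*.py", recursive=True)) after importing glob, then pass a lambda that calls vexy_glob.find("**/*.py") for comparison. Run each a few times, since the first pass over a directory tree is usually slower while the OS cache is cold.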
Development
This project is built with maturin. To get started, you'll need Rust and Python installed.
# Clone the repository
git clone https://github.com/vexyart/vexy-glob.git
cd vexy-glob
# Set up a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Build the Rust extension in development mode
maturin develop
License
This project is licensed under the MIT License. See the LICENSE file for details.