Vexy Glob fast file finding

vexy_glob - Path Accelerated Finding in Rust

vexy_glob is a high-performance Python extension for file system traversal and content searching, built with Rust. It provides a faster and more feature-rich alternative to Python's built-in glob (up to 6x faster) and pathlib (up to 12x faster) modules.

TL;DR

Installation:

pip install vexy_glob

Quick Start:

Find all Python files in the current directory and its subdirectories:

import vexy_glob

for path in vexy_glob.find("**/*.py"):
    print(path)

Find all files containing the text "import asyncio":

for match in vexy_glob.find("**/*.py", content="import asyncio"):
    print(f"{match.path}:{match.line_number}: {match.line_text}")

What is vexy_glob?

vexy_glob is a Python library that provides a powerful and efficient way to find files and search for content within them. It's built on top of the excellent Rust crates ignore (for file traversal) and grep-searcher (for content searching), which are the same engines powering tools like fd and ripgrep.

This means you get the speed and efficiency of Rust, with the convenience and ease of use of Python.

Key Features

  • Blazing Fast: 10-100x faster than Python's glob and pathlib for many use cases.
  • Streaming Results: Get the first results in milliseconds, without waiting for the entire file system scan to complete.
  • Memory Efficient: vexy_glob uses constant memory, regardless of the number of files or results.
  • Parallel Execution: Utilizes all your CPU cores to get the job done as quickly as possible.
  • Content Searching: Ripgrep-style content searching with regex support.
  • Rich Filtering: Filter files by size, modification time, and more.
  • Smart Defaults: Automatically respects .gitignore files and skips hidden files and directories.
  • Cross-Platform: Works on Linux, macOS, and Windows.

How it Works

vexy_glob uses a Rust-powered backend to perform the heavy lifting of file system traversal and content searching. The Rust extension releases Python's Global Interpreter Lock (GIL), allowing for true parallelism and a significant performance boost.

Results are streamed back to Python as they are found, using a producer-consumer architecture with crossbeam channels. This means you can start processing results immediately, without having to wait for the entire search to finish.
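To make the streaming idea concrete, here is a minimal pure-Python sketch of a producer-consumer pipeline. It is not vexy_glob's implementation (which lives in Rust and uses crossbeam channels); it only imitates the shape with a standard-library thread and a bounded queue. The function name stream_paths and its parameters are hypothetical.

```python
import os
import queue
import threading
from typing import Iterator

_DONE = object()  # sentinel that marks the end of the stream


def stream_paths(root: str, suffix: str, buffer_size: int = 64) -> Iterator[str]:
    """Yield matching paths while a background thread is still walking `root`."""
    results: queue.Queue = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        # Walk the tree and push each match as soon as it is seen.
        try:
            for dirpath, _dirnames, filenames in os.walk(root):
                for name in filenames:
                    if name.endswith(suffix):
                        results.put(os.path.join(dirpath, name))
        finally:
            results.put(_DONE)  # always signal completion, even on error

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the interpreter alive at exit.
    threading.Thread(target=producer, daemon=True).start()

    # Consumer side: hand results to the caller one at a time.
    while True:
        item = results.get()
        if item is _DONE:
            return
        yield item
```

The bounded queue is what keeps memory flat in this sketch: the producer pauses when the buffer is full, and the caller sees the first path as soon as one is found.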

Why use vexy_glob?

If you find yourself writing scripts that need to find files based on patterns, or search for content within files, vexy_glob can be a game-changer. It's particularly useful for:

  • Large codebases: Quickly find files or code snippets in large projects.
  • Log file analysis: Search through gigabytes of logs in seconds.
  • Data processing pipelines: Efficiently find and process files based on various criteria.
  • Anywhere you need to find files fast!

Installation and Usage

Python Library

Install vexy_glob using pip:

pip install vexy_glob

Then use it in your Python code:

import vexy_glob

# Find all Python files
for path in vexy_glob.find("**/*.py"):
    print(path)

Command-Line Interface

vexy_glob also provides a powerful command-line interface for finding files and searching content directly from your terminal.

Finding Files

Use vexy_glob find to locate files matching glob patterns:

# Find all Python files
vexy_glob find "**/*.py"

# Find all markdown files larger than 10KB
vexy_glob find "**/*.md" --min-size 10k

# Find all log files modified in the last 2 days
vexy_glob find "*.log" --mtime-after -2d

# Find only directories
vexy_glob find "*" --type d

# Include hidden files
vexy_glob find "*" --hidden

# Limit search depth
vexy_glob find "**/*.txt" --depth 2

Searching Content

Use vexy_glob search to find content within files:

# Search for "import asyncio" in Python files
vexy_glob search "**/*.py" "import asyncio"

# Search for function definitions using regex
vexy_glob search "src/**/*.rs" "fn\\s+\\w+"

# Search without color output (for piping)
vexy_glob search "**/*.md" "TODO|FIXME" --no-color

# Case-sensitive search
vexy_glob search "*.txt" "Error" --case-sensitive

Command-Line Options

Common options for both find and search:

  • --root: Root directory to start search (default: current directory)
  • --min-size: Minimum file size (e.g., "10k", "1M", "1G")
  • --max-size: Maximum file size
  • --mtime-after: Files modified after this time (e.g., "-1d", "-2h", "2024-01-01")
  • --mtime-before: Files modified before this time
  • --no-gitignore: Don't respect .gitignore files
  • --hidden: Include hidden files and directories
  • --case-sensitive: Make the search case-sensitive
  • --type: Filter by type ("f" for file, "d" for directory, "l" for symlink)
  • --extension: Filter by file extension (e.g., "py", "md")
  • --depth: Maximum search depth

Additional options for search:

  • --no-color: Disable colored output
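The time formats accepted by --mtime-after and --mtime-before can be read as "an offset back from now" or "an absolute date". The sketch below illustrates that reading in Python. It is an assumption-laden illustration, not the tool's parser: the helper name parse_time_spec is hypothetical, and it covers only the forms shown above ("-1d", "-2h", and ISO dates).

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Only the units that appear in the examples above: days and hours.
_RELATIVE = re.compile(r"^-(\d+)([dh])$")
_UNITS = {"d": "days", "h": "hours"}


def parse_time_spec(spec: str, now: Optional[datetime] = None) -> datetime:
    """Turn '-1d', '-2h', or an ISO date into an absolute datetime."""
    now = now or datetime.now()
    match = _RELATIVE.match(spec)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now - timedelta(**{_UNITS[unit]: amount})
    # Anything else is treated as an ISO date; invalid input raises ValueError.
    return datetime.fromisoformat(spec)
```

With this reading, --mtime-after -2d keeps files whose modification time is later than the moment two days before the search started.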

Unix Pipeline Integration

vexy_glob works seamlessly with Unix pipelines:

# Count Python files
vexy_glob find "**/*.py" | wc -l

# Find Python files containing "async" and edit them
vexy_glob search "**/*.py" "async" --no-color | cut -d: -f1 | sort -u | xargs $EDITOR

# Find large log files and show their sizes
vexy_glob find "*.log" --min-size 100M | xargs ls -lh

# Search for TODOs and format as tasks
vexy_glob search "**/*.py" "TODO" --no-color | awk -F: '{print "- [ ] " $1 ":" $2 ": " $3}'
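When cut and awk become awkward, the same output can be post-processed in Python. The sketch below assumes the search output has the form path:line_number:text, which is inferred from the cut and awk examples above rather than from a documented format. The helper name group_matches is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_matches(lines: Iterable[str]) -> Dict[str, List[Tuple[int, str]]]:
    """Group 'path:line_number:text' lines by path."""
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for raw in lines:
        # Split on the first two colons only, so colons inside the
        # matched text are preserved.
        parts = raw.rstrip("\n").split(":", 2)
        if len(parts) != 3 or not parts[1].isdigit():
            continue  # skip lines that do not fit the assumed format
        path, line_no, text = parts
        grouped[path].append((int(line_no), text))
    return dict(grouped)
```

Feeding it sys.stdin from a script at the end of a pipeline yields one entry per file. Paths that themselves contain a colon (for example Windows drive letters) would need a smarter split.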

Detailed Python API

Finding Files

The main entry point is the vexy_glob.find() function. It returns an iterator that yields file paths as strings.

import vexy_glob

# Find all markdown files
for path in vexy_glob.find("**/*.md"):
    print(path)

# Find all files in the 'src' directory
for path in vexy_glob.find("src/**/*"):
    print(path)

Content Searching

To search for content within files, use the content parameter. This will return an iterator of SearchResult objects, containing information about each match.

import vexy_glob

for match in vexy_glob.find("*.py", content="import requests"):
    print(f"Found a match in {match.path} on line {match.line_number}:")
    print(f"  {match.line_text.strip()}")

The SearchResult object has the following attributes:

  • path: The path to the file containing the match.
  • line_number: The line number of the match.
  • line_text: The text of the line containing the match.
  • matches: A list of matched strings on the line.
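As an illustration of how the matches attribute can be used, the sketch below tallies which strings were matched across a set of results, for example to see how often each alternative of a pattern like "TODO|FIXME" occurs. FakeResult is a hypothetical stand-in that carries only the four attributes listed above; it is not the library's SearchResult class.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class FakeResult(NamedTuple):
    """Stand-in with the same attribute names as the list above."""
    path: str
    line_number: int
    line_text: str
    matches: List[str]


def count_matched_strings(results: Iterable[FakeResult]) -> Counter:
    """Count every matched string across all results."""
    counts: Counter = Counter()
    for result in results:
        counts.update(result.matches)
    return counts
```

Because the function only reads the matches attribute, any object exposing that attribute should work the same way.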

Filtering

vexy_glob supports a variety of filtering options:

  • File size: min_size and max_size (in bytes, or use vexy_glob.parse_size() for human-readable formats)
  • Modification time: mtime_after and mtime_before (accepts relative times like "-1d", ISO dates, datetime objects, and Unix timestamps)
  • Access time: atime_after and atime_before
  • Creation time: ctime_after and ctime_before
  • File type: file_type ("f" for files, "d" for directories, "l" for symlinks)
  • Extensions: extension (string or list of strings)
  • Exclusions: exclude (glob patterns to exclude)
  • Symlinks: follow_symlinks (whether to follow symbolic links)

import vexy_glob
from datetime import datetime, timedelta

# Find all log files larger than 1MB modified in the last 24 hours
one_day_ago = datetime.now() - timedelta(days=1)
for path in vexy_glob.find(
    "*.log",
    min_size=1024*1024,  # 1MB in bytes
    mtime_after=one_day_ago
):
    print(path)

# Exclude certain patterns
for path in vexy_glob.find("**/*.py", exclude=["*test*", "*__pycache__*"]):
    print(path)

# Find only directories
for path in vexy_glob.find("**/*", file_type="d"):
    print(path)
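Since min_size and max_size take bytes, it can help to see what a human-readable size such as "10k" or "1M" works out to. The sketch below is a hypothetical converter, not the library's parse_size(). It assumes binary multiples (1k = 1024 bytes), in line with the "1024*1024" used for 1MB in the example above.

```python
import re

# Number, optional unit letter, optional trailing "b" (e.g. "10k", "1MB").
_SIZE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([kmgt]?)b?\s*$", re.IGNORECASE)
_MULTIPLIERS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def to_bytes(text: str) -> int:
    """Convert a size such as '10k', '1M', or '1G' to a byte count."""
    match = _SIZE.match(text)
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _MULTIPLIERS[unit.lower()])
```

The resulting integer can be passed wherever a byte count is expected, such as min_size.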

Drop-in Replacements

vexy_glob provides drop-in replacements for standard library functions:

# Replace glob.glob()
import vexy_glob
files = vexy_glob.glob("**/*.py", recursive=True)

# Replace glob.iglob()
for path in vexy_glob.iglob("**/*.py", recursive=True):
    print(path)
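Because of the smart defaults described earlier (respecting .gitignore, skipping hidden entries), a replacement may not return exactly the same set of paths as the standard library. The sketch below is one way to check before switching. The helper name compare_globbers is hypothetical, and it works with any two callables that take a pattern and return paths.

```python
import os
from typing import Callable, Iterable, Set, Tuple

Globber = Callable[[str], Iterable[str]]


def compare_globbers(
    reference: Globber, candidate: Globber, pattern: str
) -> Tuple[Set[str], Set[str]]:
    """Return (paths only in reference, paths only in candidate)."""
    # Normalize so that './a.py' and 'a.py' compare equal.
    ref = {os.path.normpath(p) for p in reference(pattern)}
    cand = {os.path.normpath(p) for p in candidate(pattern)}
    return ref - cand, cand - ref
```

For example, passing lambda p: glob.glob(p, recursive=True) as the reference and lambda p: vexy_glob.glob(p, recursive=True) as the candidate shows which files each side reports that the other does not.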

Performance

Benchmarks on a directory with 100,000 files:

Operation             glob.glob()  vexy_glob  Speedup
Find all .py files    15.2s        0.2s       76x
Time to first result  15.2s        0.005s     3040x
Memory usage          1.2GB        45MB       27x less
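To check numbers like these on your own directory tree, a small timing harness is enough. The sketch below is not the benchmark script behind the table. It is a hypothetical helper, time_search, that measures time to first result and total time for any callable returning an iterable of paths.

```python
import time
from typing import Callable, Iterable, Tuple


def time_search(make_iter: Callable[[], Iterable[str]]) -> Tuple[float, float, int]:
    """Return (seconds to first result, total seconds, result count)."""
    start = time.perf_counter()
    first = None
    count = 0
    # make_iter() runs after the clock starts, so a function that builds
    # a full list up front reports a first-result time close to its total.
    for _ in make_iter():
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return (first if first is not None else total), total, count
```

Calling time_search(lambda: glob.glob("**/*.py", recursive=True)) and time_search(lambda: vexy_glob.find("**/*.py")) in turn gives comparable figures. Run each several times, since file system caching affects the first run.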

Development

This project is built with maturin. To get started, you'll need Rust and Python installed.

# Clone the repository
git clone https://github.com/vexyart/vexy-glob.git
cd vexy-glob

# Set up a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Build the Rust extension in development mode
maturin develop

License

This project is licensed under the MIT License. See the LICENSE file for details.
