Fast line-based random access to large text files with optional compression
Project description
LineIndex: Fast Line-based Random Access for Text Files
LineIndex provides lightning-fast random access to lines in large text files through efficient indexing. It's designed to handle very large files where you need to frequently access specific lines without reading the entire file.
Key Features
- O(1) Random Access: Get any line by its number in constant time
- Memory Efficient: Uses memory mapping and lazy loading
- Optional Compression: Transparently handles BGZF compressed files
- Parallel Processing: Multi-threaded line retrieval for batch operations
- Simple API: Clean Pythonic interface with slice notation support
- Command Line Tool: Easy access from shell scripts
Installation
# Basic installation
pip install lineindex
# With compression support
pip install lineindex[compression]
# For developers
pip install lineindex[dev]
First Test
# Creates example.txt with 1000 lines in current dir
lineindex example
# Indexes the file
lineindex example.txt
# Fetches line 5
lineindex example.txt 5
# Fetches lines 10 to 20
lineindex example.txt 10:20
Quick Start
Python API
from lineindex import LineIndex
# Create an index for a large file
db = LineIndex("bigfile.txt")
# Get a single line
line = db[1000] # get the 1001st line (0-indexed)
# Get a range of lines
lines = db[1000:1010] # get 10 lines
# Get every other line in a range
lines = db[1000:1100:2] # get every other line
# Use parallel processing for better performance with large slices
lines = db.get(1000:2000, workers=-1) # use all available CPU cores
# With header skipping (useful for CSV files)
db = LineIndex("data.csv", header=True)
first_data_row = db[0] # skips the header row
# With compression
db = LineIndex("bigfile.txt", compress=True) # Creates bigfile.txt.dz
# Using the example module (creates example.txt in current directory)
from lineindex import example
db = LineIndex("example.txt") # Use the auto-created example file
# Or create a custom example
from lineindex.example import create_example_file
create_example_file("custom.txt", num_lines=5000)
db = LineIndex("custom.txt")
Command Line Interface
# Index and compress a file
lineindex bigfile.txt --compress
# Get a single line
lineindex bigfile.txt 1000
# Get a range of lines
lineindex bigfile.txt 1000:1010
# Get every other line with line numbers
lineindex bigfile.txt 1000:1100:2 --line-numbers
# Use multiple threads for better performance
lineindex bigfile.txt 1000:2000 --threads 4
# Skip header line (useful for CSV files)
lineindex data.csv 0 --header
# Create an example file with 1000 lines
lineindex example
# Create an example file with custom number of lines
lineindex example --lines 5000 --output my_example.txt
Note: For backward compatibility, you can omit the
filecommand, e.g.,lineindex bigfile.txt 1000.
How It Works
LineIndex creates a binary index file (.idx) containing the byte offset of each line in the file. This allows for O(1) access to any line by seeking directly to its byte position. The index is created once and reused for subsequent accesses.
For compressed files, LineIndex uses the BGZF format (via the idzip package) which preserves random access capabilities despite compression.
Performance
LineIndex is designed for high performance:
- Uses memory mapping for efficient file access
- Employs vectorized NumPy operations for batch retrieval
- Supports multi-threaded line fetching
- Optimizes disk access patterns
Use Cases
- Log Analysis: Quickly access specific log entries by line number
- Data Processing: Extract samples from large datasets without loading everything
- Text Mining: Randomly access lines for batch processing
- Machine Learning: Efficiently retrieve training examples from large text corpora
Requirements
- Python 3.8 or higher
- NumPy
- Tqdm (for progress bars)
- python-idzip (optional, for compression support)
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file lineindex-0.1.4.tar.gz.
File metadata
- Download URL: lineindex-0.1.4.tar.gz
- Upload date:
- Size: 13.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
67713a4f0a4341aaec9e8cf439e307f066d5de9beec749234a15a5b52a9faeca
|
|
| MD5 |
0ba720523a93ad960a4775b7718698f5
|
|
| BLAKE2b-256 |
6a0ae7a38ce8a2115053e6dc91337b4b3ea0bbcc6f9e27365da894ff8c2932b6
|
File details
Details for the file lineindex-0.1.4-py3-none-any.whl.
File metadata
- Download URL: lineindex-0.1.4-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.18
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b9d0da56a8ef52b76275d148ed7bb324e70e51bae7e387cfca04e50a8b4f1c4a
|
|
| MD5 |
eafa8f44302bd42a9c3b011476936ac3
|
|
| BLAKE2b-256 |
6b140cff784bcece6b53cd2dc8b8eba7c14e8e007afcdd497aebadfb07b59bc4
|