Fast line-based random access to large text files with optional compression

LineIndex: Fast Line-based Random Access for Text Files

LineIndex provides fast random access to lines in large text files by indexing the byte offset of each line. It is designed for very large files where you frequently need specific lines without reading the entire file.

Key Features

  • O(1) Random Access: Get any line by its number in constant time
  • Memory Efficient: Uses memory mapping and lazy loading
  • Optional Compression: Transparently handles BGZF compressed files
  • Parallel Processing: Multi-threaded line retrieval for batch operations
  • Simple API: Clean Pythonic interface with slice notation support
  • Command Line Tool: Easy access from shell scripts

Installation

# Basic installation
pip install lineindex

# With compression support
pip install "lineindex[compression]"  # quoted so shells such as zsh do not expand the brackets

# For developers
pip install "lineindex[dev]"

Quick Start

Python API

from lineindex import LineIndex

# Create an index for a large file
db = LineIndex("bigfile.txt")

# Get a single line
line = db[1000]  # get the 1001st line (0-indexed)

# Get a range of lines
lines = db[1000:1010]  # get 10 lines

# Get every other line in a range
lines = db[1000:1100:2]  # get every other line

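# Use parallel processing for better performance with large slices.
# A bare 1000:2000 is only valid inside [], so a slice object is passed here
# (this assumes get() accepts one).
lines = db.get(slice(1000, 2000), workers=-1)  # use all available CPU cores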

# With header skipping (useful for CSV files)
db = LineIndex("data.csv", header=True)
first_data_row = db[0]  # skips the header row

# With compression
db = LineIndex("bigfile.txt", compress=True)  # Creates bigfile.txt.dz

# Using the example module (creates example.txt in current directory)
from lineindex import example
db = LineIndex("example.txt")  # Use the auto-created example file

# Or create a custom example
from lineindex.example import create_example_file
create_example_file("custom.txt", num_lines=5000)
db = LineIndex("custom.txt")

Command Line Interface

# Index and compress a file
lineindex file bigfile.txt --compress

# Get a single line
lineindex file bigfile.txt 1000

# Get a range of lines
lineindex file bigfile.txt 1000:1010

# Get every other line with line numbers
lineindex file bigfile.txt 1000:1100:2 --line-numbers

# Use multiple threads for better performance
lineindex file bigfile.txt 1000:2000 --threads 4

# Skip header line (useful for CSV files)
lineindex file data.csv 0 --header

# Create an example file with 1000 lines
lineindex example

# Create an example file with custom number of lines
lineindex example --lines 5000 --output my_example.txt

Note: For backward compatibility, you can omit the file command, e.g., lineindex bigfile.txt 1000.

How It Works

LineIndex creates a binary index file (.idx) containing the byte offset of each line in the file. This allows for O(1) access to any line by seeking directly to its byte position. The index is created once and reused for subsequent accesses.

For compressed files, LineIndex uses the BGZF format (via the idzip package) which preserves random access capabilities despite compression.
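
The uncompressed case described above can be sketched in a few lines of standard-library Python. This is a minimal illustration of a byte-offset index, not LineIndex's actual implementation: the function names `build_index` and `read_line` and the 8-byte little-endian offset layout are assumptions made for the example, and the real .idx format may differ.

```python
import struct

OFFSET_SIZE = 8  # assumed layout: one unsigned 64-bit little-endian offset per line


def build_index(text_path, index_path):
    """Record the starting byte offset of every line; return the line count."""
    count = 0
    offset = 0
    with open(text_path, "rb") as src, open(index_path, "wb") as idx:
        for line in src:  # binary mode, so len(line) is an exact byte count
            idx.write(struct.pack("<Q", offset))
            offset += len(line)
            count += 1
    return count


def read_line(text_path, index_path, n):
    """Return line n (0-indexed) using two seeks instead of a scan."""
    if n < 0:
        raise IndexError("line number must be non-negative")
    with open(index_path, "rb") as idx:
        idx.seek(n * OFFSET_SIZE)  # fixed-width entries make this O(1)
        raw = idx.read(OFFSET_SIZE)
    if len(raw) < OFFSET_SIZE:
        raise IndexError(f"line {n} is past the end of the file")
    (offset,) = struct.unpack("<Q", raw)
    with open(text_path, "rb") as src:
        src.seek(offset)
        return src.readline().rstrip(b"\r\n").decode("utf-8")
```

Because every index entry has the same width, finding the offset of line n is a single multiplication and seek, which is what makes the lookup independent of file size. Building the index still requires one full pass over the file.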

Performance

LineIndex is designed for high performance:

  • Uses memory mapping for efficient file access
  • Employs vectorized NumPy operations for batch retrieval
  • Supports multi-threaded line fetching
  • Optimizes disk access patterns
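
To make two of these points concrete, here is a small sketch that combines memory mapping with multi-threaded fetching, using only the standard library. It is not taken from LineIndex: `line_offsets` and `fetch_lines` are hypothetical names, the offsets are rebuilt in memory on every call rather than loaded from an index file, and plain Python lists stand in for the NumPy operations mentioned above.

```python
import mmap
from concurrent.futures import ThreadPoolExecutor


def line_offsets(buf):
    """Return the start offset of every line in a bytes-like buffer."""
    offsets = [0] if len(buf) else []
    pos = buf.find(b"\n")
    # A newline at the very end of the buffer does not start a new line.
    while pos != -1 and pos + 1 < len(buf):
        offsets.append(pos + 1)
        pos = buf.find(b"\n", pos + 1)
    return offsets


def fetch_lines(path, line_numbers, workers=4):
    """Fetch the requested lines (0-indexed) from a non-empty file, in request order."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offsets = line_offsets(mm)
        size = len(mm)

        def get(n):
            if n < 0 or n >= len(offsets):
                raise IndexError(f"line {n} out of range")
            start = offsets[n]
            end = offsets[n + 1] if n + 1 < len(offsets) else size
            # Slicing the map copies only the bytes of this one line.
            return mm[start:end].rstrip(b"\r\n").decode("utf-8")

        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(get, line_numbers))
```

Slicing a memory map does not depend on a shared file position, so the worker threads need no lock. Whether threads actually speed things up depends on the workload and storage, so measure before relying on it.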

Use Cases

  • Log Analysis: Quickly access specific log entries by line number
  • Data Processing: Extract samples from large datasets without loading everything
  • Text Mining: Randomly access lines for batch processing
  • Machine Learning: Efficiently retrieve training examples from large text corpora
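
As an illustration of the sampling use cases, the sketch below draws a random subset of lines from any object that supports integer indexing, as `db[1000]` does in the Quick Start. The helper `sample_lines` is hypothetical and not part of LineIndex. The total line count is passed in explicitly because this page does not say whether a LineIndex object supports `len()`.

```python
import random


def sample_lines(db, n_lines, k, seed=None):
    """Return k distinct (line_number, line) pairs chosen uniformly at random."""
    rng = random.Random(seed)  # a private generator keeps the sample reproducible
    picks = sorted(rng.sample(range(n_lines), k))  # ascending order keeps reads sequential
    return [(i, db[i]) for i in picks]


# Usage with a plain list standing in for an indexed file:
rows = [f"row {i}" for i in range(100)]
subset = sample_lines(rows, len(rows), 5, seed=0)
```

Reading the picked lines in ascending order is a design choice for the example: it tends to be friendlier to disk caches than jumping around the file.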

Requirements

  • Python 3.8 or higher
  • NumPy
  • tqdm (for progress bars)
  • python-idzip (optional, for compression support)

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
