Fast line-based random access to large text files with optional compression

LineIndex: Fast Line-based Random Access for Text Files

LineIndex provides fast random access to lines in large text files by building an index of line positions. It is designed for very large files where you frequently need specific lines and do not want to read the entire file to reach them.

Key Features

  • O(1) Random Access: Get any line by its number in constant time
  • Memory Efficient: Uses memory mapping and lazy loading
  • Optional Compression: Transparently handles BGZF compressed files
  • Parallel Processing: Multi-threaded line retrieval for batch operations
  • Simple API: Clean Pythonic interface with slice notation support
  • Command Line Tool: Easy access from shell scripts

Installation

# Basic installation
pip install lineindex

# With compression support (quotes keep shells such as zsh from expanding the brackets)
pip install "lineindex[compression]"

# For developers
pip install "lineindex[dev]"

Quick Start

Python API

from lineindex import LineIndex

# Create an index for a large file
db = LineIndex("bigfile.txt")

# Get a single line
line = db[1000]  # get the 1001st line (0-indexed)

# Get a range of lines
lines = db[1000:1010]  # get 10 lines

# Get every other line in a range
lines = db[1000:1100:2]  # get every other line

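# Use parallel processing for better performance with large slices
# (assumes get() accepts a slice object; the 1000:2000 form is only valid syntax inside [])
lines = db.get(slice(1000, 2000), workers=-1)  # use all available CPU cores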

# With header skipping (useful for CSV files)
db = LineIndex("data.csv", header=True)
first_data_row = db[0]  # skips the header row

# With compression
db = LineIndex("bigfile.txt", compress=True)  # Creates bigfile.txt.dz

# Using the example module (creates example.txt in current directory)
from lineindex import example
db = LineIndex("example.txt")  # Use the auto-created example file

# Or create a custom example
from lineindex.example import create_example_file
create_example_file("custom.txt", num_lines=5000)
db = LineIndex("custom.txt")
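
The subscript forms above all reach a single `__getitem__` method: `db[1000]` passes an integer and `db[1000:1100:2]` passes a `slice` object. The following is a minimal sketch of that dispatch, not LineIndex's own code. The class name `ListBackedLines` is hypothetical, and it keeps its lines in a plain list so the example runs without any file.

```python
class ListBackedLines:
    """Toy stand-in for an indexed file: the lines live in a Python list."""

    def __init__(self, lines):
        self._lines = list(lines)

    def __len__(self):
        return len(self._lines)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() clamps start/stop/step to the valid range.
            start, stop, step = key.indices(len(self))
            return [self._lines[i] for i in range(start, stop, step)]
        if key < 0:
            key += len(self)  # support negative line numbers like a list
        if not 0 <= key < len(self):
            raise IndexError("line number out of range")
        return self._lines[key]


toy = ListBackedLines(f"line {i}" for i in range(10))
single = toy[3]        # integer key
subset = toy[2:8:2]    # slice key built by the [] syntax
same = toy[slice(2, 8, 2)]  # the explicit spelling of the same slice
```

Outside square brackets the colon form is not valid Python, so a slice handed to an ordinary method call has to be spelled `slice(start, stop, step)`.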

Command Line Interface

# Index and compress a file
lineindex file bigfile.txt --compress

# Get a single line
lineindex file bigfile.txt 1000

# Get a range of lines
lineindex file bigfile.txt 1000:1010

# Get every other line with line numbers
lineindex file bigfile.txt 1000:1100:2 --line-numbers

# Use multiple threads for better performance
lineindex file bigfile.txt 1000:2000 --threads 4

# Skip header line (useful for CSV files)
lineindex file data.csv 0 --header

# Create an example file with 1000 lines
lineindex example

# Create an example file with custom number of lines
lineindex example --lines 5000 --output my_example.txt

Note: For backward compatibility, you can omit the file command, e.g., lineindex bigfile.txt 1000.
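
The line arguments above (`1000`, `1000:1010`, `1000:1100:2`) follow Python's slice notation. As an illustration only, here is one way such a string could be turned into an integer or a `slice` object. The function `parse_line_spec` is a hypothetical helper and is not taken from the lineindex source.

```python
def parse_line_spec(spec):
    """Turn "5", "5:10", or "5:10:2" into an int or a slice object."""
    parts = spec.split(":")
    if len(parts) == 1:
        return int(parts[0])
    if len(parts) > 3:
        raise ValueError(f"too many ':' in line spec: {spec!r}")
    # An empty field (as in ":10") means "use the default", i.e. None.
    numbers = [int(part) if part else None for part in parts]
    return slice(*numbers)


single = parse_line_spec("1000")
every_other = parse_line_spec("1000:1100:2")
```

The result can be used directly as a subscript, so `db[parse_line_spec("1000:1100:2")]` would behave like `db[1000:1100:2]`.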

How It Works

LineIndex creates a binary index file (.idx) containing the byte offset of each line in the file. This allows for O(1) access to any line by seeking directly to its byte position. The index is created once and reused for subsequent accesses.
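
The idea can be sketched in a few lines of standard-library Python. This is an illustration of the technique under stated assumptions, not the package's implementation. The function names are hypothetical, and the on-disk layout used here (one little-endian unsigned 64-bit offset per line) is an assumption, since the actual `.idx` format is not described above.

```python
import struct

OFFSET_SIZE = 8  # bytes per entry: one unsigned 64-bit integer (assumed layout)


def build_offsets(path):
    """Return the byte offset at which each line of `path` starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for raw_line in handle:
            offsets.append(position)
            position += len(raw_line)
    return offsets


def save_index(offsets, index_path):
    """Write the offsets as fixed-width little-endian integers."""
    with open(index_path, "wb") as handle:
        handle.write(struct.pack(f"<{len(offsets)}Q", *offsets))


def read_line(path, index_path, line_number):
    """Fetch one line with two seeks: one in the index, one in the data."""
    if line_number < 0:
        raise IndexError("line number out of range")
    with open(index_path, "rb") as index:
        index.seek(line_number * OFFSET_SIZE)
        entry = index.read(OFFSET_SIZE)
    if len(entry) < OFFSET_SIZE:
        raise IndexError("line number out of range")
    (offset,) = struct.unpack("<Q", entry)
    with open(path, "rb") as data:
        data.seek(offset)
        return data.readline().rstrip(b"\r\n").decode("utf-8")
```

Because every index entry has the same width, the entry for line `n` sits at byte `n * 8` of the index. That fixed width is what makes the lookup independent of the file's length.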

For compressed files, LineIndex uses the BGZF format (via the idzip package) which preserves random access capabilities despite compression.

Performance

LineIndex is designed for high performance:

  • Uses memory mapping for efficient file access
  • Employs vectorized NumPy operations for batch retrieval
  • Supports multi-threaded line fetching
  • Optimizes disk access patterns
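
Two of the techniques listed above, memory mapping and multi-threaded fetching, can be combined as in the sketch below. It uses only the standard library and is not the package's code. The names `line_offsets` and `fetch_lines` are hypothetical, and the sketch rescans the file on every call where a real index would be built once and reused.

```python
import mmap
from concurrent.futures import ThreadPoolExecutor


def line_offsets(buffer):
    """Scan a bytes-like buffer once and record where each line starts."""
    offsets = [0] if len(buffer) else []
    position = buffer.find(b"\n")
    while position != -1 and position + 1 < len(buffer):
        offsets.append(position + 1)
        position = buffer.find(b"\n", position + 1)
    return offsets


def fetch_lines(path, line_numbers, workers=4):
    """Return the requested lines, in the order they were requested."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; mmap raises ValueError on an empty file.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            offsets = line_offsets(view)
            ends = offsets[1:] + [len(view)]

            def fetch(number):
                # Slicing the map copies only the bytes of this one line.
                raw = view[offsets[number]:ends[number]]
                return raw.rstrip(b"\r\n").decode("utf-8")

            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(fetch, line_numbers))
```

Slicing a memory map does not move a shared file position, so the worker threads do not need a lock. How much threads help in practice depends on the workload, since decoding in pure Python is still serialized by the interpreter lock.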

Use Cases

  • Log Analysis: Quickly access specific log entries by line number
  • Data Processing: Extract samples from large datasets without loading everything
  • Text Mining: Randomly access lines for batch processing
  • Machine Learning: Efficiently retrieve training examples from large text corpora

Requirements

  • Python 3.8 or higher
  • NumPy
  • tqdm (for progress bars)
  • python-idzip (optional, for compression support)

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

