Fast line-based random access to large text files with optional compression

LineIndex: Fast Line-based Random Access for Text Files

LineIndex provides fast random access to individual lines in large text files by building a byte-offset index once and reusing it. It is designed for very large files where you frequently need specific lines without reading the entire file.

Key Features

  • O(1) Random Access: Get any line by its number in constant time
  • Memory Efficient: Uses memory mapping and lazy loading
  • Optional Compression: Transparently handles BGZF compressed files
  • Parallel Processing: Multi-threaded line retrieval for batch operations
  • Simple API: Clean Pythonic interface with slice notation support
  • Command Line Tool: Easy access from shell scripts
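
To make the slice-notation feature concrete, here is a minimal sketch of how a line container can accept both integers and slices. TinyIndex is a hypothetical stand-in written for this illustration; it works on an in-memory bytes object and is not LineIndex's actual implementation.

```python
class TinyIndex:
    """Hypothetical stand-in: indexes an in-memory bytes object by line."""

    def __init__(self, data: bytes):
        self._data = data
        # Line 0 starts at byte 0; every later line starts right after a newline.
        self._offsets = [0] if data else []
        pos = data.find(b"\n")
        while pos != -1 and pos + 1 < len(data):
            self._offsets.append(pos + 1)
            pos = data.find(b"\n", pos + 1)

    def __len__(self):
        return len(self._offsets)

    def _line(self, n):
        start = self._offsets[n]
        end = self._offsets[n + 1] if n + 1 < len(self._offsets) else len(self._data)
        return self._data[start:end].rstrip(b"\r\n").decode("utf-8")

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() resolves missing or negative bounds and the step.
            return [self._line(i) for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("line number out of range")
        return self._line(key)
```

With this in place, idx[5], idx[2:8:2], and idx[-1] all go through the same __getitem__ method, which is the standard Python mechanism behind slice support.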

Installation

# Basic installation
pip install lineindex

# With compression support
pip install lineindex[compression]

# For developers
pip install lineindex[dev]

First Test

# Creates example.txt with 1000 lines in current dir
lineindex example

# Indexes the file
lineindex example.txt

# Fetches line 5
lineindex example.txt 5

# Fetches lines 10 to 20
lineindex example.txt 10:20

Quick Start

Python API

from lineindex import LineIndex

# Create an index for a large file
db = LineIndex("bigfile.txt")

# Get a single line
line = db[1000]  # get the 1001st line (0-indexed)

# Get a range of lines
lines = db[1000:1010]  # get 10 lines

# Get every other line in a range
lines = db[1000:1100:2]  # get every other line

# Use parallel processing for better performance with large slices
lines = db.get(slice(1000, 2000), workers=-1)  # use all available CPU cores

# With header skipping (useful for CSV files)
db = LineIndex("data.csv", header=True)
first_data_row = db[0]  # skips the header row

# With compression
db = LineIndex("bigfile.txt", compress=True)  # Creates bigfile.txt.dz

# Using the example module (creates example.txt in current directory)
from lineindex import example
db = LineIndex("example.txt")  # Use the auto-created example file

# Or create a custom example
from lineindex.example import create_example_file
create_example_file("custom.txt", num_lines=5000)
db = LineIndex("custom.txt")

Command Line Interface

# Index and compress a file
lineindex bigfile.txt --compress

# Get a single line
lineindex bigfile.txt 1000

# Get a range of lines
lineindex bigfile.txt 1000:1010

# Get every other line with line numbers
lineindex bigfile.txt 1000:1100:2 --line-numbers

# Use multiple threads for better performance
lineindex bigfile.txt 1000:2000 --threads 4

# Skip header line (useful for CSV files)
lineindex data.csv 0 --header

# Create an example file with 1000 lines
lineindex example

# Create an example file with custom number of lines
lineindex example --lines 5000 --output my_example.txt

Note: The commands above omit the explicit file command, which remains optional for backward compatibility; for example, lineindex bigfile.txt 1000 works as shown.

How It Works

LineIndex creates a binary index file (.idx) containing the byte offset of each line in the file. This allows for O(1) access to any line by seeking directly to its byte position. The index is created once and reused for subsequent accesses.
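
The offset-index idea can be sketched with the standard library alone. The function names build_offsets and read_line are made up for this illustration, and the sketch keeps the offsets in a Python list rather than in an .idx file, so treat it as an outline of the technique, not the library's code.

```python
import os
import tempfile


def build_offsets(path):
    """Return a list with the starting byte offset of every line in `path`."""
    offsets = []
    pos = 0
    with open(path, "rb") as f:
        for raw in f:  # one sequential pass over the file
            offsets.append(pos)
            pos += len(raw)
    return offsets


def read_line(path, offsets, n):
    """Seek straight to line `n` (0-indexed) and return it without the newline."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().rstrip(b"\r\n").decode("utf-8")


def demo():
    """Write a 100-line file, index it, and fetch one line by number."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            for i in range(100):
                f.write(f"line {i}\n")
        offsets = build_offsets(path)
        return read_line(path, offsets, 42), len(offsets)
```

Building the index costs one full scan; after that, each lookup is a list access plus a single seek, which is where the constant-time behavior comes from.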

For compressed files, LineIndex uses the BGZF format (via the idzip package) which preserves random access capabilities despite compression.
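
Compression and random access can coexist when data is compressed in independent blocks. The sketch below illustrates that general idea with zlib; it is not the format idzip writes, and BLOCK_SIZE, compress_blocks, and read_range are names and values assumed for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (illustrative value)


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress `data` block by block; return (blob, start of each compressed block)."""
    blob = bytearray()
    starts = []
    for i in range(0, len(data), block_size):
        starts.append(len(blob))
        blob += zlib.compress(data[i:i + block_size])
    starts.append(len(blob))  # sentinel: end of the last block
    return bytes(blob), starts


def read_range(blob, starts, offset, length, block_size=BLOCK_SIZE):
    """Return `length` uncompressed bytes at `offset`, inflating only the blocks touched."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(starts) - 2)
    chunks = [zlib.decompress(blob[starts[b]:starts[b + 1]])
              for b in range(first, last + 1)]
    skip = offset - first * block_size  # position of `offset` inside the first block
    return b"".join(chunks)[skip:skip + length]
```

Because each block inflates on its own, reading one line costs at most a couple of block decompressions instead of decompressing the file from the start.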

Performance

LineIndex is designed for high performance:

  • Uses memory mapping for efficient file access
  • Employs vectorized NumPy operations for batch retrieval
  • Supports multi-threaded line fetching
  • Optimizes disk access patterns
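
As a rough illustration of two of these points, the following sketch combines a read-only memory map with a thread pool for batch retrieval. It uses only the standard library (no NumPy), and line_offsets, fetch, and fetch_many are hypothetical helper names, not part of the LineIndex API.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def line_offsets(mm):
    """Scan a memory map for newlines and return the start offset of each line."""
    offsets = [0] if len(mm) else []
    pos = mm.find(b"\n")
    while pos != -1 and pos + 1 < len(mm):
        offsets.append(pos + 1)
        pos = mm.find(b"\n", pos + 1)
    return offsets


def fetch(mm, offsets, n):
    """Slice line `n` out of the map; the OS pages in only the bytes touched."""
    start = offsets[n]
    end = offsets[n + 1] if n + 1 < len(offsets) else len(mm)
    return mm[start:end].rstrip(b"\r\n").decode("utf-8")


def fetch_many(mm, offsets, numbers, workers=4):
    """Fetch several lines concurrently; results keep the order of `numbers`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: fetch(mm, offsets, n), numbers))


def demo():
    """Write a 1000-line file, map it, and fetch three lines in one batch."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        with open(path, "wb") as f:
            f.write(b"".join(f"row {i}\n".encode() for i in range(1000)))
        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offsets = line_offsets(mm)
            return len(offsets), fetch_many(mm, offsets, [0, 500, 999])
```

Threads in CPython mostly help here when lookups wait on disk reads; for data already in the page cache, a single thread may be just as fast.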

Use Cases

  • Log Analysis: Quickly access specific log entries by line number
  • Data Processing: Extract samples from large datasets without loading everything
  • Text Mining: Randomly access lines for batch processing
  • Machine Learning: Efficiently retrieve training examples from large text corpora
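
For the sampling and training-data use cases, one possible pattern is to shuffle line numbers rather than lines, then fetch each batch by number. The helper below is a sketch; sample_batches is a name invented for this example and it only produces line numbers, leaving the actual lookup to whatever index you use.

```python
import random


def sample_batches(num_lines, batch_size, seed=0):
    """Yield lists of shuffled line numbers covering every line exactly once."""
    order = list(range(num_lines))
    random.Random(seed).shuffle(order)  # seeded for reproducible epochs
    for i in range(0, num_lines, batch_size):
        # Sorting within a batch keeps the reads closer to sequential on disk.
        yield sorted(order[i:i + batch_size])
```

Each yielded batch can then be passed line by line to an index object such as the db shown in Quick Start, for example [db[i] for i in batch].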

Requirements

  • Python 3.8 or higher
  • NumPy
  • tqdm (for progress bars)
  • python-idzip (optional, for compression support)

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Source Distribution

lineindex-0.1.4.tar.gz (13.8 kB view details)

Uploaded Source

Built Distribution

lineindex-0.1.4-py3-none-any.whl (10.5 kB view details)

Uploaded Python 3

File details

Details for the file lineindex-0.1.4.tar.gz.

File metadata

  • Download URL: lineindex-0.1.4.tar.gz
  • Upload date:
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for lineindex-0.1.4.tar.gz
Algorithm Hash digest
SHA256 67713a4f0a4341aaec9e8cf439e307f066d5de9beec749234a15a5b52a9faeca
MD5 0ba720523a93ad960a4775b7718698f5
BLAKE2b-256 6a0ae7a38ce8a2115053e6dc91337b4b3ea0bbcc6f9e27365da894ff8c2932b6

File details

Details for the file lineindex-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: lineindex-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 10.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.18

File hashes

Hashes for lineindex-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 b9d0da56a8ef52b76275d148ed7bb324e70e51bae7e387cfca04e50a8b4f1c4a
MD5 eafa8f44302bd42a9c3b011476936ac3
BLAKE2b-256 6b140cff784bcece6b53cd2dc8b8eba7c14e8e007afcdd497aebadfb07b59bc4
