
indxr: A Python utility for indexing long files.


⚡️ Introduction

indxr [ˈɪnˌdɛksər] is a Python utility that lets you dynamically access specific lines of a file without loading the entire file into memory. In other words, indxr lets you use your disk as a RAM extension without noticeable slowdowns, especially with SSDs and NVMe drives.

For example, given a 10M-line JSONl file and a 2018 MacBook Pro, reading any specific line takes less than 10 µs, reading 1k non-contiguous lines takes less than 10 ms, reading 1k contiguous lines takes less than 2 ms, and iterating over the entire file in batches of 32 lines takes less than 20 s (64 µs per batch).

indxr can be particularly useful for dynamically loading data from large datasets with a low memory footprint and without slowing down downstream tasks, such as data processing and neural network training.
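
The general idea behind this kind of random access can be sketched with the standard library alone: scan the file once to record the byte offset at which each line starts, then seek directly to an offset whenever a line is requested. The snippet below is only an illustration of that technique under simple assumptions (UTF-8 text, one record per line); the helper names `build_line_offsets` and `read_line` are made up for this sketch and are not part of indxr, whose internals may differ.

```python
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, i):
    """Read only line `i` (zero-based) by seeking straight to its offset."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().rstrip(b"\r\n").decode("utf-8")


# Tiny demo on a temporary three-line file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    demo_path = tmp.name

demo_offsets = build_line_offsets(demo_path)
third_line = read_line(demo_path, demo_offsets, 2)
print(third_line)

os.remove(demo_path)
```

Only the list of offsets stays in memory, which is why the footprint remains small even for very long files.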

For an overview, follow the Usage section.

🔌 Installation

pip install indxr

💡 Usage

TXT

from indxr import Indxr

index = Indxr("sample.txt")

# First line of sample.txt
index[0]

# List containing the second and third lines of sample.txt
index[1:3]

# First line of sample.txt
index.get("0")

# List containing the third and second lines of sample.txt
index.mget(["2", "1"])

JSONl

from indxr import Indxr

index = Indxr("sample.jsonl", key_id="id")  # key_id="id" is the default

# JSON object at line 42 as Python Dictionary
# Reads only the 42nd line
index[42]

# JSON objects at lines 42, 43, and 44 as Python Dictionaries
# Reads only the 42nd, 43rd, and 44th lines
index[42:45]

# JSON object with id="id_123" as Python Dictionary,
# Reads only the line where the JSON object is located
index.get("id_123")

# Same as `get` but for multiple JSON objects
index.mget(["id_123", "id_321"])
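
To make the key-based lookup more concrete, here is a rough sketch of how an ID-to-offset mapping could be built for a JSONl file with the standard library. It is an illustration of the general approach, not indxr's actual code: the functions `build_key_offsets` and `get_record` are hypothetical, and the sketch assumes every line is a JSON object containing the `key_id` field.

```python
import json
import os
import tempfile


def build_key_offsets(path, key_id="id"):
    """Map each record's ID to the byte offset of the line that holds it."""
    offsets = {}
    position = 0
    with open(path, "rb") as f:
        for line in f:
            record = json.loads(line)
            offsets[str(record[key_id])] = position
            position += len(line)
    return offsets


def get_record(path, offsets, key):
    """Read and parse only the line where the requested record is stored."""
    with open(path, "rb") as f:
        f.seek(offsets[key])
        return json.loads(f.readline())


# Tiny demo on a temporary two-record file
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(json.dumps({"id": "id_123", "text": "first"}) + "\n")
    tmp.write(json.dumps({"id": "id_321", "text": "second"}) + "\n")
    demo_path = tmp.name

key_offsets = build_key_offsets(demo_path)
record = get_record(demo_path, key_offsets, "id_321")
print(record)

os.remove(demo_path)
```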

CSV / TSV / ...

from indxr import Indxr

index = Indxr(
  "sample.csv",
  delimiter=",",    # Default value. Automatically switched to `\t` for `.tsv` files.
  fieldnames=None,  # Default value. List of fieldnames. Overrides header, if any.
  has_header=True,  # Default value. If `True`, treats first line as header.
  return_dict=True, # Default value. If `True`, returns Python Dictionary, string otherwise.
  key_id="id",      # Default value. Same as for JSONl. Ignored if return_dict is `False`.
)

# Line 42 as Python Dictionary
index[42]

# Lines 42, 43, and 44 as Python Dictionaries
index[42:45]

# Line with id="id_123" as Python Dictionary
index.get("id_123")

# Same as `get` but for multiple lines
index.mget(["id_123", "id_321"])
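
As a rough illustration of what returning a line as a dictionary involves, a single raw CSV line can be parsed and paired with the field names using Python's `csv` module. This is a sketch under the assumption that the field names come from the header line; `parse_csv_line` is a hypothetical helper, not an indxr function.

```python
import csv


def parse_csv_line(line, fieldnames, delimiter=","):
    """Parse one raw CSV line and pair its values with the given field names."""
    values = next(csv.reader([line], delimiter=delimiter))
    return dict(zip(fieldnames, values))


# Field names taken from a header line, as with has_header=True
header = "id,title,year"
fieldnames = next(csv.reader([header]))

# Quoted fields may contain the delimiter; csv.reader handles that
row = parse_csv_line('id_123,"Hello, world",2018', fieldnames)
print(row)
```

Note that values parsed this way are all strings; any type conversion would have to happen afterwards, for example in a callback.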

Custom

from indxr import Indxr

# The file must have multiple lines
index = Indxr("sample.something")

# First line of sample.something in bytes
index[0]

# List containing the second and third lines of sample.something in bytes
index[1:3]

# First line of sample.something in bytes
index.get("0")

# List containing the third and second lines of sample.something in bytes
index.mget(["2", "1"])

Callback (works with every file-type)

from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())

# First line of sample.txt split into a list
index.get("0")

Write / Read Index

from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())

path = "sample.index"  # Hypothetical example path, chosen for illustration
index.write(path)  # Write index to disk

# Read index from disk, callback must be re-defined
index = Indxr.read(path, callback=lambda x: x.split())

Usage example with PyTorch Dataset

In this example, we want to build a PyTorch Dataset that returns a query and two documents, one positive and one negative, for training a neural retriever. The data is stored in two files, queries.jsonl and documents.jsonl. The first file contains queries and the second file contains documents. Each query has a list of associated positive and negative documents. With Indxr, we avoid loading the entire dataset into memory and instead load data dynamically, without slowing down the training process.

import random

from indxr import Indxr
from torch.utils.data import DataLoader, Dataset

class CustomDataset(Dataset):
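    def __init__(self):
        # key_id must match the ID field used in each file (see the samples below)
        self.queries = Indxr("queries.jsonl", key_id="q_id")
        self.documents = Indxr("documents.jsonl", key_id="doc_id")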

    def __getitem__(self, index: int):
        # Get query ------------------------------------------------------------
        query = self.queries[index]

        # Sampling -------------------------------------------------------------
        pos_doc_id = random.choice(query["pos_doc_ids"])
        neg_doc_id = random.choice(query["neg_doc_ids"])

        # Get docs -------------------------------------------------------------
        pos_doc = self.documents.get(pos_doc_id)
        neg_doc = self.documents.get(neg_doc_id)

        # The outputs must be batched and transformed to
        # meaningful tensors using a DataLoader and
        # a custom collator function
        return query["text"], pos_doc["text"], neg_doc["text"]

    def __len__(self):
        return len(self.queries)


def collator_fn(batch):
    # Extract data -------------------------------------------------------------
    queries = [x[0] for x in batch]
    pos_docs = [x[1] for x in batch]
    neg_docs = [x[2] for x in batch]

    # Texts tokenization -------------------------------------------------------
    # `tokenizer` is assumed to be defined elsewhere
    queries = tokenizer(queries)    # Returns PyTorch Tensor
    pos_docs = tokenizer(pos_docs)  # Returns PyTorch Tensor
    neg_docs = tokenizer(neg_docs)  # Returns PyTorch Tensor

    return queries, pos_docs, neg_docs


dataloader = DataLoader(
    dataset=CustomDataset(),
    collate_fn=collator_fn,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    prefetch_factor=4,
)

Each line of queries.jsonl is as follows:

{
  "q_id": "q321",
  "text": "lorem ipsum",
  "pos_doc_ids": ["d2789822", "d2558037", "d2594098"],
  "neg_doc_ids": ["d3931445", "d4652233", "d191393", "d3692918", "d3051731"]
}

Each line of documents.jsonl is as follows:

{
  "doc_id": "d123",
  "text": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
}
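
To try the example without a real dataset, small files in the formats shown above can be generated with the standard library. The sketch below is only a convenience for experimentation: the record contents, the helper `write_jsonl`, and the use of a temporary directory are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Ten toy documents and five toy queries that reference them
documents = [
    {"doc_id": f"d{i}", "text": f"document number {i}"} for i in range(10)
]
queries = [
    {
        "q_id": f"q{i}",
        "text": f"query number {i}",
        "pos_doc_ids": [f"d{i}"],
        "neg_doc_ids": [f"d{(i + 1) % 10}", f"d{(i + 2) % 10}"],
    }
    for i in range(5)
]

with tempfile.TemporaryDirectory() as data_dir:
    queries_path = os.path.join(data_dir, "queries.jsonl")
    documents_path = os.path.join(data_dir, "documents.jsonl")
    write_jsonl(queries_path, queries)
    write_jsonl(documents_path, documents)

    # Read the queries back to check the one-object-per-line layout
    with open(queries_path, encoding="utf-8") as f:
        loaded_queries = [json.loads(line) for line in f]

print(loaded_queries[0])
```

The temporary directory is deleted when the `with` block ends; to keep the files for the Dataset example, write them to your working directory instead.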

🎁 Feature Requests

Would you like to see other features implemented? Please, open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please, drop me an e-mail.

📄 License

indxr is open-source software licensed under the MIT license.
