indxr: A Python utility for indexing long files.
⚡️ Introduction
indxr is a Python utility for indexing long files. It lets you read specific lines on demand instead of loading the whole file into RAM.
indxr is particularly useful for managing large datasets split across multiple files and for loading data dynamically with a low memory footprint.
For an overview, see the Usage section.
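The idea behind this kind of line indexing can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not indxr's actual implementation: the helper names `build_line_offsets` and `read_line` are hypothetical. It scans the file once to record the byte offset at which each line starts, then uses `seek` to read a single line on demand.

```python
import os
import tempfile


def build_line_offsets(path):
    """Scan the file once and record the byte offset where each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)
    return offsets


def read_line(path, offsets, i):
    """Jump straight to line i and read only that line."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("utf-8").rstrip("\n")


# Small demo file so the sketch is self-contained.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sample.txt")
with open(demo_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("first line\nsecond line\nthird line\n")

offsets = build_line_offsets(demo_path)
third = read_line(demo_path, offsets, 2)
```

Only the list of offsets stays in memory, one integer per line, while the file contents remain on disk until a line is requested.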
🔌 Installation
pip install indxr
💡 Usage
TXT
from indxr import Indxr
index = Indxr("sample.txt")
index[0]
>>> # First line of sample.txt
index.get("0")
>>> # First line of sample.txt
index.mget(["2", "1"])
>>> # List containing the third and second lines of sample.txt
JSONl
from indxr import Indxr
index = Indxr("sample.jsonl", key_id="id") # key_id="id" is the default
# Returns the JSON object at line 43 as a Python dictionary,
# reading only the 43rd line
index[42]
# Returns the JSON object with id="id_123" as Python Dictionary
# reading only the line of "sample.jsonl" where it is located
index.get("id_123")
# Same as get but for multiple JSON objects
index.mget(["id_123", "id_321"])
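Lookup by key, as with `get("id_123")` above, can be pictured as a dictionary from each record's key to the byte offset of its line. The sketch below is an assumption about the general technique, not indxr's internals: `build_key_index` and `get_record` are hypothetical helpers, and the demo records are made up for illustration.

```python
import json
import os
import tempfile


def build_key_index(path, key_id="id"):
    """Map each record's key to the byte offset of the line holding it."""
    key_to_offset = {}
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            record = json.loads(line)
            key_to_offset[str(record[key_id])] = pos
    return key_to_offset


def get_record(path, key_to_offset, key):
    """Seek to the stored offset and parse only that line."""
    with open(path, "rb") as f:
        f.seek(key_to_offset[key])
        return json.loads(f.readline())


# Made-up demo records so the sketch is self-contained.
jsonl_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(jsonl_dir, "sample.jsonl")
demo_records = [{"id": "id_123", "text": "foo"}, {"id": "id_321", "text": "bar"}]
with open(jsonl_path, "w", encoding="utf-8", newline="\n") as f:
    for r in demo_records:
        f.write(json.dumps(r) + "\n")

key_index = build_key_index(jsonl_path)
record = get_record(jsonl_path, key_index, "id_321")
```

Fetching several keys at once, in the spirit of `mget`, would just repeat the seek-and-read step for each key.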
CSV / TSV / Custom
from indxr import Indxr
index = Indxr(
    "sample.csv",
    delimiter=",", # Default value. Automatically switched to `\t` for `.tsv` files.
    fieldnames=None, # Default value. List of fieldnames. Overrides header, if any.
    has_header=True, # Default value. If `True`, treats first line as header.
    return_dict=True, # Default value. If `True`, returns Python Dictionary, string otherwise.
    key_id="id", # Default value. Same as for JSONl. Ignored if return_dict is `False`.
)
# Returns line 43 as Python Dictionary
index[42]
# Returns the line with id="id_123" as Python Dictionary
index.get("id_123")
# Same as get but for multiple lines
index.mget(["id_123", "id_321"])
Callback (works with every file-type)
from indxr import Indxr
index = Indxr("sample.txt", callback=lambda x: x.split())
index.get("0")
>>> # First line of sample.txt split into a list
Write / Read Index
from indxr import Indxr
index = Indxr("sample.txt", callback=lambda x: x.split())
index.write(path) # Write index to disk
# Read index from disk, callback must be re-defined
index = Indxr.read(path, callback=lambda x: x.split())
Usage example with PyTorch Dataset
import random
from indxr import Indxr
from torch.utils.data import DataLoader, Dataset
class CustomDataset(Dataset):
    def __init__(self):
        # key_id must match the ID field used in each file (see the samples below)
        self.queries = Indxr("queries.jsonl", key_id="q_id")
        self.documents = Indxr("documents.jsonl", key_id="doc_id")
    def __getitem__(self, index: int):
        # Get query ------------------------------------------------------------
        query = self.queries[index]
        # Sampling -------------------------------------------------------------
        pos_doc_id = random.choice(query["pos_doc_ids"])
        neg_doc_id = random.choice(query["neg_doc_ids"])
        # Get docs -------------------------------------------------------------
        pos_doc = self.documents.get(pos_doc_id)
        neg_doc = self.documents.get(neg_doc_id)
        # The outputs must be batched and transformed to
        # meaningful tensors using a DataLoader and
        # a custom collator function
        return query["text"], pos_doc["text"], neg_doc["text"]
    def __len__(self):
        return len(self.queries)
def collate_fn(batch):
    # Extract data -------------------------------------------------------------
    queries = [x[0] for x in batch]
    pos_docs = [x[1] for x in batch]
    neg_docs = [x[2] for x in batch]
    # Texts tokenization -------------------------------------------------------
    # `tokenizer` is assumed to be defined elsewhere
    queries = tokenizer(queries) # Returns PyTorch Tensor
    pos_docs = tokenizer(pos_docs) # Returns PyTorch Tensor
    neg_docs = tokenizer(neg_docs) # Returns PyTorch Tensor
    return queries, pos_docs, neg_docs
dataloader = DataLoader(
    dataset=CustomDataset(),
    collate_fn=collate_fn,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)
Each line of the queries.jsonl file is as follows:
{
"q_id": "q321",
"text": "lorem ipsum",
"pos_doc_ids": ["d2789822", "d2558037", "d2594098"],
"neg_doc_ids": ["d3931445", "d4652233", "d191393", "d3692918", "d3051731"]
}
Each line of the documents.jsonl file is as follows:
{
"doc_id": "d123",
"text": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
}
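To try the dataset example on a small scale, files in the two formats above can be generated with the standard library. This is a minimal sketch: the records are made-up placeholders, `write_jsonl` is a hypothetical helper, and the files go to a temporary directory, so paths would need adjusting to match the example. Each record is serialized onto a single line, which is what the JSONL format requires.

```python
import json
import os
import tempfile

# Placeholder records following the field names shown above.
queries = [
    {
        "q_id": "q321",
        "text": "lorem ipsum",
        "pos_doc_ids": ["d123"],
        "neg_doc_ids": ["d456"],
    },
]
documents = [
    {"doc_id": "d123", "text": "Lorem ipsum dolor sit amet."},
    {"doc_id": "d456", "text": "Consectetuer adipiscing elit."},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


data_dir = tempfile.mkdtemp()
queries_path = os.path.join(data_dir, "queries.jsonl")
documents_path = os.path.join(data_dir, "documents.jsonl")
write_jsonl(queries_path, queries)
write_jsonl(documents_path, documents)
```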
🎁 Feature Requests
Would you like to see other features implemented? Please open a feature request.
🤘 Want to contribute?
Would you like to contribute? Please drop me an e-mail.
📄 License
indxr is open-source software licensed under the MIT license.