Toolkit for working with Common Index File Format (CIFF) files.
CIFF Toolkit
This repository contains a Python toolkit for working with Common Index File Format (CIFF) files.
Specifically, it provides a CiffReader and a CiffWriter for easily reading and writing CIFF files. It also provides a handful of CLI tools, for example for merging CIFF files or dumping the contents of a CIFF file.
Installation
To use the CIFF toolkit, install it from PyPI:
$ pip install ciff-toolkit
Usage
Reading
To read a CIFF file, you can use the CiffReader class. It returns the posting lists and documents as lazy generators, so operations that need to process large CIFF files do not need to load the entire index into memory.
The CiffReader can be used as a context manager, automatically opening files if a path is supplied as a str or pathlib.Path.
from ciff_toolkit.read import CiffReader

with CiffReader('./path/to/index.ciff') as reader:
    header = reader.read_header()

    for pl in reader.read_postings_lists():
        print(pl)

    for doc in reader.read_documents():
        print(doc)
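Because the posting lists arrive as a lazy generator, aggregate statistics can be computed in a single pass. The following is a minimal sketch of that pattern, not part of the toolkit: top_terms is a hypothetical helper, and it assumes each posting list exposes term and df attributes, as the PostingsList message in the CIFF protobuf schema is expected to. Stand-in objects are used so the example runs without an index file.

```python
import heapq
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def top_terms(postings_lists: Iterable, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k terms with the highest document frequency.

    Hypothetical helper: assumes every item has `term` and `df` attributes.
    """
    # heapq.nlargest consumes the iterable one item at a time and keeps
    # at most k candidates, so the full index is never held in memory.
    best = heapq.nlargest(k, postings_lists, key=lambda pl: pl.df)
    return [(pl.term, pl.df) for pl in best]


# Stand-in objects, used here instead of real PostingsList messages.
fake_postings = (
    SimpleNamespace(term=term, df=df)
    for term, df in [('a', 5), ('b', 9), ('c', 1)]
)
print(top_terms(fake_postings, k=2))  # [('b', 9), ('a', 5)]
```

With a real index, the generator returned by reader.read_postings_lists() would take the place of fake_postings, provided the attribute names match.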
Alternatively, the CiffReader also accepts iterables of bytes instead of file paths. This can be useful if, for instance, the index is in a remote location:
import requests
from ciff_toolkit.read import CiffReader

url = 'https://example.com/remote-index.ciff'
with CiffReader(requests.get(url, stream=True).iter_content(1024)) as reader:
    header = reader.read_header()
    ...
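To illustrate why an iterable of byte chunks is enough to read an index, here is a rough sketch of how length-delimited messages can be recovered from chunks of arbitrary size. It is not the toolkit's implementation: iter_delimited is a hypothetical name, and the sketch assumes that each CIFF message is stored as a protobuf message preceded by a varint length prefix. It yields raw payload bytes and leaves protobuf parsing aside.

```python
from typing import Iterable, Iterator


def iter_delimited(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield varint-length-prefixed payloads from an iterable of byte chunks."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while True:
            # Try to decode a base-128 varint length prefix from the buffer.
            length, shift, pos = 0, 0, 0
            complete = False
            while pos < len(buffer):
                byte = buffer[pos]
                length |= (byte & 0x7F) << shift
                pos += 1
                if not byte & 0x80:  # high bit clear: last byte of the varint
                    complete = True
                    break
                shift += 7
            if not complete or len(buffer) < pos + length:
                break  # prefix or payload incomplete: wait for the next chunk
            yield bytes(buffer[pos:pos + length])
            del buffer[:pos + length]
    if buffer:
        raise ValueError('stream ended in the middle of a message')


# Two small messages, delivered in 2-byte chunks.
stream = b'\x03abc' + b'\x02hi'
chunks = (stream[i:i + 2] for i in range(0, len(stream), 2))
print(list(iter_delimited(chunks)))  # [b'abc', b'hi']
```

Only one message plus the current chunk is buffered at a time, which is the property that makes streaming from a remote location practical.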
Writing
The CiffWriter offers a similar context manager API:
from ciff_toolkit.ciff_pb2 import Header, PostingsList, DocRecord
from ciff_toolkit.write import CiffWriter

header: Header = ...
postings_lists: list[PostingsList] = ...
doc_records: list[DocRecord] = ...

with CiffWriter('./path/to/index.ciff') as writer:
    writer.write_header(header)
    writer.write_postings_lists(postings_lists)
    writer.write_documents(doc_records)
Command Line Interface
The following CLI commands are supported:
- ciff_dump INPUT
  Dumps the contents of a CIFF file so that they can be inspected.
- ciff_merge INPUT... OUTPUT
  Merges two or more CIFF files into a single CIFF file. Ensures documents and terms are ordered correctly, and reads and writes in a streaming manner (i.e. it does not read all data into memory at once).
  Note: ciff_merge requires that the DocRecord messages occur before the PostingsList messages in the CIFF file, as it needs to remap the internal document identifiers before merging the posting lists. See ciff_swap below for more information on how to achieve that.
- ciff_swap --input-order [hpd|hdp] INPUT OUTPUT
  Swaps the PostingsList and DocRecord messages in a CIFF file (e.g. in order to prepare for merging). The --input-order argument specifies the current format of the CIFF file: hpd for header - posting lists - documents, and hdp for header - documents - posting lists.
- ciff_zero_index INPUT OUTPUT
  Takes a CIFF file with 1-indexed documents, and turns it into 0-indexed documents.
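The reason ciff_merge needs the documents before the posting lists can be seen in a small sketch. This is an illustration under simplifying assumptions, not the actual ciff_merge implementation: documents and posting lists are plain tuples rather than protobuf messages, the inputs are assumed to contain disjoint documents, and the names remap_docs and merge_postings are hypothetical. Documents are read first to build a mapping from old to new identifiers; the term-sorted posting lists can then be merged lazily.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def remap_docs(inputs):
    """Assign new consecutive docids across all inputs.

    `inputs` is a list of document lists, each holding (docid, name) tuples.
    Returns the merged documents and one old-to-new docid mapping per input.
    """
    merged, mappings = [], []
    for docs in inputs:
        mapping = {}
        for old_id, name in docs:
            mapping[old_id] = len(merged)
            merged.append((len(merged), name))
        mappings.append(mapping)
    return merged, mappings


def merge_postings(streams, mappings):
    """Lazily merge term-sorted streams of (term, [(docid, tf), ...])."""
    def remapped(stream, mapping):
        for term, postings in stream:
            yield term, [(mapping[docid], tf) for docid, tf in postings]

    # heapq.merge pulls from each sorted stream only as needed.
    merged = heapq.merge(
        *(remapped(s, m) for s, m in zip(streams, mappings)),
        key=itemgetter(0),
    )
    for term, group in groupby(merged, key=itemgetter(0)):
        # Only the postings of the current term are held in memory.
        yield term, sorted(p for _, plist in group for p in plist)


docs, maps = remap_docs([[(0, 'd1'), (1, 'd2')], [(0, 'd3')]])
a = iter([('apple', [(0, 1), (1, 2)]), ('pear', [(1, 1)])])
b = iter([('apple', [(0, 3)]), ('zebra', [(0, 1)])])
for term, postings in merge_postings([a, b], maps):
    print(term, postings)
```

If the posting lists came first in a file, the mapping would not exist yet when they are read, and they would have to be buffered; swapping the sections beforehand avoids that.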
Development
This project uses Poetry to manage dependencies, configure the project and publish it to PyPI.
To get started, use Poetry to install all dependencies:
$ poetry install
Then, either activate the virtual environment to execute all Python code in the virtual environment, or prepend every command with poetry run.
$ poetry shell
(venv) $ ciff_dump index.ciff
or:
$ poetry run ciff_dump index.ciff