Toolkit for working with Common Index File Format (CIFF) files.
CIFF Toolkit
This repository contains a Python toolkit for working with Common Index File Format (CIFF) files.
Specifically, it provides a CiffReader and a CiffWriter for reading and writing CIFF files. It also provides a handful of CLI tools, for example to merge several CIFF files or to dump the contents of one.
Installation
To use the CIFF toolkit, install it from PyPI:
$ pip install ciff-toolkit
Usage
Reading
To read a CIFF file, you can use the CiffReader class. It returns the posting lists and documents as lazy generators, so operations that need to process large CIFF files do not need to load the entire index into memory.
The CiffReader can be used as a context manager, automatically opening files if a path is supplied as a str or pathlib.Path.
from ciff_toolkit.read import CiffReader

with CiffReader('./path/to/index.ciff') as reader:
    header = reader.read_header()

    for pl in reader.read_postings_lists():
        print(pl)

    for doc in reader.read_documents():
        print(doc)
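To illustrate why lazy generators matter, here is a minimal sketch of a computation that consumes a stream of postings lists once and keeps only a few items in memory. It does not import the toolkit: `FakePostingsList` and `fake_stream` are hypothetical stand-ins for the objects yielded by `reader.read_postings_lists()`, and the attribute names `term` and `df` are an assumption based on the CIFF protobuf schema, so check them against `ciff_pb2` before relying on them.

```python
import heapq
from collections import namedtuple
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a postings list; only `term` and `df` are used.
FakePostingsList = namedtuple("FakePostingsList", ["term", "df"])


def most_frequent_terms(postings_lists: Iterable, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k (term, df) pairs with the highest document frequency.

    heapq.nlargest consumes the iterable once and retains at most k items,
    so the full stream is never held in memory.
    """
    pairs = ((pl.term, pl.df) for pl in postings_lists)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])


def fake_stream() -> Iterator[FakePostingsList]:
    """Generator standing in for a lazily read stream of postings lists."""
    yield FakePostingsList("apple", 12)
    yield FakePostingsList("banana", 3)
    yield FakePostingsList("cherry", 40)
    yield FakePostingsList("date", 7)


print(most_frequent_terms(fake_stream(), k=2))
# → [('cherry', 40), ('apple', 12)]
```

With a real index, the same function could presumably be called on the generator returned by the reader inside the `with` block, provided the attribute names match.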
Writing
The CiffWriter offers a similar context manager API:
from ciff_toolkit.ciff_pb2 import Header, PostingsList, DocRecord
from ciff_toolkit.write import CiffWriter

header: Header = ...
postings_lists: list[PostingsList] = ...
doc_records: list[DocRecord] = ...

with CiffWriter('./path/to/index.ciff') as writer:
    writer.write_header(header)
    writer.write_postings_lists(postings_lists)
    writer.write_documents(doc_records)
Command Line Interface
A couple of CLI commands are supported:
- ciff_dump INPUT
  Dumps the contents of a CIFF file for inspection.
- ciff_merge [-d,--description DESCRIPTION] INPUT... OUTPUT
  Merges two or more CIFF files into a single CIFF file. Ensures documents and terms are ordered correctly, and reads and writes in a streaming manner (i.e. it does not read all data into memory at once). If a document identifier appears in multiple CIFF files, only the last occurrence is retained.
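The merge behaviour described above (streaming, ordered output, last occurrence wins) can be sketched with the standard library alone. This is not the toolkit's implementation: the record shape `(doc_id, payload)` and the function names `merge_last_wins` and `_tag` are hypothetical, and the sketch assumes each input stream is already sorted by document identifier.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Record = Tuple[int, str]  # (document identifier, payload); hypothetical shape


def _tag(stream: Iterable[Record], index: int) -> Iterator[Tuple[int, int, str]]:
    """Attach the input position to each record so ties can be ordered."""
    for doc_id, payload in stream:
        yield doc_id, index, payload


def merge_last_wins(*streams: Iterable[Record]) -> Iterator[Record]:
    """Merge sorted streams lazily; on duplicate ids keep the last input's record."""
    tagged = [_tag(stream, index) for index, stream in enumerate(streams)]
    pending = None
    # Sorting by (doc_id, input index) makes the last input's record arrive
    # last among duplicates, so it overwrites earlier ones in `pending`.
    for doc_id, _index, payload in heapq.merge(*tagged, key=lambda r: (r[0], r[1])):
        if pending is not None and pending[0] != doc_id:
            yield pending
        pending = (doc_id, payload)
    if pending is not None:
        yield pending


first = [(1, "a1"), (3, "a3")]
second = [(2, "b2"), (3, "b3")]
print(list(merge_last_wins(first, second)))
# → [(1, 'a1'), (2, 'b2'), (3, 'b3')]
```

Because `heapq.merge` pulls one record at a time from each input, memory use stays proportional to the number of inputs, not to their size.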
Development
This project uses Poetry to manage dependencies, configure the project and publish it to PyPI.
To get started, use Poetry to install all dependencies:
$ poetry install
Then, either activate the virtual environment to execute all Python code in the virtual environment, or prepend every command with poetry run.
$ poetry shell
(venv) $ ciff_dump index.ciff
or:
$ poetry run ciff_dump index.ciff