Toolkit for working with Common Index File Format (CIFF) files.

CIFF Toolkit

This repository contains a Python toolkit for working with Common Index File Format (CIFF) files.

Specifically, it provides a CiffReader and a CiffWriter for reading and writing CIFF files. It also provides a handful of CLI tools, for example to merge CIFF files or dump the contents of one.

Installation

To use the CIFF toolkit, install it from PyPI:

$ pip install ciff-toolkit

Usage

Reading

To read a CIFF file, you can use the CiffReader class. It returns the posting lists and documents as lazy generators, so operations that need to process large CIFF files do not need to load the entire index into memory.

The CiffReader can be used as a context manager, automatically opening files if a path is supplied as a str or pathlib.Path.

from ciff_toolkit.read import CiffReader

with CiffReader('./path/to/index.ciff') as reader:
    header = reader.read_header()

    for pl in reader.read_postings_lists():
        print(pl)

    for doc in reader.read_documents():
        print(doc)
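
Because the postings lists arrive lazily, aggregates can be computed in a single pass without holding the index in memory. The following is a minimal sketch, not part of the toolkit: top_terms_by_df is a hypothetical helper, and it assumes each postings list exposes term and df attributes as in the CIFF protobuf definition. Stand-in objects are used here so the example runs without an index file.

```python
import heapq
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def top_terms_by_df(postings_lists: Iterable, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k terms with the highest document frequency.

    The iterable is consumed once and only k candidates are retained,
    so a lazy generator of postings lists works as input.
    """
    # Assumption: every item has a `term` (str) and a `df` (int) attribute.
    best = heapq.nlargest(k, postings_lists, key=lambda pl: pl.df)
    return [(pl.term, pl.df) for pl in best]


# Stand-in data; with a real index, pass reader.read_postings_lists() instead.
example = (
    SimpleNamespace(term=term, df=df)
    for term, df in [("apple", 4), ("banana", 9), ("cherry", 1), ("date", 7)]
)
result = top_terms_by_df(example, k=2)
print(result)
```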

Alternatively, the CiffReader also accepts iterables of bytes instead of file paths. This could be useful if, for instance, the index is in a remote location:

import requests
from ciff_toolkit.read import CiffReader

url = 'https://example.com/remote-index.ciff'
with CiffReader(requests.get(url, stream=True).iter_content(1024)) as reader:
    header = reader.read_header()
    ...
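
An HTTP response is only one possible source of byte chunks. As a sketch, the hypothetical iter_chunks helper below (not part of the toolkit) turns any binary file-like object into such an iterable; it is demonstrated on an in-memory stream so that it runs on its own.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 1024) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes until EOF."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            return
        yield chunk


# Demonstration on an in-memory stream instead of a real index file.
chunks = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
print(chunks)
```

Assuming the reader treats this iterable the same way as the requests chunk iterator above, it could be used as CiffReader(iter_chunks(gzip.open('index.ciff.gz', 'rb'))) to read a compressed index without unpacking it first.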

Writing

The CiffWriter offers a similar context manager API:

from ciff_toolkit.ciff_pb2 import Header, PostingsList, DocRecord
from ciff_toolkit.write import CiffWriter

header: Header = ...
postings_lists: list[PostingsList] = ...
doc_records: list[DocRecord] = ...

with CiffWriter('./path/to/index.ciff') as writer:
    writer.write_header(header)
    writer.write_postings_lists(postings_lists)
    writer.write_documents(doc_records)

Command Line Interface

The following CLI commands are supported:

  • ciff_dump INPUT

    Dumps the contents of a CIFF file so that they can be inspected.

  • ciff_merge INPUT... OUTPUT

    Merges two or more CIFF files into a single CIFF file. Ensures documents and terms are ordered correctly, and reads and writes in a streaming manner (i.e. it does not load all data into memory at once).

    Note: ciff_merge requires that the DocRecord messages occur before the PostingsList messages in the CIFF file, as it needs to remap the internal document identifiers before merging the posting lists. See ciff_swap below for more information on how to achieve that.

  • ciff_swap --input-order [hpd|hdp] INPUT OUTPUT

    Swaps the PostingsList and DocRecord messages in a CIFF file (e.g. in order to prepare for merging). The --input-order argument specifies the current format of the CIFF file: hpd for header - posting lists - documents, and hdp for header - documents - posting lists.

  • ciff_zero_index INPUT OUTPUT

    Takes a CIFF file with 1-indexed documents, and turns it into 0-indexed documents.
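
The note on ciff_merge implies a two-step workflow for indexes stored in header - posting lists - documents order: swap each input first, then merge the results. The sketch below only builds the argument lists for that workflow, using the commands and the --input-order flag described above. The function name and the .hdp.ciff suffix for the reordered copies are hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import List, Sequence


def build_merge_commands(inputs: Sequence[str], output: str) -> List[List[str]]:
    """Build argv lists: one ciff_swap per input, then a single ciff_merge."""
    commands: List[List[str]] = []
    swapped: List[str] = []
    for path in inputs:
        # Hypothetical naming scheme for the reordered (hdp) copies.
        target = str(Path(path).with_suffix(".hdp.ciff"))
        # The inputs are assumed to be in hpd order, so that is what we declare.
        commands.append(["ciff_swap", "--input-order", "hpd", path, target])
        swapped.append(target)
    commands.append(["ciff_merge", *swapped, output])
    return commands


commands = build_merge_commands(["a.ciff", "b.ciff"], "merged.ciff")
for argv in commands:
    print(" ".join(argv))
```

With the toolkit installed, each argument list could be executed with subprocess.run(argv, check=True).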

Development

This project uses Poetry to manage dependencies, configure the project and publish it to PyPI.

To get started, use Poetry to install all dependencies:

$ poetry install

Then either activate the virtual environment, so that all commands run inside it, or prefix each command with poetry run.

$ poetry shell
(venv) $ ciff_dump index.ciff

or:

$ poetry run ciff_dump index.ciff
