Toolkit for working with Common Index File Format (CIFF) files.

CIFF Toolkit

This repository contains a Python toolkit for working with Common Index File Format (CIFF) files.

Specifically, it provides a CiffReader and a CiffWriter for reading and writing CIFF files. It also provides a handful of CLI tools, for example for merging CIFF files or dumping the contents of a CIFF file.

Installation

To use the CIFF toolkit, install it from PyPI:

$ pip install ciff-toolkit

Usage

Reading

To read a CIFF file, you can use the CiffReader class. It returns the posting lists and documents as lazy generators, so operations that need to process large CIFF files do not need to load the entire index into memory.

The CiffReader can be used as a context manager, automatically opening files if a path is supplied as a str or pathlib.Path.

from ciff_toolkit.read import CiffReader

with CiffReader('./path/to/index.ciff') as reader:
    header = reader.read_header()

    for pl in reader.read_postings_lists():
        print(pl)

    for doc in reader.read_documents():
        print(doc)
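
Because the postings lists arrive as a lazy generator, aggregate statistics can be computed in a single pass without materialising the index. The following is a minimal sketch of that pattern, not part of the toolkit: `top_terms_by_df` is a hypothetical helper, and it assumes each postings list exposes `term` and `df` fields as in the CIFF protobuf definition. The stand-in records let the snippet run without an index file; with a real index you would pass `reader.read_postings_lists()` instead.

```python
import heapq
from collections import namedtuple
from typing import Iterable, List, Tuple


def top_terms_by_df(postings_lists: Iterable, k: int = 10) -> List[Tuple[str, int]]:
    """Return the k (term, df) pairs with the highest document frequency.

    heapq.nlargest consumes the iterable item by item and keeps only k
    candidates, so a lazy generator is never held in memory in full.
    Assumes every item has `term` and `df` attributes.
    """
    pairs = ((pl.term, pl.df) for pl in postings_lists)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])


# Stand-in records for demonstration only (hypothetical, not a toolkit class).
FakePostingsList = namedtuple("FakePostingsList", ["term", "df"])
demo = (
    FakePostingsList(term, df)
    for term, df in [("apple", 3), ("banana", 7), ("cherry", 5)]
)

top_two = top_terms_by_df(demo, k=2)
print(top_two)  # [('banana', 7), ('cherry', 5)]
```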

Writing

The CiffWriter offers a similar context manager API:

from ciff_toolkit.ciff_pb2 import Header, PostingsList, DocRecord
from ciff_toolkit.write import CiffWriter

header: Header = ...
postings_lists: list[PostingsList] = ...
doc_records: list[DocRecord] = ...

with CiffWriter('./path/to/index.ciff') as writer:
    writer.write_header(header)
    writer.write_postings_lists(postings_lists)
    writer.write_documents(doc_records)
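
The reader and writer methods shown above mirror each other, so one index can in principle be piped into another. The sketch below illustrates the idea using only the method names from the examples above; `copy_index`, `ListReader` and `ListWriter` are hypothetical names introduced here, and it is an assumption that the real `CiffWriter` methods accept generators rather than lists.

```python
def copy_index(reader, writer) -> None:
    """Stream header, postings lists and documents from reader to writer.

    The three parts are transferred in the same order used in the
    reading and writing examples: header, postings lists, documents.
    """
    writer.write_header(reader.read_header())
    writer.write_postings_lists(reader.read_postings_lists())
    writer.write_documents(reader.read_documents())


class ListReader:
    """In-memory stand-in with the three read methods (illustration only)."""

    def __init__(self, header, postings_lists, documents):
        self._header = header
        self._postings_lists = postings_lists
        self._documents = documents

    def read_header(self):
        return self._header

    def read_postings_lists(self):
        yield from self._postings_lists

    def read_documents(self):
        yield from self._documents


class ListWriter:
    """In-memory stand-in that records everything written to it."""

    def __init__(self):
        self.header = None
        self.postings_lists = []
        self.documents = []

    def write_header(self, header):
        self.header = header

    def write_postings_lists(self, postings_lists):
        self.postings_lists.extend(postings_lists)

    def write_documents(self, documents):
        self.documents.extend(documents)


source = ListReader("header", ["pl-a", "pl-b"], ["doc-1", "doc-2"])
sink = ListWriter()
copy_index(source, sink)
print(sink.header, sink.postings_lists, sink.documents)
```

With the real classes, the same function would presumably be called with a CiffReader and a CiffWriter opened in a shared with statement.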

Command Line Interface

Two CLI commands are supported:

  • ciff_dump INPUT

    Dumps the contents of a CIFF file for inspection.

  • ciff_merge [-d,--description DESCRIPTION] INPUT... OUTPUT

    Merges two or more CIFF files into a single CIFF file. Documents and terms are kept in the correct order, and data is read and written in a streaming manner (i.e. without loading everything into memory at once). If a document identifier appears in multiple CIFF files, only the last occurrence is retained.
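
The streaming, last-occurrence-wins behaviour described for ciff_merge can be illustrated with a k-way merge over sorted streams. This is a simplified sketch of the general technique, not the toolkit's actual implementation: `merge_last_wins` is a hypothetical helper, and it assumes each input is already sorted by document identifier and yields `(doc_id, record)` pairs.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Any, Iterable, Iterator, Tuple


def merge_last_wins(*streams: Iterable[Tuple[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Merge sorted (doc_id, record) streams lazily.

    When the same doc_id occurs in several streams, only the record
    from the stream given last is yielded.
    """

    def tagged(stream, index):
        # Attach the stream position so that ties on doc_id are ordered
        # by input position.
        for doc_id, record in stream:
            yield doc_id, index, record

    merged = heapq.merge(
        *(tagged(stream, index) for index, stream in enumerate(streams)),
        key=itemgetter(0, 1),
    )
    for doc_id, group in groupby(merged, key=itemgetter(0)):
        last = None
        for last in group:  # walk the duplicates, keeping only the final one
            pass
        yield doc_id, last[2]


first = [("d1", "from-first"), ("d3", "from-first")]
second = [("d1", "from-second"), ("d2", "from-second")]

merged_docs = list(merge_last_wins(first, second))
print(merged_docs)
# [('d1', 'from-second'), ('d2', 'from-second'), ('d3', 'from-first')]
```

heapq.merge only holds one pending item per input, so memory use grows with the number of inputs rather than with their size.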

Development

This project uses Poetry to manage dependencies, configure the project and publish it to PyPI.

To get started, use Poetry to install all dependencies:

$ poetry install

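Then either activate the virtual environment so that all commands run inside it, or prefix each command with poetry run. Note that on Poetry 2.x the poetry shell command is no longer built in and is provided by the poetry-plugin-shell plugin.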

$ poetry shell
(venv) $ ciff_dump index.ciff

or:

$ poetry run ciff_dump index.ciff
