PyCIFF
This package provides a Python interface to PISA's Common Index File Format (CIFF) import/export toolset, which is written in Rust.
Usage
Converting CIFF to PISA:
pyciff.ciff_to_pisa(input_file, output, generate_lexicons)
input_file is the input CIFF file.
output is the basename of the output PISA canonical files.
generate_lexicons is a Boolean flag; if True, the .termlex and .doclex files will be created.
Example (using the toy CIFF file stored in this repo):
$> cd tests
$> python -c "import pyciff; pyciff.ciff_to_pisa('toy-complete-20200309.ciff', 'my-pisa-files', False)"
----- CIFF HEADER -----
Version: 1
No. Postings Lists: 9
Total Postings Lists: 9
No. Documents: 3
Total Documents: 3
Total Terms in Collection 16
Average Document Length: 5.333333333333333
Description: Export of toy 3-document collection from Anserini's io.anserini.integration.TrecEndToEndTest test case
-----------------------
Processing postings
[00:00:00] [========================================] / (0s)
Processing document lengths
[00:00:00] [========================================] / (0s)
$> ls my-pisa-files.*
my-pisa-files.docs my-pisa-files.documents my-pisa-files.freqs my-pisa-files.sizes my-pisa-files.terms
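The same conversion can be driven from a script instead of a one-liner. The sketch below is illustrative only: expected_pisa_files and convert_and_check are hypothetical helpers, not part of pyciff, and the suffix list is taken from the directory listing above plus the .termlex and .doclex files described for generate_lexicons.

```python
from pathlib import Path

# Suffixes taken from the example listing above; .termlex and .doclex are
# only expected when generate_lexicons is True.
CANONICAL_SUFFIXES = ("docs", "documents", "freqs", "sizes", "terms")
LEXICON_SUFFIXES = ("termlex", "doclex")


def expected_pisa_files(output, generate_lexicons):
    """Return the file names the conversion is expected to create.

    Hypothetical helper: it only builds names, it does not touch the disk.
    """
    suffixes = CANONICAL_SUFFIXES + (LEXICON_SUFFIXES if generate_lexicons else ())
    return [f"{output}.{suffix}" for suffix in suffixes]


def convert_and_check(input_file, output, generate_lexicons=False):
    """Run ciff_to_pisa, then return the expected files that are missing."""
    import pyciff  # imported lazily so expected_pisa_files works without it

    pyciff.ciff_to_pisa(input_file, output, generate_lexicons)
    return [
        name
        for name in expected_pisa_files(output, generate_lexicons)
        if not Path(name).exists()
    ]
```

Calling convert_and_check('toy-complete-20200309.ciff', 'my-pisa-files') from the tests directory should return an empty list if every file in the listing above was written.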
Converting PISA to CIFF:
pyciff.pisa_to_ciff(collection_input, terms_input, titles_input, output, description)
collection_input is the basename of the (canonical PISA) collection.
terms_input is a newline-delimited file containing a single term per line (the first line is the 0-th postings list).
titles_input is a newline-delimited file containing a single document identifier per line (the first line is the 0-th document identifier).
output is the name of the CIFF file to output.
description is stored inside the CIFF blob, and can be used to describe the collection/parsing/etc.
Example using the example files created previously:
# Still working in `tests` directory
$> python3 -c "import pyciff; pyciff.pisa_to_ciff('my-pisa-files', 'my-pisa-files.terms', 'my-pisa-files.documents', 'my-new.ciff', 'My example description')"
Collecting posting lists statistics
[00:00:00] [========================================] / (0s)
Computing average document length
[00:00:00] [========================================] / (0s)
Writing postings
[00:00:00] [========================================] / (0s)
$> ls my-new.ciff
my-new.ciff
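To make the 0-based line convention of the terms and titles files concrete, here is a minimal sketch. Both functions are hypothetical helpers, not part of pyciff; the file naming (basename plus .terms and .documents) follows the example above, and UTF-8 encoding is an assumption.

```python
def read_indexed_lines(path):
    """Map each 0-based line number to the content of that line.

    For a terms file, key 0 is the term of the 0-th postings list; for a
    titles file, key 0 is the 0-th document identifier.
    UTF-8 is an assumption, not something the source specifies.
    """
    with open(path, encoding="utf-8") as handle:
        return {index: line.rstrip("\n") for index, line in enumerate(handle)}


def pisa_to_ciff_args(basename, output, description):
    """Build the positional arguments for pisa_to_ciff (hypothetical helper).

    Assumes the terms and titles files sit next to the collection as
    <basename>.terms and <basename>.documents, as in the example above.
    """
    return (
        basename,
        f"{basename}.terms",
        f"{basename}.documents",
        output,
        description,
    )
```

With pyciff installed, the arguments can be unpacked directly: pyciff.pisa_to_ciff(*pisa_to_ciff_args('my-pisa-files', 'my-new.ciff', 'My example description')).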
Deployment
To upload to PyPI:
docker run --rm -v $(pwd):/io konstin2/maturin publish -u USER -p PASSWORD
To upload to Test PyPI:
docker run --rm -v $(pwd):/io konstin2/maturin publish -r https://test.pypi.org/legacy/ -u USER -p PASSWORD