Python-first active learning engine backed by libhicalengine

hical

hical helps you review large document collections more efficiently. Instead of working through a corpus uniformly, you give it a few relevant examples and review feedback, and it keeps reranking the remaining documents so the next batch is more likely to matter.

Its primary use is helping reviewers find more relevant documents faster. It is also useful for building labeled datasets and evaluation or test collections without judging the entire corpus by hand.

In practice, hical gives you a Python workflow to build a corpus dataset from raw documents or ir_datasets, open it, and run interactive review sessions.

The source lives in the CALEngine repository. The intended user-facing surface is the hical Python package.

Installation

Install the package from PyPI:

python -m pip install hical

If you want ir_datasets support:

python -m pip install "hical[datasets]"

If you are building from a source checkout, see docs/developer/development.md.

Quickstart

Create a tiny JSONL corpus:

cat > docs.jsonl <<'EOF'
{"id": "doc-1", "title": "Florida citrus", "body": "Oranges and groves across central Florida."}
{"id": "doc-2", "title": "Coastal cleanup", "body": "Shoreline cleanup and beach restoration projects."}
EOF
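
If you would rather generate the file from Python than from a shell heredoc, the sketch below writes the same two records. It uses only the standard library; the helper name `to_jsonl` is illustrative and not part of hical.

```python
import json

# The same two records as in the heredoc above.
DOCS = [
    {"id": "doc-1", "title": "Florida citrus",
     "body": "Oranges and groves across central Florida."},
    {"id": "doc-2", "title": "Coastal cleanup",
     "body": "Shoreline cleanup and beach restoration projects."},
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


if __name__ == "__main__":
    with open("docs.jsonl", "w", encoding="utf-8") as handle:
        handle.write(to_jsonl(DOCS))
```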

Create a minimal config:

cat > corpus.yaml <<'EOF'
input:
  format: jsonl
  path: ./docs.jsonl
  doc_id_field: id
  text_fields:
    - title
    - body
output:
  path: ./docs.bin
  min_df: 1
  build_threads: 2
  parallel_docs_per_chunk: 50000
  optimize_for_fast_load: true
EOF
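
Before building, it can help to confirm that every record actually carries the fields named in the config. The sketch below is an optional pre-flight check, not part of hical: `check_records` is a hypothetical helper, and whether the builder itself rejects such records is not stated here.

```python
import json


def check_records(lines, doc_id_field, text_fields):
    """Return a list of problems found in JSONL lines; an empty list means none."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        doc_id = record.get(doc_id_field)
        if doc_id is None:
            problems.append(f"line {number}: missing {doc_id_field!r}")
        elif doc_id in seen_ids:
            problems.append(f"line {number}: duplicate id {doc_id!r}")
        else:
            seen_ids.add(doc_id)
        # A record needs non-empty text in at least one configured field.
        if not any(record.get(field) for field in text_fields):
            problems.append(f"line {number}: no text in {text_fields}")
    return problems


if __name__ == "__main__":
    with open("docs.jsonl", encoding="utf-8") as handle:
        for problem in check_records(handle, "id", ["title", "body"]):
            print(problem)
```

The field names passed in mirror `doc_id_field` and `text_fields` from corpus.yaml above.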

Build the corpus:

hical-build-corpus --config corpus.yaml

Open the dataset and start reviewing:

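import hical

# Open the corpus produced by hical-build-corpus.
dataset = hical.open_dataset("docs.bin")

# Start a review session from a relevant seed.
session = dataset.start_session(
    relevant_seeds=["florida oranges"],
    review_batch_size=2,
    retraining="auto",
    retrain_every_n_judgments=2,
    session_id="demo-review",
)

# Fetch the next batch and print each document's id and score.
batch = session.next_batch()
for item in batch:
    print(item.doc_id, item.score)

# Mark the first document in the batch as relevant.
session.judge_relevant(batch[0])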

The normal flow is:

  1. build a .bin
  2. open it as a Dataset
  3. start a Session
  4. fetch documents and record judgments

For the purpose and workflow at a higher level, see docs/overview.md.
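
Step 4 usually repeats: fetch a batch, judge it, fetch the next. The sketch below wraps that loop using only the calls shown in the quickstart (`session.next_batch()`, `session.judge_relevant(item)`, `item.doc_id`). The `review` function and its `is_relevant` callback are hypothetical stand-ins for a human reviewer. Treating an empty batch as "nothing left" is an assumption, and because this README does not show how to record a non-relevant judgment, the sketch records only relevant ones.

```python
def review(session, is_relevant, max_batches=10):
    """Pull batches from a session and record the relevant judgments.

    `session` is any object with next_batch() and judge_relevant(item),
    as in the quickstart. `is_relevant` is a hypothetical callback that
    stands in for the human reviewer.
    """
    relevant_ids = []
    for _ in range(max_batches):
        batch = session.next_batch()
        if not batch:  # assumption: an empty batch means nothing is left
            break
        for item in batch:
            if is_relevant(item):
                session.judge_relevant(item)
                relevant_ids.append(item.doc_id)
    return relevant_ids
```

With the quickstart session, a call might look like `review(session, lambda item: "citrus" in item.doc_id)`. The `max_batches` cap keeps the loop bounded even if the session never returns an empty batch.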

Common Tasks

Build your own corpus

Use hical-build-corpus with JSONL, CSV, TSV, archive, or ir_datasets input. For working configs and sample inputs, see:

Use with ir_datasets

Build directly from a dataset id:

hical-build-ir-dataset --dataset-id cranfield --output ./cranfield.bin

Inspect fields first when the document type has multiple useful fields:

hical-build-ir-dataset --dataset-id beir/msmarco --list-fields

Then choose specific fields to combine:

hical-build-ir-dataset \
  --dataset-id beir/msmarco \
  --text-field title \
  --text-field text \
  --output ./msmarco.bin

For more, see docs/ir-datasets.md.
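
If you drive the build from a script, the repeated `--text-field` flag is easy to assemble programmatically. The sketch below only builds the argument list from the flags shown above; `ir_dataset_command` is a hypothetical helper, not part of hical.

```python
import shlex


def ir_dataset_command(dataset_id, output, text_fields=()):
    """Build the argument list for hical-build-ir-dataset."""
    argv = ["hical-build-ir-dataset", "--dataset-id", dataset_id]
    for field in text_fields:
        argv += ["--text-field", field]  # the flag is repeated once per field
    argv += ["--output", output]
    return argv


if __name__ == "__main__":
    argv = ir_dataset_command("beir/msmarco", "./msmarco.bin", ["title", "text"])
    print(shlex.join(argv))
    # To run the build: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems when dataset ids or paths contain unusual characters.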

Use the Python API

The main public entry points are:

  • hical.build_corpus
  • hical.build_ir_dataset_corpus
  • hical.inspect_ir_dataset
  • hical.build_fast_load_index
  • hical.open_dataset
  • dataset.start_session
  • session.next_batch

For the fuller dataset/session workflow, see docs/python-api.md.
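
As a quick post-install sanity check, you can confirm that the module-level names listed above are present. The sketch below is an illustration only: `missing_entry_points` is a hypothetical helper, and it checks only that each name exists and is callable, not what arguments it takes.

```python
# Module-level names from the list above (the dataset.* and session.*
# entries are methods on objects, so they are not checked here).
MODULE_ENTRY_POINTS = (
    "build_corpus",
    "build_ir_dataset_corpus",
    "inspect_ir_dataset",
    "build_fast_load_index",
    "open_dataset",
)


def missing_entry_points(module, names=MODULE_ENTRY_POINTS):
    """Return the listed names that are not callable attributes of module."""
    return [name for name in names if not callable(getattr(module, name, None))]
```

Usage would be `import hical` followed by `print(missing_entry_points(hical))`; an empty list means every listed name was found.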

Documentation

Supported Platforms

Published wheels are smoke-tested on:

  • Linux x86_64
  • macOS x86_64
  • macOS arm64
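
To check whether the current machine matches one of these platforms before installing, a minimal sketch using the standard `platform` module follows. `is_smoke_tested` is a hypothetical helper; it compares only the operating system and CPU architecture and does not check the glibc or macOS version a wheel requires.

```python
import platform

# (platform.system(), platform.machine()) pairs for the platforms listed above.
# macOS reports its system name as "Darwin".
SMOKE_TESTED = {
    ("Linux", "x86_64"),
    ("Darwin", "x86_64"),
    ("Darwin", "arm64"),
}


def is_smoke_tested(system=None, machine=None):
    """Return True if the given (or current) platform is in the smoke-tested set."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return (system, machine) in SMOKE_TESTED


if __name__ == "__main__":
    print(is_smoke_tested())
```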

Contributing

If you want to work on the repository internals rather than just use the Python package, start here:

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • hical-0.2.2-cp313-cp313-manylinux_2_28_x86_64.whl (13.0 MB): CPython 3.13, manylinux (glibc 2.28+), x86-64
  • hical-0.2.2-cp313-cp313-macosx_14_0_x86_64.whl (15.2 MB): CPython 3.13, macOS 14.0+, x86-64
  • hical-0.2.2-cp313-cp313-macosx_14_0_arm64.whl (15.1 MB): CPython 3.13, macOS 14.0+, ARM64
  • hical-0.2.2-cp312-cp312-manylinux_2_28_x86_64.whl (13.0 MB): CPython 3.12, manylinux (glibc 2.28+), x86-64
  • hical-0.2.2-cp312-cp312-macosx_14_0_x86_64.whl (15.2 MB): CPython 3.12, macOS 14.0+, x86-64
  • hical-0.2.2-cp312-cp312-macosx_14_0_arm64.whl (15.1 MB): CPython 3.12, macOS 14.0+, ARM64
  • hical-0.2.2-cp311-cp311-manylinux_2_28_x86_64.whl (13.0 MB): CPython 3.11, manylinux (glibc 2.28+), x86-64
  • hical-0.2.2-cp311-cp311-macosx_14_0_x86_64.whl (15.2 MB): CPython 3.11, macOS 14.0+, x86-64
  • hical-0.2.2-cp311-cp311-macosx_14_0_arm64.whl (15.1 MB): CPython 3.11, macOS 14.0+, ARM64
  • hical-0.2.2-cp310-cp310-manylinux_2_28_x86_64.whl (13.0 MB): CPython 3.10, manylinux (glibc 2.28+), x86-64
  • hical-0.2.2-cp310-cp310-macosx_14_0_x86_64.whl (15.2 MB): CPython 3.10, macOS 14.0+, x86-64
  • hical-0.2.2-cp310-cp310-macosx_14_0_arm64.whl (15.1 MB): CPython 3.10, macOS 14.0+, ARM64
