Project description

ICDR

Contrastive Data Retrieval with Inverted Indexes

Efficient approximate or precise retrieval of similar documents for fine-tuning language models. The library can be used to quickly create contrastive pairs or triplets from large document collections.

ICDR builds an inverted index and several fast lookup tables for retrieving similar texts from a corpus. The library is suited to entity matching, entity resolution, record linkage, and deduplication in NLP. It supports fast retrieval of similar, positive (i.e. matching), and negative (i.e. non-matching) text samples, which can be used directly or to fine-tune LLMs and other models.
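
To make the idea concrete, here is a minimal sketch of how an inverted index can drive the selection of contrastive triplets. It is not ICDR's API: the function names (`build_inverted_index`, `retrieve_similar`, `make_triplet`), the whitespace tokenizer, and the Jaccard scoring are all assumptions chosen for illustration, and the library's own data structures and similarity measure may differ.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace, deduplicate."""
    return set(text.lower().split())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed scoring choice)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_similar(query, docs, index, top_k=3):
    """Score only documents that share at least one token with the query."""
    q_tokens = tokenize(query)
    candidates = set()
    for token in q_tokens:
        candidates |= index.get(token, set())
    scored = [(jaccard(q_tokens, tokenize(docs[i])), i) for i in candidates]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by id
    return [(i, score) for score, i in scored[:top_k]]


def make_triplet(anchor_id, docs, index):
    """Return (anchor, positive, negative) ids, or None if either is missing.

    Positive: the most similar other document.
    Negative: the first document sharing no token with the anchor.
    """
    hits = [
        i
        for i, _ in retrieve_similar(docs[anchor_id], docs, index, top_k=len(docs))
        if i != anchor_id
    ]
    if not hits:
        return None
    sharing = set(hits)
    negatives = [i for i in range(len(docs)) if i != anchor_id and i not in sharing]
    if not negatives:
        return None
    return anchor_id, hits[0], negatives[0]


docs = [
    "apple iphone 13 128gb black",
    "iphone 13 apple 128gb",
    "samsung galaxy s21 ultra",
    "dell xps 13 laptop",
]
index = build_inverted_index(docs)
print(make_triplet(0, docs, index))
```

The point of the index is that only documents sharing at least one token with the query are ever scored, which keeps retrieval cheap on large collections. Picking a negative with no shared tokens gives an easy negative; harder negatives could instead be drawn from low-scoring candidates.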

Download files

Source Distribution

icdr-0.0.12.tar.gz (997.3 kB)

Uploaded Source

Built Distribution

icdr-0.0.12-py3-none-any.whl (1.0 MB)

Uploaded Python 3

File details

Details for the file icdr-0.0.12.tar.gz.

File metadata

  • Download URL: icdr-0.0.12.tar.gz
  • Upload date:
  • Size: 997.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for icdr-0.0.12.tar.gz
Algorithm Hash digest
SHA256 35d11cb0209dbcfce3d1ad770564ca042fef41a1810da1655efd8c471d6a7923
MD5 863a25cece61698928c6e840ed22f0e2
BLAKE2b-256 96bf445b5b0872539fcf21fa5513ca2a088d92f5b747fcea2fa3690ed7f2a1ac
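
A downloaded file can be checked against the SHA256 digest listed above before installing it. The sketch below uses only Python's standard `hashlib`; the helper names (`sha256_of_file`, `verify_download`) and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file's SHA256 digest matches the expected value."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Usage, assuming the source distribution was saved to the current directory:
# expected = "35d11cb0209dbcfce3d1ad770564ca042fef41a1810da1655efd8c471d6a7923"
# print(verify_download("icdr-0.0.12.tar.gz", expected))
```

Reading in chunks keeps memory use constant regardless of file size. The same check works for the wheel by substituting its file name and digest.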

File details

Details for the file icdr-0.0.12-py3-none-any.whl.

File metadata

  • Download URL: icdr-0.0.12-py3-none-any.whl
  • Upload date:
  • Size: 1.0 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for icdr-0.0.12-py3-none-any.whl
Algorithm Hash digest
SHA256 5e2730f159d0ef4076596bc6dd58aa1b1e1410c131154c59daac9d33ff0264ce
MD5 2292d3ea58e82fd0f5e0bf1a05044cd9
BLAKE2b-256 3ffe39a23118af217618aae03c9806585197ed5365b30035f9b4c22f5f0c21f3
