Perplexity filter for documents and bulk HTML and WARC boilerplate removal.

Project description

Pyplexity

This package provides a simple interface to apply perplexity filters to any document. Furthermore, it provides a WARC and HTML bulk processor, with distributed capabilities.

Usage example

Process a folder containing a dataset using a trigram model.

poetry build
pip3 install dist/pyplexity-0.1.31-py3-none-any.whl
pyplexity bulk-perplexity --perpl-model ../../clueweb-b13-rawtext2/trigrams_bnc.st --perpl-limit 8000.0 \
    --trigrams --base-dir ./cleaned_webkb --output-dir ./perpl_filtered_webkb
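
To illustrate what a perplexity filter with a trigram model does conceptually, here is a minimal, self-contained sketch. It is not Pyplexity's implementation or API: the class `ToyTrigramModel`, the helper `keep`, the add-one smoothing, and the toy corpus are all assumptions made for illustration. The `limit` argument plays a role analogous to the `--perpl-limit` value in the command above.

```python
import math
from collections import Counter


def trigrams(tokens):
    """Return the trigrams of a token list, padded with sentence markers."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


class ToyTrigramModel:
    """Hypothetical add-one-smoothed trigram model (illustration only)."""

    def __init__(self, corpus):
        self.tri = Counter()   # counts of full trigrams
        self.bi = Counter()    # counts of the two-token contexts
        self.vocab = {"</s>"}
        for sentence in corpus:
            tokens = sentence.lower().split()
            self.vocab.update(tokens)
            for gram in trigrams(tokens):
                self.tri[gram] += 1
                self.bi[gram[:2]] += 1

    def perplexity(self, text):
        """Perplexity of `text`: exp of the mean negative log-probability."""
        grams = trigrams(text.lower().split())
        v = len(self.vocab) + 1  # +1 reserves mass for unknown tokens
        log_prob = sum(
            math.log((self.tri[g] + 1) / (self.bi[g[:2]] + v)) for g in grams
        )
        return math.exp(-log_prob / len(grams))


def keep(model, text, limit):
    """Keep a document only if its perplexity does not exceed `limit`."""
    return model.perplexity(text) <= limit


if __name__ == "__main__":
    model = ToyTrigramModel(["the cat sat on the mat", "the dog sat on the rug"])
    for doc in ["the cat sat on the mat", "click here login cookie banner"]:
        print(doc, "->", "keep" if keep(model, doc, limit=6.0) else "drop")
```

Text that resembles the training corpus gets a low perplexity and is kept, while text the model finds surprising (often boilerplate or noise) scores higher and is dropped. A sensible limit depends on the model and the data, so the threshold used here is specific to this toy corpus.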
