Perplexity filter for documents and bulk HTML and WARC boilerplate removal.

Pyplexity

This package provides a simple interface for applying perplexity filters to any document. One possible use case is boilerplate removal. It also provides a bulk processor for WARC and HTML files, with distributed capabilities.

Models

The models are memory intensive but do not scale on CPU.

Model            RAM usage  Download size  Performance
bigrams-cord19   2 GB       230 MB         x
bigrams-bnc      5 GB       660 MB         x
trigrams-cord19  6.6 GB     1 GB           x
trigrams-bnc     14 GB      2.2 GB         x
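
The RAM figures in the table can guide model choice on a constrained machine. The following is a minimal sketch, not part of pyplexity: the dictionary and the function name `models_that_fit` are hypothetical helpers, and the numbers are simply the RAM usage values listed above.

```python
from typing import List

# RAM usage in GB, copied from the model table (not measured here).
MODEL_RAM_GB = {
    "bigrams-cord19": 2.0,
    "bigrams-bnc": 5.0,
    "trigrams-cord19": 6.6,
    "trigrams-bnc": 14.0,
}


def models_that_fit(available_ram_gb: float) -> List[str]:
    """Return the models whose listed RAM usage fits the budget, smallest first."""
    fitting = [
        (ram, name)
        for name, ram in MODEL_RAM_GB.items()
        if ram <= available_ram_gb
    ]
    return [name for _, name in sorted(fitting)]


# Hypothetical helper, for illustration only.
print(models_that_fit(8.0))
```

The returned names are the same strings accepted by the "--model" option.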

Installation process

python3 -m pip install pyplexity

Usage example

Compute perplexity from console

Use the "perplexity" command. The default model is bigrams-bnc; pass "--model" followed by a model name (for example, "--model bigrams-cord19") to use a different one. Documentation:

citius@pc:~$ pyplexity perplexity --help
Usage: pyplexity perplexity [OPTIONS] TEXT

Arguments:
  TEXT  [required]

Options:
  --model TEXT  [default: bigrams-bnc]
  --help        Show this message and exit.

By default, models are downloaded on first use and stored in ~/.cache/cached_path/, as described in the cached-path package documentation. Example:

citius@pc:~$ pyplexity perplexity "this is normal text"
downloading: 100%|##########| 660M/660M [00:11<00:00, 59.0MiB/s]
Loading model... Done.
1844.85540669094
citius@pc:~$ pyplexity perplexity "this is normal HTML PAGE BOI%& 678346 NOR  text"
Loading model... Done.
44787.99199563819
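
To illustrate why the noisy string scores so much higher, here is a toy bigram perplexity computation. It is a sketch under stated assumptions, not pyplexity's implementation: the tiny training corpus, whitespace tokenization, add-one smoothing, and the function names are all made up for the example, and the numbers it produces are not comparable to the output above.

```python
import math
from collections import Counter


def train_bigram_counts(corpus_sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def bigram_perplexity(text, contexts, bigrams, vocab):
    """Perplexity = exp of the negative mean log-probability of each bigram."""
    tokens = ["<s>"] + text.lower().split()
    n_bigrams = len(tokens) - 1
    if n_bigrams == 0:
        raise ValueError("text must contain at least one token")
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing: unseen bigrams get a small nonzero probability.
        prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / n_bigrams)


# Illustrative corpus only; real models are trained on BNC or CORD-19.
counts = train_bigram_counts(["this is normal text", "this is a normal sentence"])
clean = bigram_perplexity("this is normal text", *counts)
noisy = bigram_perplexity("this is normal HTML PAGE BOI%& 678346 NOR  text", *counts)
print(clean < noisy)
```

Word sequences the model has seen get high probability and therefore low perplexity, while markup debris and random tokens push the score up.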

Bulk perplexity computation and cleaning of a directory

Documentation:

citius@pc:~$ pyplexity bulk-perplexity --help
Usage: pyplexity bulk-perplexity [OPTIONS] INPUT_DIR

Arguments:
  INPUT_DIR  [required]

Options:
  --output-dir TEXT                [default: out_dir]
  --model TEXT                     [default: bigrams-bnc]
  --perpl-limit FLOAT              [default: 8000.0]
  --warc-input / --no-warc-input   [default: no-warc-input]
Distributed computing options:
  --distributed / --no-distributed [default: no-distributed]
  --n-workers INTEGER              [default: 1]
  --node INTEGER                   [default: 1]
  --port INTEGER                   [default: 8866]
  --help                           Show this message and exit.

The distributed computing options are covered in the "Distributed mode (cluster)" section. The input directory may contain nested subdirectories, which are processed recursively. The command accepts both WARC and raw text files. WARC containers and HTML files must first be tag-cleaned with the tag-remover command described in the next section. Example:

citius@pc:~$ pyplexity bulk-perplexity ./out_dir/ --output-dir cleaned_files --model bigrams-cord19
downloading: 100%|##########| 233M/233M [00:03<00:00, 63.3MiB/s] 
Loading model... Done.
Computed 1124 files in 0:00:01.905390.
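
The role of "--perpl-limit" can be pictured as a simple threshold filter. The sketch below is an assumption about the general technique, not the package's code: the function `filter_by_perplexity`, the choice of sentence-sized units, and the "less than or equal" comparison are hypothetical, and the scorer is a stand-in that looks up the two perplexity values printed in the console example earlier.

```python
from typing import Callable, List


def filter_by_perplexity(
    units: List[str],
    score: Callable[[str], float],
    perpl_limit: float = 8000.0,
) -> List[str]:
    """Keep only the text units whose perplexity does not exceed perpl_limit."""
    return [unit for unit in units if score(unit) <= perpl_limit]


# Stand-in scorer: the two values reported by the "perplexity" command above.
EXAMPLE_SCORES = {
    "this is normal text": 1844.85540669094,
    "this is normal HTML PAGE BOI%& 678346 NOR  text": 44787.99199563819,
}

kept = filter_by_perplexity(list(EXAMPLE_SCORES), EXAMPLE_SCORES.__getitem__)
print(kept)
```

With the default limit of 8000.0, only the first string survives; lowering the limit makes the filter stricter.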

Perform HTML tag cleaning of a directory

Documentation:

citius@pc:~$ pyplexity tag-remover --help
Usage: pyplexity tag-remover [OPTIONS] BASE_DIR

Arguments:
  BASE_DIR  [required]

Options:
  --output-dir TEXT                [default: out_dir]
  --warc-input / --no-warc-input   [default: no-warc-input]
Distributed computing options:
  --distributed / --no-distributed [default: no-distributed]
  --n-workers INTEGER              [default: 1]
  --node INTEGER                   [default: 1]
  --port INTEGER                   [default: 8866]
  --help                           Show this message and exit.

The distributed computing options are covered in the "Distributed mode (cluster)" section. The input directory may contain nested subdirectories, which are processed recursively. The command accepts HTML files or WARC files. For WARC input, it strips out all the tags and then recompresses the WARC efficiently. Example:

citius@pc:~$ pyplexity tag-remover ./html_source --output-dir ./output
Computed 1124 files in 0:00:00.543175.
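
For readers unfamiliar with tag cleaning, the idea can be sketched with Python's standard library. This is an illustration only and makes no claim about how tag-remover works internally: the class `TextExtractor` and the function `strip_tags` are hypothetical names, and skipping script and style contents is an assumption of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html: str) -> str:
    """Return the visible text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><h1>Title</h1><script>var x = 1;</script>"
    "<p>Some <b>bold</b> text.</p></body></html>"
)
print(strip_tags(sample))
```

Text cleaned this way still contains navigation menus and other boilerplate, which is what the perplexity filter is then meant to remove.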

Distributed mode (cluster)

Interfacing from Python

Building the package

git clone https://github.com/citiususc/pyplexity && cd pyplexity
curl -sSL https://install.python-poetry.org | python3 -
export PATH="$HOME/.local/bin:$PATH"
poetry build
pip3 install dist/pyplexity-X.X.X-py3-none-any.whl
