![](https://i.ibb.co/sCJXhmz/header-sp.png) ![](https://img.shields.io/apm/l/vim-mode.svg)

# gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.

### Prerequisites

The `nltk` package.

### Installing

`[sudo] pip install gutenberg-cleaner`

## How to use it?

The package provides two functions, `simple_cleaner` and `super_cleaner`.

### simple_cleaner

Removes only the lines that are part of the Project Gutenberg header or footer. It does not go deeper into the text to remove other things such as titles or footnotes.

`simple_cleaner(book: str) -> str`

### super_cleaner

Cleans the book aggressively (titles, footnotes, images, book information, etc.). It may delete some good lines too.

`super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str`

  • `min_token`: The minimum number of tokens in a paragraph that is not a "dialog" or "quote". -1 means don't tokenize the text (faster, but less effective cleaning).
  • `max_token`: The maximum number of tokens in a paragraph.
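Both functions take the whole book as one string and return the cleaned string, so they can be wrapped in a small file-to-file helper. The sketch below is only an illustration: `clean_file` is a hypothetical helper that is not part of the package, and the import path shown in the comments is an assumption based on the distribution name.

```python
from pathlib import Path
from typing import Callable


def clean_file(src: Path, dst: Path, cleaner: Callable[[str], str]) -> str:
    """Read a book from `src`, pass it through `cleaner`, write the result to `dst`.

    `cleaner` is any function with the shape `str -> str`, which matches the
    documented signatures of simple_cleaner and super_cleaner.
    """
    book = src.read_text(encoding="utf-8")
    cleaned = cleaner(book)
    dst.write_text(cleaned, encoding="utf-8")
    return cleaned


# Possible usage with the package installed (import path is an assumption):
#
#   from functools import partial
#   from gutenberg_cleaner import simple_cleaner, super_cleaner
#
#   clean_file(Path("book.txt"), Path("book.simple.txt"), simple_cleaner)
#   clean_file(Path("book.txt"), Path("book.super.txt"),
#              partial(super_cleaner, min_token=-1))  # skip tokenization
```

Passing the cleaner as an argument keeps the helper independent of which of the two functions, and which token limits, are used.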

Deleted paragraphs are marked with: [deleted]
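If the markers are not wanted in the final text, they can be stripped in a post-processing step. The following is a minimal sketch, assuming each deleted paragraph is replaced by the literal string `[deleted]` on its own and that paragraphs are separated by blank lines. `drop_deleted` is a hypothetical helper, not part of the package.

```python
import re
from typing import Tuple

DELETED_MARK = "[deleted]"  # marker named in this README


def drop_deleted(cleaned: str) -> Tuple[str, int]:
    """Remove marker-only paragraphs and report how many were removed.

    Assumes paragraphs are separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", cleaned)
    removed = sum(1 for p in paragraphs if p.strip() == DELETED_MARK)
    kept = [p for p in paragraphs if p.strip() and p.strip() != DELETED_MARK]
    return "\n\n".join(kept), removed
```

The returned count gives a rough sense of how aggressive a given `min_token` / `max_token` setting was on a particular book.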

## Author

  • Peyman Mohseni kiasari

## License

This project is licensed under the MIT License; see the [LICENSE.md](LICENSE.md) file for details.
