Cleans Project Gutenberg dataset books.

![](https://i.ibb.co/sCJXhmz/header-sp.png) ![](https://img.shields.io/apm/l/vim-mode.svg)

# gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.

### Prerequisites

The nltk package.

### Installing

` [sudo] pip install gutenberg-cleaner `

## How to use it?

It has two functions, `simple_cleaner` and `super_cleaner`.

### simple_cleaner

Removes only the lines that are part of the Project Gutenberg header or footer. It does not go deeper into the text to remove other things such as titles or footnotes.

` simple_cleaner(book: str) -> str `

### super_cleaner

Cleans the book aggressively (titles, footnotes, images, book information, etc.). It may delete some good lines too.

` super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str `

  • `min_token`: the minimum number of tokens in a paragraph that is not “dialog” or “quote”. -1 means don't tokenize the text (so it will be faster, but the cleaning will be less effective).
  • `max_token`: the maximum number of tokens in a paragraph.
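As a rough illustration of how either function could be applied to a book stored on disk, here is a minimal sketch. The helper `clean_book_file` and the file names are hypothetical, and the import path `gutenberg_cleaner` shown in the comments is an assumption not confirmed by this README. The helper accepts any `str -> str` callable, so it works with both functions.

```python
from pathlib import Path
from typing import Callable


def clean_book_file(src: Path, dst: Path, cleaner: Callable[[str], str]) -> str:
    """Read a book from src, run cleaner on its text, and write the result to dst.

    Hypothetical helper, not part of the package.
    """
    text = src.read_text(encoding="utf-8")
    cleaned = cleaner(text)
    dst.write_text(cleaned, encoding="utf-8")
    return cleaned


# Intended use (assumes the module is importable as `gutenberg_cleaner`):
#   from gutenberg_cleaner import simple_cleaner, super_cleaner
#   clean_book_file(Path("book.txt"), Path("book.simple.txt"), simple_cleaner)
#   clean_book_file(
#       Path("book.txt"),
#       Path("book.super.txt"),
#       lambda t: super_cleaner(t, min_token=5, max_token=600),
#   )
```

Passing the cleaner as an argument keeps the file handling separate from the choice between the lighter and the more aggressive cleaning.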

Deleted paragraphs are marked with: [deleted]
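If the markers are not wanted in the final text, they can be filtered out afterwards. The sketch below assumes that paragraphs are separated by blank lines and that each deleted paragraph is replaced by the marker alone; neither detail is stated in this README, and both helper functions are hypothetical.

```python
DELETED_MARKER = "[deleted]"


def count_deleted(cleaned: str, marker: str = DELETED_MARKER) -> int:
    """Count paragraphs that consist only of the marker."""
    return sum(1 for p in cleaned.split("\n\n") if p.strip() == marker)


def drop_deleted_paragraphs(cleaned: str, marker: str = DELETED_MARKER) -> str:
    """Return the text without paragraphs that consist only of the marker.

    Assumes blank-line-separated paragraphs (an assumption, see above).
    """
    kept = [p for p in cleaned.split("\n\n") if p.strip() != marker]
    return "\n\n".join(kept)
```

Counting the markers before dropping them gives a quick sense of how much text a given `min_token` / `max_token` setting removed.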

## Author

  • Peyman Mohseni kiasari

## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details
