
cleans gutenberg dataset books


![](https://i.ibb.co/sCJXhmz/header-sp.png) ![](https://img.shields.io/apm/l/vim-mode.svg)

# gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.

### Prerequisites

The `nltk` package.

### Installing

`[sudo] pip install gutenberg-cleaner`

## How to use it?

The package provides two functions, `simple_cleaner` and `super_cleaner`.

### simple_cleaner

Removes only the lines that are part of the Project Gutenberg header or footer. It does not go deeper into the text to remove other things such as titles or footnotes.

`simple_cleaner(book: str) -> str`

### super_cleaner

Cleans the book more aggressively (titles, footnotes, images, book information, etc.). It may delete some good lines too.

`super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str`

  • `min_token`: the minimum number of tokens in a paragraph that is not a “dialog” or “quote”. -1 means the text is not tokenized, so cleaning is faster but less effective.
  • `max_token`: the maximum number of tokens in a paragraph.
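As a rough usage sketch, the two functions can be applied to a book loaded from disk. The import name `gutenberg_cleaner` is an assumption inferred from the wheel filename, and `load_cleaners` and `clean_file` are hypothetical helpers introduced only for this illustration; they are not part of the package.

```python
from pathlib import Path
from typing import Callable, Tuple


def load_cleaners() -> Tuple[Callable[..., str], Callable[..., str]]:
    """Import the two cleaners lazily.

    Assumption: the installed module is named `gutenberg_cleaner`
    (inferred from the wheel filename, not stated in this README).
    """
    from gutenberg_cleaner import simple_cleaner, super_cleaner
    return simple_cleaner, super_cleaner


def clean_file(path: str, cleaner: Callable[[str], str]) -> str:
    """Read a UTF-8 text file and return the text passed through `cleaner`."""
    book = Path(path).read_text(encoding="utf-8")
    return cleaner(book)


# Example calls (need the package installed and a local book file):
#
#   simple_cleaner, super_cleaner = load_cleaners()
#   light = clean_file("book.txt", simple_cleaner)
#   deep = clean_file(
#       "book.txt",
#       lambda book: super_cleaner(book, min_token=5, max_token=600),
#   )
```

Passing the cleaner as an argument keeps the file handling separate from the choice between the light and the aggressive cleaning mode.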

It marks deleted paragraphs with: [deleted]
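If the markers are not wanted in the final text, they can be filtered out afterwards. The sketch below assumes that each `[deleted]` marker stands alone as its own paragraph, separated from its neighbours by blank lines; that layout is an assumption, and `drop_deleted` and `count_deleted` are hypothetical helpers, not functions of the package.

```python
import re

DELETED_MARKER = "[deleted]"  # marker text as given in this README


def _paragraphs(text: str) -> list:
    """Split text into paragraphs on one or more blank lines."""
    return re.split(r"\n\s*\n", text)


def count_deleted(cleaned: str, marker: str = DELETED_MARKER) -> int:
    """Count paragraphs that consist only of the marker."""
    return sum(1 for p in _paragraphs(cleaned) if p.strip() == marker)


def drop_deleted(cleaned: str, marker: str = DELETED_MARKER) -> str:
    """Return the text without marker-only or empty paragraphs."""
    kept = [
        p for p in _paragraphs(cleaned)
        if p.strip() and p.strip() != marker
    ]
    return "\n\n".join(kept)
```

Counting the markers before dropping them gives a quick indication of how much text was removed from a given book.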

## Author

  • Peyman Mohseni kiasari

## License

This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.


Files for gutenberg-cleaner, version 0.1.0

| Filename | Size | File type | Python version |
|---|---|---|---|
| gutenberg_cleaner-0.1.0-py3-none-any.whl | 3.6 kB | Wheel | py3 |
| gutenberg_cleaner-0.1.0.tar.gz | 2.3 kB | Source | None |
