# gutenberg-cleaner

A Python package for cleaning Project Gutenberg books and datasets.
### Prerequisites

Requires the `nltk` package.

### Installing

```
[sudo] pip install gutenberg-cleaner
```
## How to use it?
It has two methods: `simple_cleaner` and `super_cleaner`.

### simple_cleaner

Removes only the lines that are part of the Project Gutenberg header or footer. It does not go deeper into the text to remove other elements such as titles, footnotes, etc.

```
simple_cleaner(book: str) -> str
```

### super_cleaner

Cleans the book thoroughly (titles, footnotes, images, book information, etc.), though it may delete some good lines too.

```
super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str
```

- `min_token`: the minimum number of tokens for a paragraph that is not a "dialog" or "quote"; `-1` means don't tokenize the text (faster, but less thorough cleaning).
- `max_token`: the maximum number of tokens for a paragraph.

Deleted paragraphs are marked with `[deleted]`.
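To make the header/footer stripping concrete, here is a minimal, dependency-free sketch of the kind of cleaning `simple_cleaner` performs. This is an illustration, not the library's actual implementation; it assumes the standard Project Gutenberg `*** START OF` / `*** END OF` marker lines and a hypothetical helper name `strip_gutenberg_boilerplate`.

```python
# Simplified sketch of Gutenberg header/footer removal (not the library's code).
# The "*** START OF" / "*** END OF" strings are the standard Project Gutenberg
# delimiters that separate boilerplate from the book body.

def strip_gutenberg_boilerplate(book: str) -> str:
    """Return only the text between the START and END marker lines."""
    lines = book.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.lstrip().startswith("*** START OF"):
            start = i + 1  # the body begins after the START marker line
        elif line.lstrip().startswith("*** END OF"):
            end = i        # the body ends before the END marker line
            break
    return "\n".join(lines[start:end]).strip()


raw = """Project Gutenberg header text...
*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
Call me Ishmael.
*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***
Project Gutenberg footer text..."""

print(strip_gutenberg_boilerplate(raw))  # -> Call me Ishmael.
```

With the real package you would instead call `simple_cleaner(raw)` (or `super_cleaner(raw)` for deeper cleaning) after installing `gutenberg-cleaner`.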
## Author

- Peyman Mohseni kiasari
## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.