Cleans Gutenberg dataset books.
![](https://i.ibb.co/sCJXhmz/header-sp.png) ![](https://img.shields.io/apm/l/vim-mode.svg)
# gutenberg-cleaner
A Python package for cleaning Project Gutenberg books and datasets.
### Prerequisites

The `nltk` package.

### Installing

`[sudo] pip install gutenberg-cleaner`
## How to use it?
The package provides two functions, `simple_cleaner` and `super_cleaner`.

### simple_cleaner

Removes only the lines that are part of the Project Gutenberg header or footer. It does not go deeper into the text to remove other things such as titles or footnotes.

`simple_cleaner(book: str) -> str`

### super_cleaner

Cleans the book more aggressively (titles, footnotes, images, book information, etc.). It may delete some good lines too.

`super_cleaner(book: str, min_token: int = 5, max_token: int = 600) -> str`

`min_token`: the minimum number of tokens in a paragraph that is not a "dialog" or "quote". `-1` means don't tokenize the text, which is faster but cleans less effectively.

`max_token`: the maximum number of tokens in a paragraph.

Deleted paragraphs are marked with `[deleted]`.
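As a minimal sketch of how the two functions might be called and how the `[deleted]` marker could be handled afterwards: the import name `gutenberg_cleaner` is an assumption based on the distribution name, and the helpers `clean_book` and `drop_deleted` are hypothetical names that are not part of the package. The sketch also assumes that paragraphs are separated by blank lines.

```python
MARKER = "[deleted]"  # marker used for deleted paragraphs


def clean_book(text: str, aggressive: bool = False) -> str:
    """Clean one book with simple_cleaner, or super_cleaner if aggressive."""
    # Assumption: the import name matches the distribution name.
    # Imported lazily so this sketch loads even without the package installed.
    from gutenberg_cleaner import simple_cleaner, super_cleaner

    if aggressive:
        # Documented default values, spelled out for clarity.
        return super_cleaner(text, min_token=5, max_token=600)
    return simple_cleaner(text)


def drop_deleted(cleaned: str, marker: str = MARKER) -> str:
    """Remove paragraphs that consist only of the deletion marker.

    Assumption: paragraphs are separated by blank lines.
    """
    paragraphs = cleaned.split("\n\n")
    kept = [p for p in paragraphs if p.strip() != marker]
    return "\n\n".join(kept)
```

A typical call would read a book into a string, pass it to `clean_book(text, aggressive=True)`, and then run `drop_deleted` on the result if the markers are not wanted in the final text.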
## Author
Peyman Mohseni kiasari
## License
This project is licensed under the MIT License. See the [LICENSE.md](LICENSE.md) file for details.
Hashes for gutenberg_cleaner-0.1.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0166df88c800346df7948c56d48803a2976f4b8b2a2ddabcb04ab9fecb496dfc
MD5 | 73ca0b57b0cbe8789d17cb97cf7fadc3
BLAKE2b-256 | 4d25a13a1f8c6d5e13b0d0761be0babe4faed6b19d6a3d4c830cf73806ece1e9