Python package for converting xml and epubs to text files
epub conversion
---------------
Create text corpora from epubs and wiki dumps.
This is a Python package with a Converter that turns epub and XML (wiki dump) files into text, lines, or Python generators.
Usage:
------
### Epub usage
#### Book by book
To convert epubs to text files, first create a converter object:

    # assumes Converter is exported at the package top level
    from epub_conversion import Converter

    converter = Converter("my_ebooks_folder/")
Then use this converter to concatenate all the text within the ebooks into a single text file:

    converter.convert("my_succinct_text_file.gz")
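To inspect the resulting file, a minimal sketch follows. It assumes the output is a gzip-compressed UTF-8 text file, as the `.gz` extension suggests; the helper name `iter_corpus_lines` is hypothetical and not part of `epub_conversion`.

```python
import gzip


def iter_corpus_lines(path, encoding="utf-8"):
    """Yield lines from a gzip-compressed text file, without trailing newlines.

    Assumption: the corpus is plain text compressed with gzip.
    """
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Iterating over `iter_corpus_lines("my_succinct_text_file.gz")` reads the corpus lazily, so the whole file never needs to fit in memory.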
#### Line by line
You can also proceed line by line:

    from epub_conversion.utils import open_book, convert_epub_to_lines

    book = open_book("twilight.epub")
    lines = convert_epub_to_lines(book)
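For intuition about what turning an epub into lines involves, here is a rough standard-library sketch. It is not how `epub_conversion` is implemented; it only relies on the fact that an epub is a zip archive of (X)HTML documents. The names `epub_to_lines` and `_TextCollector` are hypothetical.

```python
import zipfile
from html.parser import HTMLParser

# Tags that usually end a visual line; treated as line breaks in this sketch.
_BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def epub_to_lines(epub_file):
    """Yield non-empty text lines from the (X)HTML members of an epub archive."""
    with zipfile.ZipFile(epub_file) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            parser = _TextCollector()
            parser.feed(archive.read(name).decode("utf-8", errors="replace"))
            parser.close()
            for line in "".join(parser.chunks).splitlines():
                line = line.strip()
                if line:
                    yield line
```

This sketch walks the archive in storage order rather than the reading order declared in the epub's package document, so chapters may come out in a different sequence than in the book.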
### Wikidump usage
#### Redirections
Suppose you are interested in all redirections in a given Wikipedia dump file that is still compressed. You can open the dump as follows:

    import epub_conversion.wiki_decoder

    wiki = epub_conversion.wiki_decoder.almost_smart_open("enwiki.bz2")
Taking this dump as our **input**, let us now use a generator to collect all pairs of `title` and `redirection title` in this dump:

    redirections = {
        redirect_from: redirect_to
        for redirect_from, redirect_to in epub_conversion.wiki_decoder.get_redirection_list(wiki)
    }
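To illustrate what redirect extraction looks like at the XML level, the sketch below streams a MediaWiki export with the standard library only. It is an independent illustration, not the code behind `get_redirection_list`. It assumes the usual export layout, where each `<page>` holds a `<title>` and, for redirects, a `<redirect title="..."/>` element. The name `iter_redirects` is hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_redirects(xml_file):
    """Yield (redirect_from, redirect_to) pairs from a MediaWiki XML export."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if _local(elem.tag) != "page":
            continue
        title, target = None, None
        for child in elem:
            name = _local(child.tag)
            if name == "title":
                title = child.text
            elif name == "redirect":
                target = child.get("title")
        if title is not None and target is not None:
            yield title, target
        elem.clear()  # drop the finished page's children to limit memory use
```

Because `bz2.open` returns a file object, a compressed dump can be passed in directly, for example `with bz2.open("enwiki.bz2", "rb") as dump: redirections = dict(iter_redirects(dump))`.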
#### Page text
Suppose you are interested only in the lines within each page's text section:

    # process_line is your own function
    for line in epub_conversion.wiki_decoder.convert_wiki_to_lines(wiki):
        process_line(line)
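The `process_line` function above is a placeholder for your own code. One hypothetical definition, shown only as an example, accumulates word frequencies across all lines:

```python
import re
from collections import Counter

# Running word frequencies across every processed line.
word_counts = Counter()
_WORD = re.compile(r"\w+")


def process_line(line):
    """Add the lowercased words of one line to the running word counts."""
    word_counts.update(_WORD.findall(line.lower()))
```

After the loop finishes, `word_counts.most_common(10)` would list the ten most frequent words in the dump's page text.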
See Also:
---------
* [Wikipedia NER](https://github.com/JonathanRaiman/wikipedia_ner): a Python module that uses `epub_conversion` to process Wikipedia dumps and output only the lines that contain page-to-page links, with the link anchor texts extracted and all markup removed.